A METHODOLOGY OF ERROR DETECTION:
IMPROVING SPEECH RECOGNITION IN RADIOLOGY
by
Kimberly Dawn Voll
B.A., Simon Fraser University, 2001
a thesis submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
in the School
of
Computing Science
© Kimberly Dawn Voll 2006
SIMON FRASER UNIVERSITY
Spring 2006
All rights reserved. This work may not be
reproduced in whole or in part, by photocopy
or other means, without the permission of the author.
APPROVAL
Name: Kimberly Dawn Voll
Degree: Doctor of Philosophy
Title of thesis: A Methodology of Error Detection: Improving Speech Recognition in Radiology
Examining Committee: Dr. Bob Hadley, Chair
Dr. Veronica Dahl, Senior Supervisor
Professor, School of Computing Science, SFU
Dr. Stella Atkins, Supervisor
Professor, School of Computing Science, SFU
Dr. Fred Popowich, Supervisor
Professor, School of Computing Science, SFU
Dr. Bruce Forster, Supervisor
Associate Professor, Department of Radiology, UBC
Dr. Maite Taboada, SFU Examiner
Assistant Professor, Department of Linguistics, SFU
Dr. Janice Glasgow, External Examiner
Professor, School of Computing, Queen’s University
Date Approved:
Abstract
Automated speech recognition (ASR) in radiology report dictation demands highly accurate
and robust recognition software. Despite vendor claims, current implementations are suboptimal, leading to poor accuracy and to time and money wasted on proofreading. Thus,
other methods must be considered for increasing the reliability and performance of ASR
before it is a viable alternative to human transcription. One such method is post-ASR error
detection, used to recover from the inaccuracy of speech recognition. This thesis proposes
that detecting and highlighting errors, or areas of low confidence, in a machine-transcribed
report allows the radiologist to proofread more efficiently. This, in turn, restores the benefits
of ASR in radiology, including efficient report handling and resource utilization.
To this end, an objective classification of error-detection methods for ASR is established.
Under this classification, a new theory of error detection in ASR is derived from the hybrid
application of multiple error-detection heuristics. This theory is contingent upon the type of
recognition errors and the complementary coverage of the heuristics. Inspired by these prin-
ciples, a hybrid error-detection application is developed as proof of concept. The algorithm
relies on four separate artificial-intelligence heuristics that together cover semantic, syntactic, and structural error types, developed with the help of 2700 anonymised reports obtained
from a local radiology clinic. Two heuristics involve statistical modeling: pointwise mutual
information and co-occurrence analysis. The remaining two are non-statistical techniques: a
property-based, constraint-handling-rules grammar, and a conceptual distance metric rely-
ing on the ontological knowledge in the Unified Medical Language System. When the hybrid
algorithm is applied to thirty real-world radiology reports, the results are encouraging: up
to a 24% increase in recall and an 8% increase in precision
over the best single technique. In addition, the resulting algorithm is efficient and modular.
Also investigated is the development necessary to turn the hybrid algorithm into a real-
world application suitable for clinical deployment. Finally, as part of an investigation of
future directions for this research, the greater context of these contributions is demonstrated,
including two applications of the hybrid method in cognitive science and machine learning.
Keywords
medical informatics, automatic speech recognition, natural language processing, hybrid
error detection, computer-assisted editing, radiology reporting
To Curiosity...
“Not all who wander are lost.”
— J.R.R. Tolkien
Acknowledgments
The road was longer and harder than it promised, but in the end I persevered. For those
who have helped me along the way, know you have a place in my heart forever warmed by
my gratitude.
So here I say thank you to...
• The Sun Hang Do family, for the many years of stress relief, good times, and friendship.
In particular, I would like to thank Grand Master Kang, Mrs. Kang, the Janzen brothers,
the Fisher and Tsui family, as well as Zofia, Tammy, Richard, Kelvin, Anna, Annie
and the entire Coquitlam “gang”.
• The NSERC Postgraduate Awards Program and Simon Fraser University for ensuring
the best possible funding throughout my graduate career.
• Phinished.org, and in particular, Tom and my fellow “phinishers”.
• The computing science office and tech staff: we’d be lost without you guys. An extra
special thank you goes out to Val Galat for her kindness and swift E-mail skills, which
both contributed to the preservation of my sanity.
• Glendon, my Unix/Mac guru, thank you for your endless patience.
• The Spring 2005 COGS 100 class; you guys rocked.
• Ken MacAllister, for your useful comments on early portions of this work.
• The “Logic and Functional Programming Lab” as well as the “Natural Language Lab”
for your constructive comments, encouragement, and endless supply of fascinating
conversation. Thanks, in particular, go to Maryam, Dulce, Baohua, Jiang, Chris, and
Wendy.
• Dr. Diana Cukierman, for finding the time to help with my formalization and deeper
understanding of set theory, despite being up to your eyeballs in your own work.
• The Canada Diagnostic Centre, for welcoming me into your clinic, and sharing with
me your resources.
The following professors deserve special mention for their patience, guidance, and support,
but most importantly their kindness. You have all helped build a more capable and confident
researcher:
• Dr. Bob Hadley and Dr. Bill Havens, for your help in guiding me down the path of
research.
• Dr. Nancy Hedberg, for your unending enthusiasm and help over the years.
• Dr. Maite Taboada, for your gentle, always-helpful advice over the years. I’m hon-
oured to have you as my internal examiner.
• Dr. Janice Glasgow, for flying all the way out here to sit on my examining committee and
for your thoughtful and kind comments about my work.
And in particular, my supervisory committee:
• Dr. Stella Atkins, for encouraging and challenging me from the very moment we met.
I would not be here today if it were not for your medical computing class.
• Dr. Bruce Forster, my radiology expert, for the fascinating conversations, support,
and kindness. Thank you for taking the time to show me the world of radiology.
• Dr. Fred Popowich, for your energy, your incredibly positive attitude, and most
importantly your faith in me.
• Dr. Veronica Dahl, for your mentorship, support, and friendship that saw me through
the many ‘ups’ and ‘downs’ of my graduate career. I never doubted that you cared.
I wish to say thank you to my wonderful support network of friends:
• The “shore-line gang”, starring in alphabetical order: Annavie, Benny, Carl, Chantel,
David, Eileen, Eric, Kyle, Liz, and Patrick. For all the fun over the years.
• Catriona, for the runs, the coffee breaks, and all the wonderful company. I am glad
to call you my friend.
• Aki, for all the great MSN chats, both goofy and serious. I’m so happy that we are
back in touch.
• Alma (“Dr. Clam”), my sweet and fun-loving friend, for all the fun, advice, encour-
agement, and commiseration.
• Mark, my academic kindred spirit, for the long walks and the long talks on just about
anything.
• Rob, for “forcing” me to play all those board games (thanks, dude).
• My dear friend Katie and her fantastic wife, Krista, for being my official “stress-relief
committee”; Katie, thank you for the many years of great friendship.
• Chris, my very dear friend, whose shoulder was always there when I most needed it
(and who frightens me in his keen understanding of my twisted sense of humour), for
just being you.
I thank my family for their patience, encouragement, and endless faith in me:
• Jennifer and David, for the great company and unwavering support.
• Fiona and Rob, for everything (but most importantly the lattes). You guys are the
best.
• Carolynn (and her beautiful family), for all the laughs and for all the love. You may
not be my sister by family, but you are by choice.
• My grandparents, for your love and support over the years.
• Brian, for being such a cool guy, but more importantly, for being not only my brother
but my friend. (P.S. Mom still loves me more.)
In particular, I am blessed with three wonderful parents, whom I wish to thank for the gift of life, love, friendship, good sense, and good times:
• Dad, for always giving me everything, even in the face of adversity. I am proud to be
your daughter.
• Denny, for the love, support, and endless laughs. I am honoured to call you my
stepdad: how many daughters have two amazing dads who equally brighten their life?
• Mom, my angel, my confidant, my strongest supporter, I give you this quote:
“A mother is the truest friend we have, when trials heavy and sudden, fall
upon us; when adversity takes the place of prosperity; when friends who
rejoice with us in our sunshine desert us; when trouble thickens around us,
still will she cling to us, and endeavor by her kind precepts and counsels to
dissipate the clouds of darkness, and cause peace to return to our hearts.”
— Washington Irving.
And finally...
• Ian, the most patient of them all... thank you for everything.
Contents
Approval ii
Abstract iii
Dedication v
Quotation vi
Acknowledgments vii
Contents xi
List of Tables xvi
List of Figures xvii
1 The Thesis 1
1.1 The Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Main Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Introduction to the Primary Research Problem . . . . . . . . . . . . . . . . . 4
1.4.1 ASR in the Reading Room . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Extant Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.6 Beyond Radiology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.7 Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.8 Canonical Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 An Introduction to Medical Language Processing 10
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.1 Medical Language Processing . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 General Challenges in MLP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Medical Language Processing in Radiology . . . . . . . . . . . . . . . . . . . . 14
2.3.1 The Radiology Environment . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 The Radiology Report . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.3 Improving Radiology Reporting . . . . . . . . . . . . . . . . . . . . . . 16
2.3.4 Automated Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 Natural Language Understanding in Medicine . . . . . . . . . . . . . . . . . . 20
2.5 The Needs of the Radiologist . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5.1 Limitations of an Imperfect System . . . . . . . . . . . . . . . . . . . 21
2.6 Pushing the State of the Art . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.6.1 Overcoming Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3 A Classification of Error-Detection Methods 24
3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.1.1 The Stages of Error Handling in Speech Recognition . . . . . . . . . . 24
3.1.2 On the Nature of Recognition Errors . . . . . . . . . . . . . . . . . . . 25
3.2 A Brief Introduction to Automatic Speech Recognition . . . . . . . . . . . . . 28
3.2.1 Recognizing Human Speech . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.2 Natural Language Understanding . . . . . . . . . . . . . . . . . . . . . 31
3.3 Confidence Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 A Classification of Error-Detection Methods for Speech Recognition . . . . . 33
3.4.1 The Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5 Non-Black-Box Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5.1 Probabilistic Approaches . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.5.2 Non-Probabilistic Approaches . . . . . . . . . . . . . . . . . . . . . . . 36
3.5.3 Hybrid Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.6 Black-Box Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.6.1 Probabilistic Approaches . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.6.2 Non-Probabilistic Approaches . . . . . . . . . . . . . . . . . . . . . . . 42
3.6.3 Hybrid Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.7 A Note on Stop Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4 A Conceptual Model 47
4.1 The General Idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Introducing A Hybrid Approach to Error Detection . . . . . . . . . . . . . . . 49
4.3 A Note on the Measure of Correctness . . . . . . . . . . . . . . . . . . . . . . 53
4.4 The Error-Detection Heuristics . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4.1 Semantic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4.2 Syntactic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4.3 Word Occurrence Probabilities and “N-gram” Models . . . . . . . . . 59
4.5 A Formalization of the Hybrid Approach to Error Detection in Radiology . . . . . . 62
4.5.1 General Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.5.2 The Error-Detection Algorithm . . . . . . . . . . . . . . . . . . . . . . 69
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5 Experimental Evidence 75
5.1 Introduction to Proposed System . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.1.1 Materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.2.1 Modular Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.2.2 Calculating Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.2.3 Aligning the Source and Output: Recognition Errors . . . . . . . . . . 78
5.2.4 Calculating Co-Occurrences . . . . . . . . . . . . . . . . . . . . . . . . 79
5.2.5 The Error-Detection Algorithms . . . . . . . . . . . . . . . . . . . . . 80
5.2.6 Conceptual Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.2.7 Semantic Grammar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.2.8 Syntactic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.2.9 Word Occurrence Probabilities . . . . . . . . . . . . . . . . . . . . . . 89
5.2.10 Comparing Co-occurrence Analysis and PMI . . . . . . . . . . . . . . 100
5.3 A Hybrid Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6 Observations and Corollaries 103
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.2 The Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.2.1 The Hybrid Error-Detection Methodology . . . . . . . . . . . . . . . . 103
6.2.2 On the Nature of Report Errors . . . . . . . . . . . . . . . . . . . . . 105
6.2.3 General Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.3 From a Radiologist’s Perspective . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.4 A Critical Look at the Hybrid Error-Detection Methodology . . . . . . . . . . . . . 109
6.4.1 Challenges Facing the Hybrid Methodology . . . . . . . . . . . . . . . 109
6.4.2 Challenges Facing the Current Implementation . . . . . . . . . . . . . 112
6.5 Corollaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.5.1 Immediate Implications . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.5.2 Implications for Future Study . . . . . . . . . . . . . . . . . . . . . . . 114
6.6 A Standalone Application for the Radiology Workstation . . . . . . . . . . . 115
6.6.1 Steps to an Independent System . . . . . . . . . . . . . . . . . . . . . 115
6.6.2 User Interface for the Hybrid Error-Detection System . . . . . . . . . 117
6.6.3 Miscellaneous Requirements . . . . . . . . . . . . . . . . . . . . . . . . 118
6.7 Measuring the Real-World Success of the System . . . . . . . . . . . . . . . . 118
6.8 Data Sparseness: Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.9 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.9.1 The Full System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.9.2 Immediate Extensions: Improving the Current Heuristics . . . . . . . . 122
6.9.3 Miscellaneous Improvements . . . . . . . . . . . . . . . . . . . . . . . 124
6.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7 Beyond Radiology 127
7.1 Error Detection in the Greater Context . . . . . . . . . . . . . . . . . . . . . 127
7.1.1 The Methodology in Other Domains . . . . . . . . . . . . . . . . . . . 127
7.2 Cognitive Science Perspectives on Error Detection . . . . . . . . . . . . . . . 128
7.2.1 Error Detection: Applications in Neuro- and Psycholinguistics . . . . 129
7.2.2 Error Detection and Language Acquisition . . . . . . . . . . . . . . . . 131
7.3 Quality Control in NLP Applications . . . . . . . . . . . . . . . . . . . . . . . 133
7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
8 Conclusions 138
A Glossary of Medical and Non-Medical Terms 141
A.1 Radiology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
A.2 Computational Linguistics/ Knowledge Representation . . . . . . . . . . . . . 143
A.3 Automated Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . 147
A.4 Miscellaneous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
B Ontologies in Healthcare 149
B.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
B.1.1 Controlled Medical Vocabulary . . . . . . . . . . . . . . . . . . . . . . 149
B.1.2 Semantic Lexicon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
B.1.3 Ontology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
B.1.4 The Continuum of Knowledge Representation . . . . . . . . . . . . . . 151
B.1.5 Principles of Good Ontologies . . . . . . . . . . . . . . . . . . . . . . . 153
B.2 Methods of Knowledge Representation . . . . . . . . . . . . . . . . . . . . . . 157
B.2.1 First Order Predicate Calculus (FOPC) . . . . . . . . . . . . . . . . . 158
B.2.2 Semantic Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
B.2.3 Frame-Based Representations . . . . . . . . . . . . . . . . . . . . . . . 159
B.2.4 Description Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
B.3 Medical Terminologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
B.3.1 Existing Vocabularies and Ontologies . . . . . . . . . . . . . . . . . . 162
B.3.2 Issues in Medical Informatics/Ontologies in General . . . . . . . . . . 170
B.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
C All Results 174
Bibliography 178
List of Tables
3.1 An example of the usefulness of co-occurrence relations in determining similarity between documents and queries [96, Page 554] . . . . . . . . . . 40
5.1 Co-occurrence statistics for “quadriceps”. . . . . . . . . . . . . . . . . . . . . 79
5.2 CHR parser results on all error types. . . . . . . . . . . . . . . . . . . . . . . 88
C.1 Co-occurrence analysis with windowsize=3, threshold=0. . . . . . . . . . . . . 174
C.2 Co-occurrence analysis on entire error set, windowsize=collocation . . . . . . 174
C.3 Co-occurrence analysis on non-stop-words only, windowsize=collocation . . . 175
C.4 Co-occurrence analysis on entire error set, windowsize=1 . . . . . . . . . . . 175
C.5 Co-occurrence analysis on non-stop-words only, windowsize=1 . . . . . . . . 175
C.6 Co-occurrence analysis on entire error set, windowsize=10 . . . . . . . . . . . 175
C.7 Co-occurrence analysis on non-stop-words only, windowsize=10 . . . . . . . . 175
C.8 PMI analysis on entire error set, windowsize=collocation . . . . . . . . . . . . 176
C.9 PMI analysis on non-stop-words only, windowsize=collocation . . . . . . . . . 176
C.10 PMI analysis on entire error set, windowsize=1 . . . . . . . . . . . . . . . . . 176
C.11 PMI analysis on non-stop-words only, windowsize=1 . . . . . . . . . . . . . . 176
C.12 PMI analysis on entire error set, windowsize=10 . . . . . . . . . . . . . . . . 177
C.13 PMI analysis on non-stop-words only, windowsize=10 . . . . . . . . . . . . . 177
C.14 Combined heuristics on all errors based upon top f-measure. . . . . . . . . . . 177
C.15 Combined heuristics on all errors based upon top recall score. . . . . . . . . . 177
List of Figures
2.1 Typical radiology workstation. . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1 The relevant, overlapping error levels in radiology. . . . . . . . . . . . . . . . 26
3.2 The noisy channel model, based on Jurafsky and Martin, Figure 7.1, [81, page 237] . . . . . . . . . . 29
4.1 The abstract hybrid system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 A Venn diagram showing the similarities between ER and AE. . . . . . . . . 73
5.1 CA results based upon report type. . . . . . . . . . . . . . . . . . . . . . . . . 92
5.2 CA recall results for 3 window sizes. . . . . . . . . . . . . . . . . . . . . . . . 93
5.3 CA precision results for 3 window sizes. . . . . . . . . . . . . . . . . . . . . . 94
5.4 CA f-measure results for 3 window sizes. . . . . . . . . . . . . . . . . . . . . . 94
5.5 PMI recall results for 3 window sizes. . . . . . . . . . . . . . . . . . . . . . . . 98
5.6 PMI precision results for 3 window sizes. . . . . . . . . . . . . . . . . . . . . . 99
5.7 PMI f-measure results for 3 window sizes. . . . . . . . . . . . . . . . . . . . . 99
5.8 PMI versus Co-occurrence Analysis (COA). . . . . . . . . . . . . . . . . . . . 100
5.9 Combined heuristics on all errors based upon top f-measure (overall performance) . . . . . . . . . . 101
6.1 The error detection process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.2 Sample output using a grey-scale confidence indication. . . . . . . . . . . . . 117
6.3 The full system as envisioned. . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
B.1 The Knowledge Continuum. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
Chapter 1
The Thesis
1.1 The Thesis
Post-speech-recognition, hybrid error detection is an effective means to recover from low
recognition rates in radiology report dictation. In the following pages I will define precisely
“post-speech-recognition, hybrid error detection” and present the necessary evidence in
defense of this thesis, along with the contributions arising as a direct result of my research.
This includes applications extending beyond radiology to other domains, demonstrating the
wider context of this work.
This chapter provides an introduction and brief overview of the entire dissertation, in-
cluding a summary of the motivations, research questions, and hypotheses, as well as the
resulting contributions.
1.1.1 Summary of Contributions
This dissertation comprises four original contributions to the general problem of error de-
tection in natural-language text:
• A classification of error-detection methods for speech recognition.
• A hybrid error-detection methodology.
• A successful proof of concept applying the hybrid methodology to radiology report
dictation.
• Two theoretical applications of the technology beyond the domain of radiology.
In addition, other possible applications of the hybrid methodology are explored, demonstrating its broad applicability and underscoring the relevance of this contribution.
1.2 Motivation
Medical informatics is the study of information as it pertains to medicine. This notion
includes an impressive array of knowledge tasks, including representation, storage and re-
trieval, communication across information systems, and standards development. These are
applied to a wide range of medical tasks, such as management and billing, electronic pa-
tient records, and automated diagnosis. Equally important is the interface between people
and information. This interface can take many forms, from simple text dictation to more
advanced query engines that rely on complicated knowledge representation.
The field of Medical Language Processing (MLP) sits at the intersection of medical
informatics¹ and natural language processing. The ultimate aim is the seamless integra-
tion of medical information management with a natural language interface. Users are able
to communicate with the technology in their native tongue, as opposed to learning an
artificial language or alternative interface. The goal is minimized training requirements,
improved integration, and easier handling. This translates into increased acceptance of
medical-informatics technology and a greater willingness to adapt on the part of clinicians.
Indeed, user acceptance can be the sole determining factor in the survival of new technology
[97].
One of the ways in which MLP has influenced medicine is the introduction of automated
speech recognizers to medical dictation. The hope is to provide a hands-free means of record-
ing medical information, unchaining the physician from his pen and paper and providing an
electronic means of cataloguing information. This is particularly appropriate in the radi-
ology reading room, where radiologists routinely examine radiological imagery and dictate
their findings into a recording device. These reports are later transcribed and returned to a
radiologist for approval: a process that in its entirety can take days or more. With the onset
of automatic speech recognition (ASR) technology, however, the promise of truly hands-free
dictation and efficient reporting seems not far off. ASR can offer improved patient care
and resource management in the form of reduced report turnaround times (TATs), reduced
¹This is sometimes extended to the wider discipline of bioinformatics.
staffing needs, and the efficient completion and distribution of reports [72, 99]. The ability
to immediately revise a report also means that the radiologist will be freshly familiar with
the case. This translates directly into improved patient care.
In some radiology clinics, such benefits coupled with recent improvements in ASR tech-
nology have motivated the introduction of automated transcription software in lieu of human
stenographers². Yet as the technology comes of age, with vendors claiming accuracy rates
as high as 99%, the potential advantages of ASR over traditional dictation methods are not
being realized, leaving many radiologists frustrated with the technology [97, 52].
1.3 Main Research Questions
In light of the problems of ASR in the radiological setting, the following research questions
are put forth:
• How can the accuracy of speech recognition in radiology be improved?
• What is the current state of post-recognition error detection?
• How can the current state of error detection be improved, and applied to the problem
of radiology report dictation?
• What is the nature of recognition errors?
• What is needed for a general theory of error-detection methods as they relate to speech
recognition?
• How can this knowledge be combined into an error-detection methodology?
• What are the implications for this error-detection theory and methodology beyond
radiology?
The following chapters will present an in-depth look at each of these research questions.
²Also known as transcriptionists.
1.4 Introduction to the Primary Research Problem
The primary reason behind the apparent failure of ASR in radiology is accuracy. A 99%-
accurate speech recognizer still averages one error out of every hundred words, with no
guarantee as to the seriousness of such errors. Furthermore, actual accuracy rates in the
reading room often fall short of 99%. Radiologists are instead forced to maintain their
transcriptionists as correctionists, or to double as copy editors, painstakingly correcting
each case, often for nonsensical or inconspicuous errors. Not only is this frustrating, but it
is a poor use of time and resources. To compound matters, problems integrating with the
radiology suite and the introduction of delays have further soured many radiologists on the
technology. Those choosing to modernize their reading rooms with ASR software are often
plagued with difficulties, while those continuing to use traditional reporting methods have
no incentive to upgrade.
Within medicine the problem of accuracy is particularly insidious as the ramifications
of errors within a report can have serious consequences. According to the U.S. Institute of
Medicine, as many as 98,000 people in the United States die annually from medical errors
[104].
This section examines the integration of ASR into the existing report-dictation process
within the radiology reading room (the setting where images are interpreted by radiologists).
1.4.1 ASR in the Reading Room
Like an assembly line, radiology reporting relies on the order and completion of certain
events to run smoothly:
1. Physician submits exam requisition for a patient.
2. Patient is scanned and a radiograph (image) is generated.
3. Radiologist interprets the radiograph and simultaneously dictates his report into a
recording device in the radiology reading room.
4. The recording is added to the stenographer’s transcription queue.
5. The report is transcribed by the stenographer.
6. Lastly, the transcribed report is returned to a radiologist for final approval and sent
on to the requesting physician.
The time it takes for the above process to complete is referred to as the report turnaround
time (TAT). The ultimate goal of ASR is to improve the TAT as well as the reporting
process by removing steps four and five, while enhancing the radiologist’s experience in step
three. With ASR, instead of waiting in a transcription queue, a dictated report is immediately transcribed, proofread, signed off, and sent to the referring physician.
Although this process is theoretically more efficient, the accuracy of existing ASR technology, combined with poor interface design, results in time wasted on painstaking corrections.
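The contrast between the two workflows can be sketched in a few lines of Python. This is a toy illustration, not part of the thesis software; the step descriptions simply paraphrase the numbered list above:

```python
# Toy model of the radiology reporting pipeline (illustrative only).
TRADITIONAL_WORKFLOW = [
    "physician submits exam requisition",             # step 1
    "patient is scanned; radiograph generated",       # step 2
    "radiologist interprets image and dictates",      # step 3
    "recording queued for stenographer",              # step 4 (removed by ASR)
    "stenographer transcribes report",                # step 5 (removed by ASR)
    "radiologist approves; report sent to physician", # step 6
]

def asr_workflow(steps):
    """ASR replaces the queueing and transcription steps (4 and 5)
    with immediate machine transcription at dictation time."""
    return [s for i, s in enumerate(steps, start=1) if i not in (4, 5)]

print(asr_workflow(TRADITIONAL_WORKFLOW))
```

Removing steps four and five is where the promised reduction in turnaround time comes from; the difficulty, as discussed below, is that correcting recognition errors erodes this gain.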
In contrast, human stenographers are highly trained and familiar with radiological par-
lance. Errors or ambiguous areas of the recording can be actively clarified with the radiolo-
gist. Thus, when it comes time to sign off on the report, the radiologist need only perform
a quick skim to confirm that everything is in order. If the stenographers are replaced by
speech recognizers, though, this revising now must fall on the radiologist himself. Not only
is this frustrating for the radiologist, who wishes to focus on image interpretation, but it is
a poor use of the highly paid radiologist’s time. As an answer, some have suggested hiring
correctionists; however, this effectively re-invents the role of the transcriptionist and negates
most of ASR’s benefits over traditional dictation.
Since it is unlikely that speech recognizers will achieve 100% accuracy any time in the
near future, especially within medicine, the overarching research question is how to make the
technology work in the present. By creating a means for post-recognition analysis, some of
the burden of transcription can be shifted to systems tasked with more in-depth processing
of the information contained in the text. Such processing can allow a level of “damage
control” in the form of error detection and ultimately error correction. Errors made at the
ASR level can be detected in an auxiliary system that sits between the physician and the
dictation system. This can manifest itself in several ways: from a simple detection system
that allows the radiologist to efficiently skim the reports for tagged errors, or areas of low
confidence; to a full error correction system that corrects the text based on an advanced
analysis of the contents. If efficiently designed and seamlessly integrated, the time spent
proofreading is significantly reduced and the benefits of speech recognition over traditional
transcription are regained. In addition, corrected text lends itself to further processing by MLP systems.
An advantage of processing post-dictation over direct integration with ASR is the ability
to detect errors that may have been mistakenly introduced by the radiologist, in addition
to ASR errors. Moreover, what researchers may lose in not integrating directly with ASR is
arguably gained in software independence – a post-ASR system is not bound to a particular
speech recognizer and therefore can be readily modified and updated, and used in any clinic.
In addition, proprietary software restrictions on speech recognizers often make it challenging
for researchers to integrate their technology. As Jeong et al. observe, “If the speech recognizer
can be regarded as a black-box, we can perform robust and flexible domain adaptation
through the post error correction process.” [77].
In summary, the primary research problem is the development of an intelligent error
detection (and ultimately correction) system for radiology reporting that is sensitive to the
domain, and capable of capturing the sorts of errors made by speech recognizers and the
radiologists themselves.
1.5 Extant Work
While not a new problem, post-recognition error detection has never been applied to ASR
in radiology reporting. Previous work has focused on dialogue systems, giving rise to a
variety of error-detection techniques that could be extended to radiology report dictation.
This section offers a brief introduction to the status of error detection and other relevant
research areas. It is expanded upon in Chapter 3.
Most approaches to error detection have been statistical, relying on
N-gram models and pattern matching. By collecting co-occurrence statistics on each word
in the relevant corpus, it is possible to establish a list of context words that have a high
probability of occurring near a particular word. When an error is detected in a string, its
context is matched to the database and the corresponding corrected text is substituted.
Both Kaki et al. [82] and Sarma et al. [131] rely on co-occurrence statistics to determine
the likelihood of a recognition error. By analysing the context of a given target word, it
is possible to determine if the words in that context “match” the target word or another
recognition candidate better. Some researchers experimented with expanding the target
word to include one or more of the surrounding words [1, 121]. This target tuple is then
compared to the context window. Still other researchers have broken the target word up
into component syllables to capitalize on sub-syllabic features [77].
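The co-occurrence idea behind these systems can be illustrated with a minimal sketch (this is a simplification for exposition, not the actual methods of the cited authors; the corpus, window size, and scoring rule are all illustrative assumptions):

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each word co-occurs with its neighbours
    within a fixed-size window, over a training corpus."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        words = sent.split()
        for i, w in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[w][words[j]] += 1
    return counts

def context_score(counts, target, context):
    """Fraction of context words previously observed near the target.
    A low score flags the target word as a possible recognition error."""
    if not context:
        return 0.0
    seen = counts.get(target, {})
    return sum(1 for c in context if c in seen) / len(context)
```

On a corpus of report sentences, a correctly recognized word such as “effusion” would score highly against the context “pleural . . . is”, while an intruder such as “sauna” would score near zero, flagging it for review.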
Alternative approaches to error detection involve language modeling, such as the noisy
channel model, where acoustic input is treated as a “noisy” version of the source sentence
and correspondingly “decoded” in an effort to find the “true” underlying utterance. All
possible utterances are considered as a match for the noisy input, and the one with the
highest probability is then selected [81].
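Formally, following the standard presentation in Jurafsky and Martin [81], the decoder seeks the word sequence that is most probable given the acoustic input:

```latex
\hat{W} \;=\; \operatorname*{arg\,max}_{W} P(W \mid A)
       \;=\; \operatorname*{arg\,max}_{W} \frac{P(A \mid W)\,P(W)}{P(A)}
       \;=\; \operatorname*{arg\,max}_{W} P(A \mid W)\,P(W)
```

where $A$ is the observed (“noisy”) acoustic input, $P(W)$ is the language-model prior over candidate utterances, and $P(A \mid W)$ is the acoustic (channel) model; $P(A)$ is constant across candidates and can be dropped.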
From a semantic perspective, Jeong et al. have experimented with lexico-semantic templates based on abstractions of particular word sequences found in the training data [77].
When an error is suspected, queries are matched to templates; templates with the minimum
distance from the query are selected as replacement candidates.
Outside of work explicitly in error detection, conceptual similarity offers promise as a
means for detecting errors of semantic origin. Given an ontology3, it is possible to determine
the distance between any two concepts, either directly via edge distance, or via more complex
measures involving, for example, information content [23, 103, 78, 118].
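A minimal sketch of edge-distance similarity over a toy is-a hierarchy follows; the concepts below are purely illustrative and not drawn from any real medical ontology:

```python
from collections import deque

# Toy is-a hierarchy (hypothetical concepts, for illustration only)
ONTOLOGY = {
    "finding": ["opacity", "effusion"],
    "opacity": ["nodule"],
    "effusion": ["pleural effusion"],
}

def build_adjacency(ontology):
    """Treat each parent-child link as an undirected edge."""
    adj = {}
    for parent, children in ontology.items():
        for child in children:
            adj.setdefault(parent, set()).add(child)
            adj.setdefault(child, set()).add(parent)
    return adj

def edge_distance(ontology, a, b):
    """Shortest-path length (in edges) between two concepts,
    via breadth-first search; None if either is unknown or unreachable."""
    adj = build_adjacency(ontology)
    if a not in adj or b not in adj:
        return None
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

A word whose concept lies far (in edge distance) from the concepts of its surrounding context is then a candidate semantic error.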
In all cases, the performance of these error-detection methods is contingent upon the
type of error made. That is, their coverage of error types varies. For example, concep-
tual similarity may detect errors of semantic origin, but may not detect a sentence that is
ungrammatical.
1.6 Beyond Radiology
In addition to the focus on the problem of ASR accuracy in radiology, the work presented
here will be shown to have application in the greater context of natural language processing
(NLP), and beyond, to problems of assessing language in cognitive science.
The qualitative assessment of text in NLP has been a recent focus in the literature,
particularly as it pertains to machine translation. Systems such as BLEU and ROUGE have
been popular approaches to assessing text quality on the basis of similarity to a reference
document [108, 34].
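The overlap idea underlying such reference-based metrics can be sketched as a single n-gram precision; note this is a toy illustration, whereas the real BLEU combines several n-gram orders with clipped counts and a brevity penalty:

```python
def ngram_precision(candidate, reference, n=1):
    """Fraction of the candidate's n-grams that also occur in the
    reference. A toy version of the overlap scoring behind BLEU."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = [tuple(cand[i:i + n]) for i in range(len(cand) - n + 1)]
    ref_ngrams = {tuple(ref[i:i + n]) for i in range(len(ref) - n + 1)}
    if not cand_ngrams:
        return 0.0
    return sum(1 for g in cand_ngrams if g in ref_ngrams) / len(cand_ngrams)
```

For instance, “the heart is normal” scores 0.75 in unigram precision against the reference “the heart is enlarged”, since three of its four words appear in the reference.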
Furthermore, automated methods for assessing language acquisition and pathology can
help lead the way to technology for rehabilitation, and a greater understanding of the brain
with respect to language processing.
1.7 Hypotheses
It is now possible to offer the following hypotheses, to be addressed in the coming chapters:
3 See Appendix B for an in-depth look at ontologies in healthcare.
• As a post-processing stage, methods in medical language processing can effectively
detect recognition errors in radiology reports dictated via ASR.
• Combining complementary methods of error detection results in improved sensitivity
to report errors.
• Tagging erroneous reports based on the quality of the recognized output can avoid the need for
an in-depth re-read of the report.
• Post-recognition error detection is an effective means to improve ASR in radiology
reporting.
• Post-recognition error detection has applications beyond radiology reporting.
1.8 Canonical Organization
This dissertation is arranged as follows. Chapter 1 has laid the groundwork, including the
introduction and motivation for the research described in the remaining chapters. Chapter
2 presents a general introduction to the field of medical language processing, placing the
intended application and proof of concept in the general context, as well as offering motiva-
tion for error detection and analysis within radiology. Chapter 3 introduces the classification
of error-detection methods within speech recognition, providing an objective framework in
which the conceptualization of the hybrid methodology can be integrated in Chapter 4. In
Chapter 5, this conceptualization is exemplified through proof of concept in the intended
domain, namely radiological report dictation, while the ramifications of this application and
the surrounding research are presented in Chapter 6. Although these contributions grew out
of the research on improving ASR in radiology, it is found that the methods and theories
find broader use beyond MLP. Thus, to demonstrate this greater context, Chapter 7 explores
two major applications of the methodology in cognitive science and natural language
processing in general, and sets the stage for future research. Finally, Chapter 8 summarizes
the research and contributions.
Three appendices are provided for the convenience of the reader. The terminology used
throughout this document follows the definitions provided in Appendix A, when not provided
in the main body of the text. Appendix B is an introduction to ontologies in healthcare and
provides additional information supporting the choice of ontology in Chapter 4. Finally,
Appendix C provides all of the experimental results from Chapter 5.
Chapter 2
An Introduction to Medical
Language Processing
2.1 Introduction
Since the late 1950s, medicine has been attracting researchers in artificial intelligence (AI)
[2]. Initially, medical diagnosis was a primary focus due to its highly structured reasoning
tasks, and met with reasonable success in terms of performance. Clinicians were nonetheless
displeased with the resulting technology; from their perspective, re-entering data that was
already in a patient’s paper chart into a computer seemed redundant
and a waste of their time. Consequently, the focus of AI in medicine began to shift from
the comparatively simple task of diagnosis to the challenge of automated data-acquisition,
natural language processing (NLP), and knowledge representation. Together these comprise
many of the research goals of medical language processing, or MLP1.
From a language-processing perspective, medicine is an ideal research area. Although
the medical domain is expansive, it remains a constrained domain with a large corpus of
literature suitable for MLP research. Furthermore, limited human resources, high report
turnaround times (TATs) of days or more, as well as an ever-increasing need to improve the
cost-benefit ratio have greatly increased the attraction of automated systems that enhance
medical document handling.
1 Readers wishing a more in-depth introduction to natural language processing than is possible here are referred to Jurafsky and Martin [81], and Manning and Schütze [96].
2.1.1 Medical Language Processing
The task of mapping natural language into the requisite medical terminology is no small feat.
While MLP reduces the domain of language to medicine, the nonetheless large vocabulary,
along with the tendency of medical professionals to use incomplete sentences, ensures the
domain is no trivial one: the scarcity of successful, comprehensive MLP systems currently
deployed in real clinical settings is testament to this. As a result, studies in mapping medical
free text into structured, machine-readable documentation have typically focused on one or
more sub-domains of medicine, such as radiology.
Researchers have been working to incorporate automatic speech recognition (ASR) in
radiology reporting for many years. Despite what may seem a natural pairing, a great deal
of research remains before the technology is suitable for wide-scale use. Problems
with integration in the radiology environment, accuracy, and the introduction of delays have
soured many radiologists on the technology. Nevertheless, the alluring potential for signif-
icant gains in efficiency, and greater overall data interactivity is leading many to speculate
that ASR is yet the way of the future.
In addition to ASR, there is currently research on the automated interpretation of radi-
ology reports. Modern methods in radiology reporting leave large amounts of information
effectively “inaccessible” in the form of free-text reports. “Free text” refers to unrestricted,
freely dictated reports, in contrast with “structured” reports, where the radiologist is
confined to a pre-formatted report with restrictions such as word count. While free-text reporting
allows the radiologist more freedom and is generally the more common format for dictation,
it is difficult to search, analyze or even summarize the information contained within the
resulting text report. In an effort to overcome these challenges, and to improve patient care
overall, systems that automatically interpret free-text reports and translate them into a
structured, machine-readable format are being developed, such as the MedLEE system [49].
The benefits of these automated interpretation and summarization systems coupled with
ASR are numerous. In hospitals, these include improved overall efficiency, reducing
report TATs from days to a few hours or less; enhanced patient monitoring2; and improved
data storage. As well, since natural language is the communication medium, MLP systems
are theoretically easier to use and require less training time than other interfaces. Moreover,
2 For example, patient charts and records can be automatically scanned for potential drug interactions that may have been overlooked.
radiographs can be made available throughout the hospital as soon as they are complete,
or even remotely via the Internet. When summarized, the information in a report becomes
accessible, meaning that clinicians can not only access past cases with greater efficiency, but
also that these reports can now be analyzed by computer. The result is radiological data that
is useful not only to clinicians, but to researchers, statisticians, and decision support teams
as well, and a more efficient and cost-effective environment.
Some examples of MLP technology include:
• Intelligent searching (not only medical records and patient reports, but the Internet
as well);
• Decision support;
• Diagnosis;
• Automated structuring of free-text reports (natural language understanding);
• Speech recognition of dictated reports.
These technologies are applicable throughout the medical world, including a wide variety
of hospital departments.
2.2 General Challenges in MLP
In the past, medicine has been criticized as the only major industry still relying on hand-
written documentation [2, page 70]. Although progress is being made, the transition from
hand-written to computerized documentation is challenged by the need for a standardized
terminology, and a means for mapping natural language into this terminology. Several
projects are currently underway for the development of extensive terminologies and ontolo-
gies for just this purpose. Examples include the UMLS, SNOMED, and GALEN lexicons,
which are discussed in more detail in Appendix B.
Perhaps the greatest challenge in language processing is ambiguity in the input. Am-
biguity arises when there exists more than one interpretation for a given statement, due
to the structure, syntax, or semantics of the expression. For instance, in the sentence “the
cat saw the dog on the mat”, there is more than one answer to the question “who is on
the mat?”. Other sources of ambiguity are lexical units that have more than one semantic
interpretation; for instance, “scarf” can refer to the knitted item worn around one’s neck,
or, it can refer to the verb, as in to “scarf” one’s food. In the case of speech recognition,
homophones (words or phrases that sound the same but are semantically distinct) also be-
come an issue, such as “aisle” and “I’ll”. Here the ASR system is forced to guess which
lexeme is appropriate based on the context. If the choice is between a verb and a noun, this
can be a relatively easy problem to solve based on grammaticality; however, if all of the
variations are the same part of speech, the problem is more challenging.
Qualification also affects MLP. Consider the phrases “possible cardiomegaly”, “heart
may be enlarged”, or “heart is probably enlarged”; in each instance a slightly different
qualification of the condition of the heart is provided, yet the meanings are extremely similar.
As Rector observes, “[h]uman users may be able to recognize that these are essentially the
same, but the rules for doing so must be made explicit to be usable by the computer” [114,
page 245]. The challenge is recognizing when it is necessary to capture differing qualifications
as distinct, and how best to represent this information when relevant.
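One way to make such rules explicit, as Rector suggests, is to map hedging cues onto an ordinal certainty scale. The cues and numeric scores below are purely illustrative assumptions, not an established scale:

```python
# Hypothetical qualifier-to-certainty mapping; first match wins.
QUALIFIERS = [
    ("possible", 0.3),
    ("may be", 0.4),
    ("probably", 0.7),
]

def certainty(phrase, default=0.9):
    """Return an illustrative certainty score for a finding phrase,
    based on the first hedging cue found; unqualified statements
    default to high certainty."""
    low = phrase.lower()
    for cue, score in QUALIFIERS:
        if cue in low:
            return score
    return default
```

Under this toy scale, “possible cardiomegaly”, “heart may be enlarged”, and “heart is probably enlarged” collapse to the same condition at three nearby certainty levels, making their near-equivalence explicit to the machine.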
Negation is a similar problem. While testing for the presence of a negating word such
as “no” may seem relatively straightforward, the challenge lies in the multitude of ways in
which negation can be expressed, and in determining the scope of negation. For example,
consider the differences between “pneumonia is not present” and “no pneumonia”; both
could be erroneously classified as indicative of pneumonia without the ability to accurately
detect negation.
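A crude sketch of trigger-based negation detection follows, in the spirit of systems such as NegEx but without scope termination or pseudo-negation handling; the trigger list and window size are illustrative assumptions:

```python
# Single-token negation triggers (illustrative; real systems use
# larger phrase lists and handle scope boundaries).
NEGATION_TRIGGERS = ("no", "not", "without", "denies")

def is_negated(sentence, finding, scope=4):
    """Return True if a negation trigger occurs within `scope`
    tokens of the finding, on either side. Crude: would also fire
    on constructions such as 'not only ...'."""
    tokens = sentence.lower().split()
    if finding not in tokens:
        return False
    idx = tokens.index(finding)
    nearby = tokens[max(0, idx - scope):idx] + tokens[idx + 1:idx + 1 + scope]
    return any(t in NEGATION_TRIGGERS for t in nearby)
```

Both “no pneumonia” and “pneumonia is not present” are flagged as negated, while “pneumonia is present” is not, capturing the two surface forms discussed above.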
Parsing coordination also causes difficulties within MLP. Part of this difficulty may be
blamed on the flexibility of English, and many other languages, in coordinating structures.
For instance, in English, any two constituents (e.g. noun phrase, verb phrase, et cetera),
even of differing kind, can be joined in a coordinating structure [155]. The sentence “John
was rich and a doctor” sees the conjunction of the adjectival phrase “rich” with the noun
phrase “a doctor”. Furthermore, it is often the case that information is missing from one of
the conjuncts. For example, the sentence “evidence of opacities and bullae in the right eye”
can be interpreted as there being evidence of [opacities in the right eye] and [bullae in the
right eye]; or, alternatively, as evidence of [opacities] (the location of which is unknown) and
[bullae in the right eye]. While possibly clear to a clinician, for a computer these ambiguities
must be explicitly resolved.
Figure 2.1: Typical radiology workstation, comprising an image display, a report (working) window, a keyboard, and a headset.
2.3 Medical Language Processing in Radiology
2.3.1 The Radiology Environment
As outlined in the first chapter (Section 1.4.1), in most radiology departments once an
examination is complete and the images are ready, a report is dictated and recorded by
the radiologist (not necessarily immediately following the examination). This recording is
then sent to the transcription department where it is added to the queue of reports to be
transcribed. A transcriptionist types the report, checks it for errors and sends it back to
the radiology department for verification and signing3. It is interesting to note that in
some high-volume facilities, a stenographer is present in the reading room for immediate
transcription of the dictated report. Most facilities, however, do not have the volume to
justify the luxury of a dedicated transcriptionist [83].
Figure 2.1 shows a typical radiology workstation layout.
3The signing radiologist is not necessarily the same radiologist who prepared the report.
2.3.2 The Radiology Report
Although variations may be seen from one radiology clinic to the next, the following shows
a typical layout of a radiology report:
• PATIENT AND HOSPITAL INFORMATION (Demographics)
– Name; hospital or clinic identification number; et cetera.
– Referring physician; Radiologist dictating.
– Date of exam; Date of report.
• “MRI OF THE LUMBAR SPINE”
– Title sentence indicating scan type and anatomical region of study.
• HISTORY
– Patient history such as onset of condition, family history, et cetera.
• TECHNIQUE
– Description of scanning technique, including any special procedures. When using ASR this is often “canned”, that is, stored as a pre-defined block of text that is selected at the time of dictation.
• FINDINGS
– The radiologist’s report on his findings on examination of the radiograph.
• IMPRESSIONS
– The radiologist’s conclusions based upon the findings he has reported. This is often dictated in bulleted format, and repeats any significant observations made in the FINDINGS section.
• SIGNATURE
– The signing radiologist’s approval, following dictation and transcription, that the report has been verified as correct.
2.3.3 Improving Radiology Reporting
Although in place for many years, the radiology-reporting system has much room for im-
provement. The introduction of PACS (Picture Archiving and Communication Systems) [8]
and improved RIS (Radiology Information Systems) has been a step in the right direction.
Similarly, adding MLP technology to the mix could help revolutionize radiology reporting.
The average report turnaround time, or TAT, is the time it takes the referring physi-
cian to receive the completed report. A critical factor in measuring the productivity and
workflow of the radiology clinic, TATs often run to days or more [99]. This is attributable,
in part, to the many steps in the reporting process – particularly problematic is
the wait for reports to be transcribed and signed off. Furthermore, most radiology reports
are free text, making automated interpretation or analysis difficult. As a partial solution
to the problems inherent in the current reporting system, some hospitals and clinics have
begun adopting ASR systems to augment their current dictation methods. The hope is to
ultimately eliminate the role of transcriptionist, improving the TAT and overall efficiency
[72].
Accuracy
Current ASR technology, however, is proving insufficient for widespread use in the radiology
department. Although vendors may claim accuracy rates as high as 99%, this still translates
into one error out of every 100 words. Thus, the possibility of having an error-free report is
almost non-existent4. Dr. Forster, who works at a radiology clinic using ASR, suggests that
as few as 10% of reports are error-free [43]. Where formerly a trained transcriptionist
would make the necessary corrections, with ASR the role of transcriptionist is removed and
the radiologist must make any corrections himself. Consider this: a radiologist reading 60
exams, who requires 90 seconds per report to proofread, would need to increase his day by
approximately 1.5 hours in order to turn over the same number of reports [52]. Not only
are these corrections essential given the ramifications of errors in healthcare, but they are
costly due to the high salary of radiologists and the time necessary to complete them.
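The arithmetic behind this estimate is straightforward:

```python
# Back-of-the-envelope estimate of the added daily proofreading burden,
# using the figures cited above (60 exams, 90 seconds per report).
reports_per_day = 60
proofread_seconds_per_report = 90

extra_seconds = reports_per_day * proofread_seconds_per_report  # 5400 s
extra_hours = extra_seconds / 3600                              # 1.5 h
print(f"Added proofreading time: {extra_hours:.1f} hours per day")
```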
Additionally, the 99-percent figure cited by vendors reflects near-perfect dictation in a
near-perfect environment – not a likely situation in the often frenetic environment of the
4With the exception of “canned”, brief reports such as a normal chest X-ray.
hospital. This problem is further compounded by the challenge of detecting recognition er-
rors (words incorrectly recorded by the speech recognizer). Frequently they are not replaced
by nonsense words or gibberish, but instead by the next best match in the terminology of the
ASR system (missing words are also very frequent, especially small words such as prepositions).
As a result, the errors are often inconspicuous and easily overlooked. This problem
is looked at in more detail in Chapter 3 and again in Chapter 6. Moreover, while actual
medical errors are less common, the presence of nonsensical words is a detriment to the
credibility of the report (for example, misrecognitions such as “sauna” for “centimetre” and
random word insertion errors, such as “jungle”, that are clearly unrelated to the text). Due
to the unpredictable nature of ASR, there is also a frequent need to monitor the dictation
screen in addition to the image, adding visual strain. This also complicates the dictation
task as the radiologist is forced to keep what he is going to say in his mind, while ensuring
that what he has already said is being accurately recorded.
Some researchers have suggested that hiring correctionists will result in lower costs and
more efficient use of a radiologist’s time. The difficulty in detecting these errors, though,
requires that the correctionists be highly trained, which in turn increases their salary. Con-
sequently, the cost-time benefit of ASR versus transcriptionists is lost.
In the case of incorrect words that are detected at the time of dictation, most ASR
systems allow the user to retrain the system on those particular words. Since this is a time-
consuming task, radiologists frequently opt to simply type in the replacement word directly
[43]. As a result, the machine-learning capabilities present in the system are never able
to improve the accuracy of the system and are consequently of little value in this setting.
Therefore, a system that is accurate “right out of the box” will be more valuable.
Users with accents, such as non-native speakers, may also experience a higher error rate [69].
Similarly, a cold or other condition affecting the quality of a user’s voice on any given day
may have a detrimental effect on the accuracy of ASR.
Other Issues Affecting ASR
Unfortunately, in addition to these problems, the adoption of ASR technology into the
radiology department is not necessarily smooth. The attitude of the radiologists can
have a profound impact on the success of new technology [97]. This is exacerbated by poor or
incomplete training, often due to the unavailability of radiologists during vendors’ limited
training periods. This is worsened still by an acclimatization period where productivity
drops as users adjust to a new system and its idiosyncrasies [99]. It is difficult to encourage
users to adapt to new technology when it does not immediately benefit them. Thus, support
from senior management is crucial along with removing alternate dictation systems that may
hinder a radiologist’s ability to adapt [99].
When upgrading to ASR (or upgrading the in-place ASR), there is a risk of software
compatibility issues [72]. It is a difficult task to test new software alongside existing appli-
cations; most vendors do not have the means to set up test environments that accurately
reflect the clinical setting of their clients. Furthermore, much of the software that is at risk
for conflict is often licensed and unavailable to the vendor for compatibility testing before
the ASR program reaches the client [72].
Difficulties in the integration with existing hospital information systems and PACS also
complicate matters. This can reduce efficiency and introduce further errors in the dictation
process. If it is necessary to load the reports separately into the dictation software and
then into PACS, for instance, this can add approximately twenty seconds per report [69]. In
addition, swapping between menus in both systems adds time and the potential for errors
(confusing patient identification numbers, for example), and disturbs the workflow of the
radiologist. Thus, the integration of these systems is vital.
Despite these challenges, many feel that the introduction of ASR into the radiology
suite remains a worthwhile endeavour, and one that some radiologists are now referring to
as inevitable.
2.3.4 Automated Interpretation
In recent years, researchers in MLP have started to tackle the problem of automatically
interpreting and structuring radiology free-text reports so that they are more accessible to
computer analysis and querying5. This section examines some of the relevant issues.
Report Summarization
Once a report has been dictated, the next step in post-processing is summarization. This can
be broken down into several tasks, including tokenization, stemming, part-of-speech tagging,
and parsing (further broken down into syntactic, semantic, and discourse analysis). The first
three tasks are relatively straightforward and can be handled with existing algorithms. The
5 This includes medical free-text reports in general.
parsing stage, however, is more complex. The system must maximally capture information
with minimal errors to ensure that the output is of the highest quality and utility. A
system that introduces errors, or glosses over important information will quickly render
itself useless. The challenge is then to determine what information is of value and what
can be safely glossed over; there is a tradeoff between the granularity of the information
retained and the efficiency of the system.
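The first of these “straightforward” tasks can be sketched in a few lines of Python; the suffix-stripping below is a deliberately naive stand-in for a real stemmer such as Porter’s algorithm:

```python
import re

def tokenize(text):
    """Lowercase the text and split on runs of non-alphanumeric
    characters (a minimal tokenizer)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def naive_stem(token):
    """Strip a few common English suffixes. A toy illustration;
    a production system would use a real stemming algorithm."""
    for suffix in ("ies", "ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token
```

Part-of-speech tagging and parsing, by contrast, require trained models and grammar formalisms well beyond a sketch of this size, which is why the parsing stage dominates the difficulty of summarization.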
Output Formats
As the rapid growth of the Internet demonstrated, the adoption of information
standards such as HTML can help promote a seamless integration across information sys-
tems. The medical field is no exception; similar benefits can be achieved if an information
standard is established for medical information systems and related software, including au-
tomated summarization. To this end, researchers have begun looking at markup languages
based on the Standard Generalized Markup Language (SGML) that will not only standard-
ize medical documents, but also allow them to be readily accessed via the Internet. SGML
itself is overly extensive and thus too complex for many operations; however, a relatively
new markup language based on SGML, Extensible Markup Language, or XML, captures
the power and expressibility of SGML in a simpler, more flexible format [158].
In brief, XML is a markup language that is readable by both computers and humans.
A markup language encases information between two labels, or tags, that help distinguish
the text from instructions for displaying that text or information about the text itself (for
example, highlighting key phrases in a textbook could be considered an example of “marking
up” a text) [56]. XML accomplishes the task of marking up text through tags that best
describe the contents in a human- and computer-readable fashion. Unlike HTML, these tags
do not contain information regarding formatting or display of the text, they simply store
the data in a machine-readable format6.
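As an illustration, a fragment of a report might be marked up as follows; the tag names here are hypothetical and chosen purely for exposition, whereas real systems follow established clinical specifications rather than ad hoc schemas:

```python
import xml.etree.ElementTree as ET

# Build a tiny, hypothetical XML representation of a report fragment.
report = ET.Element("report")
ET.SubElement(report, "title").text = "MRI OF THE LUMBAR SPINE"
findings = ET.SubElement(report, "findings")
obs = ET.SubElement(findings, "observation", negated="false")
obs.text = "disc protrusion at L4-L5"

xml_text = ET.tostring(report, encoding="unicode")
print(xml_text)
```

Note that the tags describe the content (a title, a finding, its negation status) rather than its display, which is precisely what makes the report machine-readable and searchable.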
Major standardization efforts such as those from the HL-7 (Health Level Seven) initia-
tive now employ XML encoding in their clinical specifications. HL-7 is the most common
standard for interfacing clinical data, which “enables disparate healthcare applications to
6 The XML file can then be combined with a language such as HTML or Cascading Style Sheets to display on a webpage, for instance [56].
exchange key sets of clinical and administrative data”7. This includes the well-known clin-
ical context management specification, CCOW (Clinical Context Object Workgroup), that
“enables multiple applications to be automatically coordinated and synchronized in clin-
ically meaningful ways at the point-of-use”8. For example, when a clinician opens up a
patient file within one application, the same patient is simultaneously accessed in all other
applications in the same environment.
Other large-scale standardization efforts include the industry-standard DICOM (Digital
Imaging and Communications in Medicine), which standardizes the communication of medical
images and information [73]. It “enables digital communication between diagnostic and
therapeutic equipment and systems from various manufacturers”9.
2.4 Natural Language Understanding in Medicine
The ability to recognize word dependencies and interrelations is crucial for a system to suc-
cessfully summarize a medical text. Without such “understanding” of the text, words exist
only as independent entities. This limits systems to little more than keyword search and
structural analysis, missing the subtleties present in real language. In medicine, adding NLU
capabilities to an MLP system allows the transition from a passive system that summarizes
data to a system that can actively interact with the data and clinician to give feedback,
and monitor issues such as drug compatibility. This gives rise to a wide array of more com-
plicated and useful applications, including automated clinical decision support and patient
monitoring, intelligent transcription, automated interpretation and structuring of reports,
and intelligent patient records.
Representing Knowledge in Medicine
One of the crucial challenges in MLP is the development of a standardized means for repre-
senting the salient information found in medical reports. This “salient information” is the
relevant information content of the document and is the information for which the computer
must have some representation for summarization tasks and more advanced tasks, such as
7Health Level Seven Homepage: www.hl7.org. Updated regularly; Accessed: February 2006.
8Again the reader is referred to the Health Level Seven Homepage for more information: www.hl7.org.
9The Radiological Society of North America Homepage: www.rsna.org. Updated regularly; Accessed: February 2006.
reasoning. In short, the appropriate formalism must “[be] sufficiently expressive to cap-
ture the information required, computationally tractable for practical cases, and [behave]
predictably in the domain” [114, page 264].
Therefore, in addition to the analysis of a sentence’s structure and meaning, an MLP
system must have a means for representing the information that is contained within [80].
In language processing, the meaning of a particular word is encoded using symbols. This
representation is known as a “type”. Johnson gives the example of the verb treat, which
might be encoded as the type THERAPEUTIC-ACTIVITY10. These types are then further
specified according to hierarchies that identify the relationships that exist between them.
This systematic arrangement is often referred to as an ontology or taxonomy [80], a “set of
definitions, which associate a term (the name of a defined entity) with axioms that constrain
its use and relate it to other terms” [Falasconi, 1994, page 81]. By employing taxonomic
encoding techniques, it is then possible to manage such complex representations using “inex-
pensive” set operations [165, 39]. A closer examination of ontologies in healthcare, including
the challenges present, is provided in Appendix B.
2.5 The Needs of the Radiologist
As with any new technology that is to be incorporated into an existing infrastructure, if
the integration is to be successful it must take into account the users of the technology. All
too often software engineers work hard at developing systems for a particular field without
actually interviewing those who will be using them. Consequently, when the systems are
introduced into the field, they are met with an unwillingness to adapt on the part of the
users and are quickly discarded. Instead, such technology needs to be designed alongside the
user to ensure a good fit. Technology created for the radiology workstation is no exception.
2.5.1 Limitations of an Imperfect System
As mentioned above, a system with a 99% accuracy rate is not as useful as it may seem.
Consequently, it is important to recognize the limited utility of imperfect systems, and the
need for developing ASR systems of even higher accuracy and/or compensatory software.
Although systems that do not meet the accuracy requirements should not be used in sensitive
10In the restricted domain of medicine, the noun treat would likely not have a representation.
areas (i.e. areas where errors could have serious consequences), they may still be useful in
some instances; when the nature of the accuracy problems is known, it is often possible
for the system to be applied to certain tasks confidently. For instance, a system that does
not give false negatives may be useful in searching tasks where a manual review is required
to remove the extraneous false positives [145].
Integrating with Existing Hospital Systems
By fully integrating ASR and automated summarization into the radiology workstation,
further delays in the system, as well as errors and operator fatigue, can be reduced. As pre-
viously mentioned, speech systems that are not linked directly with the existing information
systems, such as PACS, can introduce delays in the range of 20 seconds per report while
the radiologist scans the current report and then manually loads it into PACS [69]. This is
increased by an additional 20 seconds at the end of dictation while the radiologist navigates
the PACS menus to select a new case. Recall the radiologist from Section 2.3.3: he now
faces an increase of over two hours to his day. Moreover, it is possible to introduce serious errors
when a report is scanned into the ASR-based system but the incorrect report is called up
in PACS [69].
Initial studies by Hayt and his colleagues have suggested a time gain of nearly 40 seconds
by linking PACS to the ASR system. In this particular instance, when a case is opened in
PACS the corresponding ASR file is opened automatically. When the report is complete
the case can be signed off verbally and a new case is opened in PACS without the unwanted
navigation of menus. As Dr. Forster aptly states, “it is not true speech recognition until we
can put down the mouse” [44].
An ideal integration, as suggested by Dr. Eliot Siegel [132], would allow reports to be
opened based on dictated commands alone, such as “bring up the previous chest CT”, while
increased security could require the use of voice verification as well as a password. As the
sophistication increases, information from prior studies could be imported from PACS into
the present report. Systems involving computer-aided diagnosis (CAD) could also be added
[160], providing a reference tool for the examiner, and helping to ensure that nothing is
overlooked.
2.6 Pushing the State of the Art
2.6.1 Overcoming Challenges
There are many challenges facing researchers in the area of automated interpretation. Cur-
rently, there is no metric for the comparison of existing systems and their performance,
making objective analysis difficult. In addition, systems face the challenge of limited do-
main knowledge; a system that is too broad is over-general and suffers a loss of accuracy
[45], while a system that is insufficiently general may not provide enough coverage for the
domain at hand. Furthermore, there is a clear need for standards in the representation of
medical data, including output formats and the report itself.
Most crucially, if a system is to be deployed in a medical setting where it is responsible
for handling sensitive data, it must have extremely high accuracy. This includes a robust
means for handling ambiguity, negation and errors. If a report is returned to a requesting
physician mistakenly identifying a disease or lack thereof, the consequences could be fatal.
The system must also have a strong integration with the existing hospital information system
and PACS (and potentially any ASR system in place).
By building a successful foundation now, it will be possible to fully integrate systems
hospital-wide, from radiology to paediatrics, while making information available across the
country and beyond via the Internet. Accurate statistics on past cases could then easily be
collected and used for research, patient care and decision support.
2.7 Summary
The lure of time and cost efficiency, and improved patient care, is ensuring that healthcare-
related applications in artificial intelligence will continue to grow. Within radiology, this
includes the eventual replacement of transcriptionists with ASR systems, and the addition
of automated interpretation systems in the radiology department. Unfortunately, the low
accuracy rates, among other challenges, are preventing the wide-scale deployment of ASR in
lieu of traditional dictation. In the remaining chapters, a closer examination of ASR and the
nature of recognition errors is provided, followed by a solution to the problem of accuracy in
ASR, namely a hybrid error-detection methodology. This will be corroborated with a proof
of concept in radiology reporting, as well as a demonstration of the greater context of this
work beyond medicine.
Chapter 3

A Classification of Error-Detection Methods
Although accuracy is one of the limiting factors in the widespread introduction of auto-
matic speech recognition (ASR) in radiology, there is little if any work specifically on error
detection in this domain. Nonetheless, work in other contexts, such as spoken dialogue
systems [91], is useful for creating a methodology of error detection that is applicable to the
overriding problem of ASR in radiology dictation.
I develop an original classification for error-detection methods in ASR. Since one does
not presently exist in the literature, this sets the groundwork for future endeavours to be
objectively measured. This chapter presents this classification, and provides examples from
the literature where they exist. First, though, an introduction to speech recognition is
presented to help familiarize the reader with the relevant concepts and terminology.
3.1 Background
3.1.1 The Stages of Error Handling in Speech Recognition
The handling of recognition errors can be broken down into techniques applicable at various
levels throughout the recognition process [150]:
Error Prevention Preventing a recognition error altogether.
Error Prediction Detecting the likelihood of errors based on weaknesses in the system.
Error Detection Identifying recognition errors that have occurred.
Error Recovery This can be broken down into the following stages:
Diagnosis of Cause Identifying the sources of the error to guide error correction.
Error Correction Choosing and implementing the error correction strategy and in-
forming the user of changes made.
Error Handling Feedback Where relevant, the performance at the error detection and/or
correction level is collected for future applications (for example, machine-learning
methods).
3.1.2 On the Nature of Recognition Errors
As outlined in Kukich [87], there are five levels of text-based errors:
1. Lexical/Structural
2. Syntactic
3. Semantic
4. Discourse
5. Pragmatic
It is not possible for the speech recognizer to introduce errors at the discourse or prag-
matic level since no recognizer-level processing occurs at these levels1. Furthermore, since all
words are produced from a pre-defined lexicon, lexical errors are also not possible. Depend-
ing on the domain, however, errors pertaining to the misrecognition of specially formatted
lexical items may arise. This is frequently seen in the interpretation of radiology reports
where complex lexical items such as “L4/5”, representing the fourth and fifth lumbar ver-
tebrae, are misinterpreted as “L for/five”, for example, or the lexical representation of the
numbers is erroneously substituted for the orthographic representations (i.e. “four” versus
“4”). While these remain correct lexical elements, such errors nonetheless seem to sit below
1Exceptions to this are errors that follow as a side effect of errors at the syntactic or semantic level.
[Figure 3.1: The relevant, overlapping error levels in radiology (structural, syntactic, and semantic errors shown as overlapping regions).]
the level of syntactic and semantic errors. To refer to these instances, I have used the term
“structural”, which represents such errors as a subset of lexical errors.
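A minimal sketch of how such structural errors might be flagged is given below. The pattern and the example sentence are invented for illustration; a real detector would cover the full range of formatted lexical items used in a given reporting domain.

```python
import re

# Hypothetical sketch: flag candidate structural errors in ASR output by
# spotting spelled-out fragments where a formatted lexical item such as
# "L4/5" (the fourth and fifth lumbar vertebrae) was expected.
# The pattern below is illustrative only.
SPELLED_OUT = re.compile(r"\bL\s+(?:for|four)\s*/\s*(?:five|5)\b", re.IGNORECASE)

def flag_structural_errors(text):
    """Return a list of substrings that look like mangled 'L4/5' items."""
    return [m.group(0) for m in SPELLED_OUT.finditer(text)]

print(flag_structural_errors("Disc narrowing at L for/five is noted."))
# → ['L for/five']
```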
The structural, syntactic and semantic error levels overlap in instances where a recog-
nition error is recognizable as an error across more than one level. For example, the mis-
recognition “See four/5” is both a structural error and a semantic error (and potentially a
syntactic error, depending on the surrounding sentence). Figure 3.1 shows the overlapping
error coverage of these levels.
Considering the specific needs of radiology reporting, a further evaluation is offered,
applicable to all error levels and reflecting the inherent strength of the error. “Weak”
errors result in little or no change in the overall semantics and thus no shift in the report
interpretation. For example, the omission of a determiner rarely causes enough semantic
damage to be misinterpreted by the clinician. “Strong” errors, however, cause a major shift
in the semantics. Such errors may be readily identifiable as outliers within the domain,
for example the word “elephant” appearing in a radiology text; or may be inconspicuous
and hard to detect, for example, the substitution of one medical term for another that may
still be valid in that context. Kanal et al distinguish such errors with respect to radiology
reports according to the following four levels [83]:
Class 0 No change in meaning with respect to the original report.
Class 1 No change in meaning, but text is grammatically incorrect.
Class 2 Change in meaning, but error obvious.
Class 3 Change in meaning, but error subtle.
The authors group error classes 2 and 3 together as “significant” errors, with class 3 errors
further considered “subtle significant” [83]. As with all ultimately subjective measures, however,
there is a risk of inconsistency, and caution should be exercised when relying on these sorts of
descriptors. Differences between institutions such as the reading-room environment, user
variability, report quality, and existing infrastructure can all affect report quality and the
nature of the errors found in dictated reports. Consequently, a rigorous definition and
accounting of errors is difficult. For consistency, throughout this document any discrepancy
from the correct or reference report will be treated as a recognition error.
In general, there are six recognition error types that can cause errors at the structural,
syntactic or semantic level:
Stop Word Errors Any error involving a stop word (i.e. words with low semantic load,
such as prepositions, determiners, et cetera). In general, stop words can result in
errors at the syntactic or semantic level.
Merge Errors Two or more words erroneously recognized as a single word [121]. E.g.
“wreck a nice” → “recognize”.
Split Errors A single word erroneously recognized as two or more words [121]. E.g. “rec-
ognize” → “wreck a nice”.
Substitution Errors The replacement of one word by another [54].
Insertion Errors The insertion of a word that is not part of the original utterance [54].
Deletion Errors A word in the original utterance that does not appear in the final ASR
output.
Deletion errors are difficult to detect as they typically leave little record of their absence.
Similarly, the detection of stop word errors is also difficult due to their prevalence in the
language and the small semantic role they play. As a result, many error-detection systems
focus on the remaining four error types. By incorporating a range of error-detection methods,
however, it is possible to draw the complementary strengths of each, such as deletion
detection, into a single error-detection system, as will be shown in this dissertation. A more
detailed discussion of this and the role of stop words in error detection follows in Chapter
4.
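The substitution, insertion, and deletion error types above are conventionally identified by aligning the ASR output against a reference transcript using a word-level edit distance (the same alignment that underlies the word-error rate discussed later). The following sketch, with an invented example sentence, labels each discrepancy:

```python
# Sketch: label substitution, insertion, and deletion errors by aligning
# ASR output against a reference transcript with word-level edit distance.
def align_errors(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match/substitution
    # Trace back through the table to recover the error labels.
    errors, i, j = [], len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] and ref[i - 1] == hyp[j - 1]:
            i, j = i - 1, j - 1                    # correct word
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            errors.append(("substitution", ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            errors.append(("insertion", None, hyp[j - 1]))
            j -= 1
        else:
            errors.append(("deletion", ref[i - 1], None))
            i -= 1
    return list(reversed(errors))

print(align_errors("no acute fracture is seen", "no cute fracture seen"))
# → [('substitution', 'acute', 'cute'), ('deletion', 'is', None)]
```

Note that merge and split errors surface in such an alignment only indirectly, as combinations of substitutions with insertions or deletions, which is one reason they are harder to characterize.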
3.2 A Brief Introduction to Automatic Speech Recognition
In the space of little more than a decade, automatic speech recognition (ASR) has advanced
from discontinuous, or isolated-word systems, for which users are required to clearly separate
each spoken word by a pause, to continuous recognition systems in which users are able to
speak “freely”. Current systems can achieve accuracy rates as high as 99% and have seen
application in a variety of tasks including automated call processing, driver commands in
vehicles, and sub-titling for live sporting events. Within the radiology department, ASR
allows clinicians to dictate their reports directly into the computer, avoiding the need for
note-taking or transcriptionists.
ASR can be largely divided into four core technologies [99]:
1. Synthesis of human-readable characters into speech;
2. Speaker identification and verification;
3. Recognition of human speech; and
4. Natural language understanding.
Speech synthesis, or text-to-speech, allows computers to produce spoken output based on
text as input. In speaker identification and verification, speech input is used to authenticate
or identify a particular speaker. Perhaps of greatest interest to medical language processing
(MLP), though, is the recognition of human speech and natural language understanding.
Throughout this document, “ASR” is used to refer exclusively to the recognition of human
speech, while “NLU” is used to differentiate natural language understanding.
[Figure 3.2: The noisy channel model: a source sentence passes through a noisy channel, and a decoder produces a sentence guess. Based on Jurafsky and Martin, Figure 7.1 [81, page 237].]
3.2.1 Recognizing Human Speech
In general, ASR systems function on the basic premise of the probabilistic noisy channel
architecture [81]. Acoustic input is treated as if it is a “noisy” version of the source sentence,
and is correspondingly decoded in an effort to find the “true”, underlying sentence, as shown
in Figure 3.2. Required at the decoder level is a search algorithm that searches the space of
all possible sentences in order to find the best match for the noisy input, i.e. the sentence
with the highest probability [81]. As a side effect of the decoding process, a hypothesis list
for each utterance is produced, where utterance can be represented at the sentence, word,
or phone level. As will be shown in Section 3.5, the “N-best” of these hypotheses can be
used to assist certain error-detection methods.
Popular statistical decoding algorithms include the Viterbi algorithm [96] and Hidden
Markov Models [96]. In non-statistical methods, decoding templates are used to identify
recognition candidates; a database of sound patterns is stored as sequences of frames to
which the input sound frames are compared. The output from these decoders is limited
by the acoustic and language models that restrict the set of possible utterances. Acoustic
modeling relies on acoustic properties of the language, while language modeling relies on
properties of the domain and the language structure itself.
Speech Recognition in Radiology
Within the context of radiology, six key requirements for the successful integration of ASR
in the reading room are identified [Mehta et al, 1998]:
1. Integration with existing hospital information systems (HIS);
2. Availability of “canned” or pre-stored reports (such as a normal chest X-ray) and
templates (standardized report forms);
3. Allowable additions to a completed report even after it is “signed”;
4. User-defined fields to maintain flexibility and control over the report setup;
5. Barcode interface (this also relates to the integration with HIS); and
6. Security of patient information (e.g. password protection of sensitive materials).
Beyond the software level, a successful ASR system is also reliant on the hardware
supporting it [99, 83, 160]. One example is a high-tech microphone with noise-canceling
capabilities. Even in the quiet of a radiology reading room, there exist ambient noises from
people and equipment that can result in unwanted input to the system. In addition, the
computers supporting the speech recognizer must be powerful enough to avoid delays and
other complications, and preserve the workflow of the radiologist. Without such equipment,
there is risk of further errors and frustration to the user.
Current ASR Systems
Unfortunately, comparative studies of ASR systems in medicine are rare. In 2000, Devine
[38] performed a comparison study of three systems as they performed “right out of the box”,
that is, with the bare minimum of required training. After examining the performance of
IBM ViaVoice 98, Dragon NaturallySpeaking Medical Suite, version 3.0, and L&H’s Voice
Xpress for Medicine, General Medicine Edition, version 1.2, he concluded that ViaVoice
significantly outperformed the other two systems in consistent recognition accuracy. Although
Devine was careful to point out that later versions of these software programs might render
his results obsolete, a similar study was released showing IBM ViaVoice again significantly
outperforming the Dragon NaturallySpeaking Medical Suite, version 5.0, this time on French
medical dictations [63], suggesting that Devine’s earlier conclusions may still be valid. In
addition, the Canada Diagnostic Centre (a local radiology clinic) has been working with
Dragon NaturallySpeaking Medical Suite (version 8.0)2, and has had numerous complaints
with respect to low accuracy rates. Radiologists at the clinic estimate that as few as 10% of
dictated reports are ever error-free [43, 44].
Although other companies have also developed ASR systems, there are no impartial,
comparative studies available at this time. It is clear that further studies comparing the
2Version 8.0 was installed in January 2006, as an upgrade from Version 7.3.
recognition rates, as well as dictation/correction rates, of currently available systems are
needed before any qualitative evaluation and discussion is possible. Regardless, the current
state of ASR performance is inadequate for the purposes of radiology reporting.
3.2.2 Natural Language Understanding
In many respects, the ability of computers to understand and communicate freely with
humans is the defining technology of artificial intelligence. The area of natural language
understanding, or NLU, is at the very root of this freedom of communication.
Central to any NLU system is the translation of natural language input into a machine-
readable format, where “machine-readable” refers to data that can be processed by a com-
puter. While computer “understanding” of a text does not have the same connotations as
with a human, it should entail the ability to process data in order to interact with people
in a more intelligent manner. This means having some internal structure for the concepts
present in the natural language input, a means to extract those concepts, and finally a way
to reason about them.
For the purposes of this thesis, the focus is exclusively on the recognition of human
speech, leaving NLU as a separate area of pursuit. In Chapter 5 future possibilities are
discussed for later advancements of error detection and correction using NLU.
3.3 Confidence Scoring
In general, speech recognizers can be evaluated on the basis of their recognition accuracy.
This is commonly determined via the word-error rate (WER)3, a measure of the differences
between a recognized string and an actual utterance measured at the word level [81]. It
is possible, however, to determine a ranking for the individual components of a recognized
string in the form of a confidence score that directly represents the probability that that
string is correct. By modeling a recognized string or text in this fashion, it is possible to
direct error detection and correction more intelligently4.
In general, a confidence score reflects the overall result of a set of confidence measures.
3See also Section 4.3.
4Note that confidence accuracy is not equivalent to recognition accuracy. A speech recognizer can have poor recognition accuracy while the confidence accuracy is high; that is, low confidence rankings are correctly assigned to the erroneous recognizer output.
Typically these measures reflect statistical properties of the acoustic model and the language
model, divided into the phonetic, utterance, and word levels. Overall, “the features which
are utilized are chosen because, either by themselves or in conjunction with other features,
they can be shown to be correlated with the correctness of a recognition hypothesis.” [71,
page 2].
In some studies [70], [71], the researchers compute word-confidence scores, based pri-
marily on acoustic qualities, as a post-processing stage following speech recognition. These
measures are combined into a single feature vector which is then compressed via a projec-
tion vector to obtain the final confidence score. This confidence score is expressed as the
following (where ~p is the projection vector, ~f is the feature vector, c is the “raw confidence
score” [70, page 2], and T denotes the vector transpose):

c = ~p^T ~f    (3.1)
The researchers set a threshold value to “adjust the balance between false acceptances of
misrecognized words and false rejections of correctly recognized words” [70, page 2]. The
projection vector relies on a minimum classification error (MCE) technique. Nonetheless,
while such a simplistic approach worked well (reducing the false acceptance rate of mis-
recognized terms by as much as 25% in some cases), a more powerful classifier such as an
artificial neural network may ultimately prove more successful [70].
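The projection-vector confidence score of Equation 3.1 can be sketched as follows. The feature values, projection weights, and threshold below are invented for illustration; in the cited work the projection vector is trained with a minimum classification error technique rather than set by hand.

```python
# Sketch of word-confidence scoring in the style of Equation 3.1: the
# raw confidence score is the inner product of a per-word feature vector
# with a projection vector, then compared against a rejection threshold.
# All numbers here are invented for illustration.
def confidence(features, projection):
    """c = p^T f, the raw confidence score."""
    return sum(p * f for p, f in zip(projection, features))

def accept(features, projection, threshold):
    """Accept the word only if its confidence clears the threshold."""
    return confidence(features, projection) >= threshold

# Hypothetical per-word features: acoustic score, language-model score,
# and N-best agreement, each normalized to [0, 1].
projection = [0.5, 0.3, 0.2]
word_features = [0.9, 0.6, 1.0]
print(round(confidence(word_features, projection), 2))   # → 0.83
print(accept(word_features, projection, threshold=0.7))  # → True
```

Raising the threshold trades false acceptances of misrecognized words for false rejections of correctly recognized ones, which is exactly the balance the researchers tune.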
In another study [162], the researchers use the posterior probability of a word “given all
acoustic observations of the utterance”, as an indicator of confidence. They discovered a
relative reduction in confidence error rate between 19% and 35%.
In general, the confidence of a particular utterance is reflected in its N-best score – the
score that either the decoder assigns to the decoded utterance, or is later assigned by a
separate error-detection algorithm [37].
Although these methods are exclusively statistical, non-statistical, rule-based methods
of confidence ranking are also possible that do not rely on the internal ranking of the ASR
system.
The usefulness of confidence measures can be seen in their ability to direct the focus
to potentially problematic areas of a text. Such measures can be used as indicators of the
possibility of errors in areas of low confidence, and when applied with a threshold value, to
tag those words whose confidence ranking is too low.
3.4 A Classification of Error-Detection Methods for Speech
Recognition
The short survey in the previous sections provides the information needed to propose a
classification of error-detection methods for ASR. Such a classification will make it possible
to discuss aspects of error detection in a more formal and controlled manner (avoiding the
ad-hoc discussions that currently characterize the literature), as well as to compare and contrast
not only specific methods, but categories of methods as well.
3.4.1 The Classification
Error-detection methods in speech recognition can be divided into two broad categories:
• Non-Black-Box Methods
• Black-Box Methods
In Black-Box Methods, the internal recognizer information (the utterance hypothesis
list produced by the decoder) is completely inaccessible. In other words, the recognizer is
opaque, or a “black box”, for which we see only the input and the output. In Non-Black-
Box Methods, the recognizer is transparent, allowing us to access the internal ranking
information that the recognizer uses in producing its output.
Each category can be further classified into the following:
• Probabilistic Approaches
• Non-Probabilistic Approaches
• Hybrid Approaches
As we will see, there are advantages and disadvantages to working with or without
the black-box assumption. We next look at the various possibilities for non-black-box and
black-box error detection according to our classification.
3.5 Non-Black-Box Methods
This section presents a closer look at the possibilities for error detection in non-black-
box (NBB) methods, including examples of their application in the reference literature
wherever possible. NBB methods refer to error-detection systems that interface directly
with the speech recognizer. In these instances the internal ranking information that the
recognizer uses in producing its output is accessible. For example, given a Viterbi decoder
such information will take the form of likelihood ratios [32]. This information can be used
in a variety of ways, including comparison to a second decoder or a recognizer running in
parallel; input to a classifier such as Hidden Markov Models [81]; or in combination with
higher level analyses, such as the semantic level. The result is a measure of confidence in the
speech recognizer’s original output or an alternative output hypothesis for the utterance.
“N-best” Score
Given whatever model/decoder is used, in NBB methods the N-highest hypothesis scores
from the decoder can be used to create a list of the “N best” hypotheses corresponding to the
input segment. Such an “N-best list” can provide input to other error-detection algorithms
that will in turn “re-rank” this list, resulting in their own “N-best list”.
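The re-ranking idea can be sketched with a toy example. The hypotheses, decoder scores, interpolation weight, and the external scorer itself are all hypothetical; in practice the external score would come from one of the error-detection methods described in this chapter.

```python
# Sketch: re-rank a decoder's N-best list by combining each hypothesis's
# decoder score with a score from an external error-detection module.
# Hypotheses, scores, and the scorer below are invented for illustration.
def rerank(nbest, external_score, weight=0.5):
    """nbest: list of (hypothesis, decoder_score); returns a new N-best list."""
    rescored = [(hyp, (1 - weight) * s + weight * external_score(hyp))
                for hyp, s in nbest]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

# Toy external scorer: prefer hypotheses containing in-domain terms.
DOMAIN_TERMS = {"vertebrae", "lumbar"}
def domain_score(hyp):
    words = hyp.split()
    return sum(w in DOMAIN_TERMS for w in words) / len(words)

nbest = [("L for five lumbar", 0.70), ("L4/5 lumbar vertebrae", 0.65)]
for hyp, score in rerank(nbest, domain_score):
    print(f"{score:.3f}  {hyp}")
```

Here the domain-aware score promotes the second hypothesis over the decoder's original top choice, illustrating how an external module can override the recognizer's internal ranking.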
3.5.1 Probabilistic Approaches
As Gillick et al [54] observe, the most basic probabilistic confidence measure in a speech
recognizer’s output is simply the result of a long-term average over the performance of the
recognizer itself: the percentage error rate, p, collected over some timeframe, t. This
naïve approach has many failings, not the least of which is the failure to account for the effect
of the surrounding words on the resulting probabilities. The following sections examine the
efforts to refine this technique and create more intelligent probabilistic approaches for error
detection and confidence ranking.
Language Modeling
Recall that in ASR, the decoder output possibilities are limited by the language and acoustic
models in place describing the probability of a particular utterance. Essentially, through a
variety of statistical techniques, it is possible to estimate the probability of a word occurring
based on the previous words recognized. An early attempt at a more intelligent use of such
statistical language models was Kuhn [86]. Kuhn observed that the likelihood of a word
was higher if it had been spoken recently, suggesting a trend of coherence throughout a text
that could be exploited by weighing more recent words more heavily.
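Kuhn's recency observation can be sketched as a cache language model that interpolates a base unigram probability with a recency-weighted cache. The base probabilities, cache size, and interpolation weight below are invented for illustration.

```python
from collections import deque

# Sketch of a cache language model in the spirit of Kuhn's observation:
# a word's probability is boosted if it has appeared recently. The base
# probabilities, cache size, and mixing weight are invented.
class CacheLM:
    def __init__(self, base_prob, cache_size=100, mix=0.2):
        self.base_prob = base_prob            # fallback unigram model
        self.cache = deque(maxlen=cache_size)  # most recent words seen
        self.mix = mix                         # weight on the cache term

    def prob(self, word):
        cache_p = self.cache.count(word) / len(self.cache) if self.cache else 0.0
        return (1 - self.mix) * self.base_prob.get(word, 1e-6) + self.mix * cache_p

    def observe(self, word):
        self.cache.append(word)

lm = CacheLM({"effusion": 0.001, "the": 0.05})
before = lm.prob("effusion")
for w in "a small effusion is seen effusion persists".split():
    lm.observe(w)
after = lm.prob("effusion")
print(after > before)  # → True
```

Because the cache is a bounded queue, the boost decays as the discourse moves on, which matches the intuition that textual coherence is a local phenomenon.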
In general, the language model can be enforced using algorithms that reflect the state
of the domain, such as Kuhn’s weighted algorithm above. For example, using posterior
probabilities and Bayes’ Theorem, we can determine the optimum word sequence, W [77],
as shown in Equation 3.2.
W = argmaxW P(W|O) = argmaxW P(W)P(O|W)    (3.2)
Here W = w1, w2, ..., wn is a candidate word sequence, and O = o1, o2, ..., on is the utterance,
or output sequence from the speech recognizer. P(W) and P(O|W)5 are the source model
and channel model, respectively6. P(W) can be determined via Equation 3.3:
P(W) = ∏i P(wi|w1,i−1)    (3.3)
The condition w1,i−1 refers to the words occurring prior to the target word, wi. Based on the
assumption that the ASR output words are independent, we have the following Equation
(3.4) [77].
P(O|W) = ∏i P(o1,i|w1,i) = ∏i P(oi|wi)    (3.4)
Thus, the optimum sequence, W, becomes, finally, Equation 3.5 [77].

W = argmaxW ( ∏i P(wi|w1,i−1) ∏i P(oi|wi) )    (3.5)
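Equation 3.5 can be sketched with tiny toy models. The bigram probabilities, channel (confusion) probabilities, and candidate lists below are all invented for illustration; a real decoder would search this space with dynamic programming rather than by enumeration.

```python
import math
from itertools import product

# Toy sketch of Equation 3.5: choose the candidate word sequence W that
# maximizes prod_i P(wi|w_{i-1}) * prod_i P(oi|wi), using a bigram
# source model and an independent channel model. All probabilities and
# candidates below are invented for illustration.
BIGRAM = {  # P(word | previous word); "<s>" marks sentence start
    ("<s>", "no"): 0.5, ("no", "acute"): 0.4, ("no", "cute"): 0.01,
    ("acute", "fracture"): 0.5, ("cute", "fracture"): 0.01,
}
CHANNEL = {  # P(observed word | intended word): confusable pairs
    ("no", "no"): 0.9, ("cute", "acute"): 0.3, ("acute", "acute"): 0.6,
    ("cute", "cute"): 0.6, ("fracture", "fracture"): 0.9,
}
CANDIDATES = {"no": ["no"], "cute": ["cute", "acute"], "fracture": ["fracture"]}

def best_sequence(observed):
    best, best_lp = None, -math.inf
    for words in product(*(CANDIDATES[o] for o in observed)):
        lp, prev = 0.0, "<s>"
        for o, w in zip(observed, words):
            lp += math.log(BIGRAM.get((prev, w), 1e-6))   # source model
            lp += math.log(CHANNEL.get((o, w), 1e-6))     # channel model
            prev = w
        if lp > best_lp:
            best, best_lp = list(words), lp
    return best

print(best_sequence(["no", "cute", "fracture"]))  # → ['no', 'acute', 'fracture']
```

Even with a weak channel preference for the observed word "cute", the bigram source model recovers "acute" because "no cute fracture" is a far less probable word sequence.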
In Allen et al, and Ringger and Allen [1, 121], the authors rely on the likelihood of recog-
nition errors, as well as statistical data such as co-occurrences and word N-grams. N-grams
refer to the divisions representing the N words occurring in the context of the target word.
A “unigram” then refers to the word itself, a “bigram” to a two-word pairing, and so on
[77]. They observe that the assumption of independence above is an oversimplification that
neglects split or merge errors. Instead they permit a small window, such as P(o_{i−1}, o_i | w_i) or
P(o_i | w_i, w_{i+1}), that allows the system to make predictions based on the surrounding words,
theoretically mitigating the merge/split problem [121].
5 The probability of an accidental word-to-word transformation.
6 Note that the denominator P(O) predicted by Bayes' Theorem can be dropped, as its value is constant and independent of W.
In Jeong et al [77], however, the authors observe that the methods employed by Ringger
and Allen [121] do not show the expected increase in accuracy. They suggest that data
sparseness is to blame, owing to the large number of word-level correction pairs needed to
adequately characterize the search space7. To collect such pairs requires a prohibitively
large amount of training data. Instead, the authors propose the collection of sub-word
(i.e. syllable) correction pairs to overcome data-sparseness. By breaking up the words, it is
possible to achieve a greater number of correction pairs with the same amount of training
data. This syllable-channel model is shown in the following equation [77, 121]:
W = argmax_W ( P(W) P(X|W) P(S|X) )    (3.6)

where X is the source syllable sequence, P(X|W) is the word model, and P(S|X) is the
probability of a syllable-to-syllable transformation. Jeong et al demonstrated a 6-7% increase
over their baseline recognizers using the syllable-based method, tested on a Korean
question-answering system.
Alternative methods exist based on other probabilistic techniques, such as Hidden Markov
Models [93]. All of them share in common the notion that the previous words in a sequence
carry important information about the probability of the current or upcoming word.
3.5.2 Non-Probabilistic Approaches
Higher Level Feature Analysis
In addition to low-level lexical and statistical information, higher level information such as
prosodic features can be used alongside the recognizer’s own confidence score. In the case of
dialogue systems, Litman [95] observes that when people re-state their utterance they often
over-emphasize their words (a prosodic change), leading to poor recognition accuracy. Fur-
thermore, differences in gender, age, native-speaker status, and even temporary influences
such as colds, can affect speaker prosody. Based on such prosodic features as utterance
duration and speaker rate, Litman used a machine-learning algorithm to learn if-then-else
rules, which classify a recognition as correct or incorrect. Used in combination with the
ASR confidence measures, she was able to increase the overall accuracy of the system over the
acoustic score alone [95].
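Litman's learned if-then-else rules can be sketched as follows. The feature names and thresholds are invented stand-ins for rules a machine learner would induce from labelled data, combined here with the recognizer's own confidence score:

```python
def classify_recognition(features):
    """Toy if-then-else rules of the kind Litman's learner induces.

    `features` holds prosodic measurements plus the recognizer's own
    confidence score; all thresholds are invented for illustration.
    """
    if features["asr_confidence"] < 0.3:
        return "incorrect"
    # Over-emphasized re-statements tend to be long, slow utterances.
    if features["duration_sec"] > 3.0 and features["speaking_rate"] < 2.0:
        return "incorrect"
    return "correct"

assert classify_recognition({"asr_confidence": 0.9,
                             "duration_sec": 1.2,
                             "speaking_rate": 4.5}) == "correct"
assert classify_recognition({"asr_confidence": 0.8,
                             "duration_sec": 4.0,
                             "speaking_rate": 1.5}) == "incorrect"
```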
7 The problem of data sparseness is addressed again in Chapter 6.
In addition to prosody, features at the semantic and syntactic level can also be accessed.
Lieberman et al [93] use semantic information to re-rank the recognizer’s hypothesis list.
Those hypotheses that are semantically relevant to the context in which the utterance occurs
are moved higher in the rankings. They give the example of “my bike has a squeaky brake”.
Initially, the recognizer will select “break” due to its higher individual word probability,
instead of “brake”. Given the context of “bike”, however, the system is able to determine a
set of related concepts, using the semantic network ConceptNet [93]. Of this set “brake” is
a member, but “break” is not. Lieberman et al observe that by relying on a smaller corpus
of semantic knowledge only, the smaller amount of data along with greater natural language
processing means that a larger context can be considered in an N-gram model without
becoming intractable. Statistical techniques, in contrast, rely on low-order N-grams of no
more than two or three words. Using the method of commonsense reasoning to re-rank the
candidate hypotheses, Lieberman et al estimate an overall 17% reduction in errors (based
upon a post-analysis of the actual dictation errors) [93].
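The re-ranking step can be sketched as below, with a plain Python set standing in for the related-concept set that ConceptNet would return for the context word "bike":

```python
def rerank(hypotheses, context_concepts):
    """Move hypotheses containing context-related words up the N-best list.

    `hypotheses` arrive ordered by the recognizer's own score; a hypothesis
    earns one point per word related to the context. A stable sort keeps
    the recognizer's original order for ties.
    """
    def relatedness(hyp):
        return sum(1 for w in hyp.split() if w in context_concepts)
    return sorted(hypotheses, key=relatedness, reverse=True)

# Stand-in for the concepts ConceptNet relates to "bike".
related_to_bike = {"brake", "wheel", "pedal", "squeaky"}
nbest = ["my bike has a squeaky break", "my bike has a squeaky brake"]
assert rerank(nbest, related_to_bike)[0] == "my bike has a squeaky brake"
```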
Parallel Recognizers
By aligning the output of the ASR word- or utterance-level recognition with a paral-
lel, phone-level recognizer, it is possible to identify inconsistencies which may indicate
errors [33]. For example, the speech recognizer may have decoded the following phone
sequence q1, q2, ..., qN(i) for word wi, while the phone recognizer identified the sequence
p1, p2, ..., pN(i) [33]. Comparing qi to pi can be useful in identifying error candidates, effec-
tively separating the language modeling component from the acoustic modeling component
(represented independently in the phone analyser). One advantage of such an approach is
that it avoids exclusive reliance on the decoder algorithm. Cox and Dasmahapatra found
that while the parallel phone recognizer produced statistically significant results, they did
not improve on the baseline N-best technique found in Gillick et al, 1997 [54], which relied
on the stability of a word’s position in the recognizer word lattice8.
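A minimal sketch of the comparison, assuming the two phone sequences have already been aligned word by word (real systems need an explicit alignment step, since the two recognizers may segment the audio differently):

```python
def flag_phone_mismatches(words, decoder_phones, phone_recognizer_phones):
    """Compare the decoder's phone sequence for each word with the output
    of an independent phone-level recognizer; disagreement marks the word
    as an error candidate."""
    candidates = []
    for word, q, p in zip(words, decoder_phones, phone_recognizer_phones):
        if q != p:
            candidates.append(word)
    return candidates

# Invented ARPAbet-style phone sequences for illustration.
words = ["knee", "effusion"]
decoder = [["n", "iy"], ["ih", "f", "y", "uw", "zh", "ah", "n"]]
parallel = [["n", "iy"], ["ih", "f", "y", "uw", "s", "ah", "n"]]
assert flag_phone_mismatches(words, decoder, parallel) == ["effusion"]
```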
8 A lattice represents the probability of each word in the output sequence in terms of the probabilities of the preceding words [96].
3.5.3 Hybrid Approaches
Since NBB methods access the internal ASR confidence measures, any system that combines
a typical recognizer (relying on statistical decoding) with an error-detection method that
uses non-statistical features to re-rank utterance hypotheses is by default a hybrid approach.
For example, Lieberman’s approach described above adds higher level semantic knowledge
in order to re-rank the ASR output.
The goal with hybrid approaches is to take advantage of the strengths of both the prob-
abilistic and non-probabilistic approaches, while using their complementary error coverage
to balance out their weaknesses.
Jeong et al increase domain-specific recognition by combining their syllable-channel
model, described in 3.5.1, with a semantic analysis that is sensitive to both semantic and
lexical errors [77]. At the semantic level, they obtain the necessary semantic information
from their own generated domain dictionary, and more general thesauri. Lexico-semantic
patterns, or LSPs, are collected into a template database based on abstractions of partic-
ular word sequences found in the training data. Queries are mapped to their own LSPs
and then matched to the template LSPs when an error is suspected. Templates with the
minimum distance from the query LSP are selected as replacement candidates. On its own,
the LSP method gave a 4% increase in accuracy over the baseline method, and a 6-8%
increase over the baseline when combined with the syllable model and tested on a Korean
question-answering system.
Similarly, in Cox and Dasmahapatra 2002, latent semantic indexing as a measure of term
similarity is combined with N-best ranking (as in Gillick et al, 1997 [54]) and is shown to
be an improvement over either technique individually [33].
3.6 Black-Box Methods
This section examines black-box methods for error detection. By assuming a black-box
scenario, where the internal rankings of the ASR software are unavailable, the drive is to
develop post-processing solutions that avoid the complications of proprietary software.
Furthermore, such solutions are not restricted to a particular software suite, and so the
system can handle input from any recognizer. This in turn better reflects
the varying needs of reading rooms supporting different vendor software packages. As Cox
and Dasmahapatra observe [33, 32], the performance of methods relying on ASR-dependent
information may vary based on the ASR system or decoding algorithm being used.
3.6.1 Probabilistic Approaches
The unifying theory underlying probabilistic approaches is the understanding that human
languages are probabilistic entities, rather than fixed and absolute. Competence in the lan-
guage is therefore experience-based, depending on the frequency of observation of linguistic
and linguistic-related events. When applied to natural language processing, the research
focus is to automatically identify the frequency of events in a text, and use that information
to predict the features of novel texts.
This can readily be applied to error detection given the assumption that ASR errors
“occur in regular patterns rather than at random” [82]. Given a corpus of natural language
texts on which a system can train, it is possible to identify these patterns, and use their
frequencies to assess future texts. If the corpus is large enough to be a representative sample
of the domain of discourse, then those frequencies can be extended beyond the corpus to
the entire domain. In considering a novel text, given the observation of a new event, such as
a word occurring in a particular environment, if that event is sufficiently improbable based
on the training data, the most likely explanation is a recognition error.
The following sections describe common probabilistic tools for language analysis as applied to
error detection, with examples in the research literature where possible.
Latent Semantic Indexing
Latent Semantic Indexing9 (LSI) uses the co-occurrence of terms to determine the degree of
relatedness between them. Two terms co-occur if one occurs within the context of the other,
where “context” refers to the surrounding words. The general idea is that variability in word
choice due to synonymous words and phrases can make it difficult to identify semantically
related documents [96]. If each term in the domain and each document is represented
in multi-dimensional space, by restricting co-occurring terms to the same dimension we
can reduce the total number of dimensions overall and thus the noise. The result is a
compressed space with “latent” semantic dimensions in which document or term similarity
can be measured via vector cosine measures. The reduction of co-occurring terms to semantic
9 Often referred to as Latent Semantic Analysis.
dimensions means that it is possible to determine the similarity between documents, even
when they have minimal terms in common [96].
Consider the following example provided by Manning and Schutze [96] in Table 3.1.
Given our query, if we rely on keyword search alone, only Document 1 will be returned. Since
the terms “HCI” and “interaction” co-occur in Document 1 and Document 2, however, it is
likely that Document 2 is also related to the query. LSI allows us to determine a measure of
just how semantically related two terms are, or, by extension, two documents based on the
terms within, and thus provides a measure of similarity between the query and Document
2.
Table 3.1: An example of the usefulness of co-occurrence relations in determining similarity between documents and queries [96, Page 554]

             Term 1   Term 2      Term 3   Term 4
Query        user     interface
Document 1   user     interface   HCI      interaction
Document 2                        HCI      interaction
The notion of semantic similarity can be applied quite naturally to error detection if we
consider the assumption that many recognition errors are likely to be words that share little
semantic similarity with the neighbourhood of words in which they co-occur10.
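The Table 3.1 example can be worked through with a truncated singular value decomposition, which is the core of LSI. This sketch uses NumPy; a single latent dimension is kept only because the example is tiny:

```python
import numpy as np

# Term-document matrix for the Table 3.1 example:
# rows = user, interface, HCI, interaction; columns = Document 1, Document 2.
A = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [1.0, 1.0],
              [1.0, 1.0]])

# Truncated SVD compresses co-occurring terms into latent dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 1                                      # keep one latent dimension
docs_latent = (np.diag(s[:k]) @ Vt[:k]).T  # documents in latent space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Fold the query "user interface" into the latent space via U.
raw_query = np.array([1.0, 1.0, 0.0, 0.0])
query_latent = raw_query @ U[:, :k]

# Raw keyword overlap with Document 2 is zero, yet the latent-space
# similarity is high, because "HCI" and "interaction" co-occur with the
# query terms in Document 1.
assert cosine(raw_query, A[:, 1]) < 1e-9
assert cosine(query_latent, docs_latent[1]) > 0.9
```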
Co-Occurrence Relations
Co-occurrence relations are a statistical method for determining the number of times a word
occurs in a specific context [81, 96, 131]. Given a sufficiently representative training corpus,
words can be associated with particular contexts based on that corpus. These word-context
statistics can be applied to determine the probability of a word occurring in a given context
in a text. If that probability falls below a certain threshold, the word will be flagged as a
possible error. This technique was applied to the analysis of dialogue queries in [131] and
to radiology reports in [154]. This latter application is expanded on in detail in Chapter 5.
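A minimal sketch of this scheme, with an invented two-report training corpus; the window size and flagging threshold are illustrative choices:

```python
from collections import defaultdict

def train_cooccurrence(corpus, window=2):
    """Count how often each word appears within `window` words of another."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        for i, word in enumerate(sentence):
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][sentence[j]] += 1
    return counts

def flag_errors(sentence, counts, window=2, threshold=1):
    """Flag words whose context support falls below `threshold`."""
    flagged = []
    for i, word in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        support = sum(counts[word][sentence[j]] for j in range(lo, hi) if j != i)
        if support < threshold:
            flagged.append(word)
    return flagged

corpus = [["small", "pleural", "effusion"], ["large", "pleural", "effusion"]]
# "infusion" never co-occurs with "pleural" in training, so it is flagged.
assert flag_errors(["small", "pleural", "infusion"],
                   train_cooccurrence(corpus)) == ["infusion"]
```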
10 Cox and Dasmahapatra [32] used this assumption in their LSI algorithm for determining semantic confidence measures for recognizer output. They note that although it was a weak indicator of errors overall, LSI was nonetheless complementary to the basic decoder-only N-best list. Thus, by combining the semantic confidence measure and the N-best confidence measure they were able to improve over the baseline decoder measure [32].
Sarma and Palmer [131] use co-occurrence statistics to perform a context analysis on
the words in a query in order to detect and then correct errors. Given a query word, the
researchers determine the context window for that word, based upon its occurrence at the
centre. If the surrounding context words do not match the target word, they can be used
to identify misrecognition candidates, words for which the context words are appropriate.
From this list of candidates, the phonetic similarity between each word and the target word
is determined. If a candidate is both context-appropriate and phonetically similar to the
target, then it is considered likely that the target word was a misrecognized form of this
candidate [131].
Pointwise Mutual Information
Pointwise Mutual Information, or PMI, is a statistical measure of the degree of independence
between two variables and is defined in Equation 3.7 [96].
PMI(x, y) = log [ P(x, y) / (P(x) · P(y)) ]    (3.7)
Here P(x, y) is the probability of x and y co-occurring, while P(x) and P(y) are the individual
probabilities of x and y occurring, respectively. If P(x, y) is larger than the product of the
individual probabilities, P(x) · P(y), the two show a low degree of independence, with
P(x, y) = P(x) = P(y) being maximally dependent, and P(x) · P(y) = P(x, y) being
maximally independent [96].
As Manning and Schutze observe, measures of mutual information are particularly sen-
sitive to data sparseness [96]. Considering the case of maximum dependence above, where
two words only occur together, the value of PMI(x, y) becomes log(1/P (y)). This means
that the rarer the occurrence of (x, y), the higher the degree of mutual information. This
makes little sense, as words of higher frequency will be scored lower, despite the presence of
more evidence to support the score. Consequently, PMI is a poor measure of dependence.
For the purposes of ASR output in error detection, however, the focus is on the degree of
independence that one word shows from its surrounding context. Terms that demonstrate
a high independence are likely candidates for recognition errors. Inkpen and Desilets [75]
use this idea to determine errors in meeting transcripts. By establishing a target word's
neighbourhood of surrounding words, it is possible to calculate the PMI value for each
of those context words and compile them into a single value. This value represents the
“semantic coherence” (SC) of the target word. Those SC values that fall below a certain
threshold are then marked as indications of possible errors.
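Inkpen and Desilets' semantic-coherence score can be sketched as the average PMI of a word with its context words. The corpus, windowing, and averaging choices below are illustrative simplifications of their method:

```python
import math
from collections import Counter

def pmi_table(corpus, window=2):
    """Build unigram and within-window pair counts over a training corpus,
    and return a PMI function (Equation 3.7) over word pairs."""
    uni, pairs, total = Counter(), Counter(), 0
    for sentence in corpus:
        total += len(sentence)
        uni.update(sentence)
        for i, w in enumerate(sentence):
            for j in range(i + 1, min(len(sentence), i + window + 1)):
                pairs[frozenset((w, sentence[j]))] += 1
    def pmi(x, y):
        pxy = pairs[frozenset((x, y))]
        if pxy == 0:
            return float("-inf")   # never co-occurred in training
        return math.log((pxy / total) / ((uni[x] / total) * (uni[y] / total)))
    return pmi

def semantic_coherence(word, context, pmi):
    """Average PMI of `word` against its context words: a simplified form
    of the semantic-coherence (SC) score."""
    scores = [pmi(word, c) for c in context]
    return sum(scores) / len(scores) if scores else float("-inf")

corpus = [["pleural", "effusion"], ["pleural", "effusion"], ["knee", "joint"]]
pmi = pmi_table(corpus)
# "effusion" coheres with a "pleural" context; "joint" does not, so its SC
# score drops and it would be flagged as a possible error.
assert semantic_coherence("effusion", ["pleural"], pmi) > \
       semantic_coherence("joint", ["pleural"], pmi)
```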
In addition, the constrained vocabulary within radiology means that a smaller training
corpus is needed for more complete coverage, reducing the problem of data sparseness. Data
sparseness is discussed in more detail in Section 6.8.
3.6.2 Non-Probabilistic Approaches
Pattern Matching
A common, non-probabilistic, rule-based approach to error detection relies on the exploita-
tion of error patterns. By collecting a database of common error patterns relevant to a
particular language (or domain), it is possible to use rules to compare the ASR transcrip-
tion to this database. While such approaches are often very accurate within the domain of
the error database, they are nonetheless fragile. Any errors that do not have corresponding
templates in the database will be overlooked as the system cannot generalize beyond what
is known (i.e. what is in the database) [82, 77]. In addition, they are susceptible to false
positives in cases where correct words happen to occur in a known error context [77].
Kaki et al [82] developed an error correction system based on these principles. They
collected a database of common lexical errors and their corrections for Japanese. When a
string was encountered that matched an error template in the database, it was replaced by
the corresponding correct string.
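A minimal sketch of template-based correction in the spirit of Kaki et al; the error/correction pairs are invented radiology-flavoured examples, not entries from any real database:

```python
def correct_by_template(text, error_templates):
    """Replace any substring matching a known error pattern with its stored
    correction. Anything absent from the database passes through untouched,
    which is exactly the fragility noted above."""
    for error, correction in error_templates.items():
        text = text.replace(error, correction)
    return text

# Hypothetical error/correction pairs for a radiology error database.
templates = {"plural effusion": "pleural effusion",
             "me to static": "metastatic"}
assert correct_by_template("small plural effusion", templates) == \
       "small pleural effusion"
# Unseen errors are silently missed:
assert correct_by_template("pneumonia thorax", templates) == "pneumonia thorax"
```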
Conceptual Similarity
The comparison of concepts is necessary in a variety of human and machine reasoning
tasks, and allows high-level reasoning beyond the lexical and syntactic level. Importantly,
it is possible to derive a “quantitative similarity score between two concepts” [23, Page 77].
General semantic similarity techniques include the use of vector space measures and set
operations [96]. Although traditionally seen as relevant for vocabulary development and
maintenance, data mining, and decision support, in specialized domains such as medicine,
conceptual similarity is also applicable to error detection.
Given access to a hierarchical knowledge base there are two primary approaches to
determining conceptual similarity, namely edge-based and node-based similarity.
Edge-Based Similarity
In Caviedes and Cimino [23], the authors examine the problem of a conceptual distance met-
ric for the Unified Medical Language System (UMLS), a broad-coverage medical-language
ontology11 [103]. Despite the lack of homogeneity and the presence of inconsistency within
the ontology, they acknowledge that the UMLS is nonetheless progressing strongly towards
meeting formal terminology requirements [23] (see Appendix B for a more detailed discus-
sion). In their conceptual similarity metric they exploit the hierarchical structure of the
UMLS which, by default, places similar items nearer to one another. Previous work in this
area has indicated that a reasonable metric can be derived from the minimum path along
broader-than, or RB, links [113]. Caviedes and Cimino [23] extend this notion to include
parent, or PAR, links, which are semantically similar to is-a links but subsumed by broader-
than links. The authors note that “[o]ther Euclidean metrics based on geometric distances
in a feature space... are possible but very likely too computationally expensive for practical
use” [23, page 78]. As a rough solution to the problem of inconsistencies within the UMLS,
they assume the PAR trees are directed acyclic graphs (DAGs) and discard any cycles. They
acknowledge, however, that the ability to search within verified DAG hierarchies would im-
prove the accuracy of the distance values calculated, and that further research is needed
[23].
The authors calculate two values: the depth and the conceptual distance, CDist. The
depth value is a measure of the actual depth within the concept hierarchy and reflects the
specificity of the concept. Deeper concepts are more specialized, while shallower concepts
are more general (with the root concept being maximally general). Specifically, depth is
defined as the “shortest path from the most specific common ancestor [between the two
concepts being compared] to a root concept” [23, page 81]. The CDist is one measure of
conceptual distance calculated based on the “minimum path avoiding circular and infinite
paths” [23, page 79].
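The minimum-path idea behind CDist can be sketched as a breadth-first search over a toy concept hierarchy. The edges below are invented for illustration, not UMLS content:

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Minimum number of links between two concepts, via BFS, avoiding
    circular paths by tracking visited nodes."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")

# Toy concept hierarchy, stored as an undirected adjacency map.
edges = [("finding", "effusion"), ("finding", "fracture"),
         ("effusion", "pleural effusion"), ("effusion", "joint effusion")]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

# Sibling subtypes are two links apart; unrelated findings sit farther away.
assert shortest_path(graph, "pleural effusion", "joint effusion") == 2
assert shortest_path(graph, "pleural effusion", "fracture") == 3
```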
There have been a few suggestions as to the nature of the relationship between the
depth and the conceptual distance. Caviedes and Cimino [23] suggest the following metric:
Conceptual distance ∝ 1 / depth    (3.8)
11 See Appendix B for a detailed discussion.
This reflects the effect of depth on the generality of the concept and provides a means to
differentiate two concepts whose CDist values, as defined above, may be the same but whose
depth values differ. Other suggestions include a weighting value so that concepts nearer the
top of the hierarchy are less similar than those farther down [148, 119, 122, 84]. Richardson et
al [119] also use a measure of concept density within the hierarchy. They observe, however,
that irregularities in the densities give rise to unexpected distance measures [119, 122].
Roddick et al therefore extend this approach with transition costs accrued whenever a node
is traversed, and a “zooming” factor that gives preference to concepts that are closer to the
target concept [122].
Spanoudakis and Constantopoulos determine overall distance or similarity based on a
combination of partial distance factors that reflect different levels of detail: identification,
classification, generalization, and attribution [122, 141, 142].
Bousquet et al [20] use the weighted projections of two concepts (in their case, diagnoses)
along various axes and apply a vector distance calculation, Lp norm (a variant of L norm
[96]), to calculate the semantic distance between the two concepts. Thus, the semantic
distance or similarity between two concepts A and B can be determined using the following
Lp norm calculation:
L_p(A, B) = ( W_X |X_A − X_B|^p + W_Y |Y_A − Y_B|^p )^{1/p}    (3.9)

where X and Y are the axes, W_X and W_Y are the weights on the corresponding axes, and
p is the order of the norm.
Node-Based Similarity
One problem with the edge-based approach is the assumption that links within a vocabulary
represent uniform and symmetric distances [118, 78]. In fact, these distances can vary,
particularly in areas of high density, or where non-is-a links are used [118, 78]. Thus,
researchers have augmented the distance calculation via weights to reflect the information
content of a node (or concept).
As an initial approach to this problem, Resnik determined the informational content
(IC) of a concept, c, based on the information-theoretic notion of inverse log likelihood [78, 96, 118]
(here P(c) is the probability of c):
IC(c) = − log P(c)    (3.10)
Intuitively, as the probability of a concept increases the corresponding information content
decreases: concepts that are relatively high frequency (e.g. those that are higher up in the
hierarchy and consequently more general) provide a relatively small amount of information
[118]. It follows that “the more information that two concepts share in common, the more
similar they are” [118, page 2]. Within the taxonomy, this information content is determined
by the two subsuming concepts:
sim(c1, c2) = max_{c ∈ S(c1,c2)} IC(c)    (3.11)

where S(c1, c2) is the set of concepts subsuming both c1 and c2 in the network.
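Resnik's measure (Equations 3.10 and 3.11) can be sketched over a toy taxonomy. The concept probabilities and parent links below are invented for illustration; each concept's probability is assumed to include the occurrences of everything it subsumes:

```python
import math

# Hypothetical corpus probabilities for UMLS-like concepts.
p = {"entity": 1.0, "finding": 0.4, "effusion": 0.05, "fracture": 0.08,
     "pleural effusion": 0.02, "joint effusion": 0.01}
parents = {"finding": "entity", "effusion": "finding", "fracture": "finding",
           "pleural effusion": "effusion", "joint effusion": "effusion"}

def ic(concept):
    """Information content of a concept (Equation 3.10)."""
    return -math.log(p[concept])

def ancestors(c):
    out = {c}
    while c in parents:
        c = parents[c]
        out.add(c)
    return out

def resnik_similarity(c1, c2):
    """IC of the most informative common subsumer (Equation 3.11)."""
    common = ancestors(c1) & ancestors(c2)
    return max(ic(c) for c in common)

# The two effusion subtypes share the specific concept "effusion", while an
# effusion and a fracture share only the general concept "finding".
assert resnik_similarity("pleural effusion", "joint effusion") > \
       resnik_similarity("pleural effusion", "fracture")
```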
3.6.3 Hybrid Approaches
Like NBB methods, a hybrid approach of BB methods attempts to balance the weaknesses
of an individual approach with the complementary strengths of another. Thus, despite
the fragility of pattern matching, employing templates of common errors may increase the
performance of statistical techniques such as co-occurrence analysis. Likewise, the problem
of false positives in pattern matching may be offset by the co-occurrence score of the term or
phrase in question. Similarly, those errors for which insufficient training data was available
could be instead captured using non-probabilistic techniques.
To the author's knowledge, no such hybrid BB methods for error detection currently exist.
Chapter 4 contributes an original conceptualization for a hybrid, BB method of error
detection as applied to radiology reporting. Chapter 5 then provides a proof of concept
through a series of experiments.
3.7 A Note on Stop Lists
Stop words are words with little intrinsic meaning or semantic weight, such as “at” and
“the”. Typically, these words are found with such high frequency in the language that they
serve only as noise, losing all usefulness as search terms. In statistical analyses, stop words
are usually omitted since their overabundance in a text can affect the resulting probabilities
disproportionately. A list of stop words to be excluded from an analysis is referred to as a
“stop list”.
It may be argued that due to the low semantic load of stop words, errors involving them
are of minimal importance. From the perspective of safety-critical domains like medicine,
however, accuracy is vital and stop word errors should not be considered exempt from error
calculations or detection. Seemingly inconsequential errors can ultimately impact the clini-
cian’s interpretation of a report and should be avoided at all costs. For example, important
information conveying the location of pathology is often communicated via prepositions,
such as “in the”, “on the”, et cetera. This means that statistical methods employing such
stop lists (i.e. most if not all) will be inherently restricted in their success. As a result, the
best method for error detection in radiology will involve a non-statistical or hybrid approach.
The theory surrounding such a system will be the topic of Chapter 4.
3.8 Summary
In summary, the methods applied to error detection in ASR can be classified into black box
(BB) and non-black-box (NBB) methods. These in turn can be further specified according
to their use of probabilistic and non-probabilistic techniques. With such a classification
in place, it is now possible to put forward a new, hybrid BB method of error detection in
speech recognition within the context of radiology and with the goal of detecting errors at
the word level. In subsequent chapters, this new method will be introduced conceptually
and formally, and supported with a proof of concept.
Chapter 4
A Conceptual Model
Given the error classification discussed in Chapter 3, it is now possible to propose the follow-
ing original contribution: an error-detection methodology for the improvement of speech-
recognition output in radiology dictation. This chapter provides a conceptual introduction
to this model, its relation to the error-detection classification, and a formal definition. In
Chapter 5 a proof of concept will be provided via experimental evidence in the radiology
domain.
4.1 The General Idea
The overriding goal of this dissertation is not only to demonstrate that we can improve
the utility of ASR for radiologists, but to present a theoretical approach that does just
this. Central to this approach is the notion that by presenting radiologists with confidence
rankings on the ASR output, they will be able to proofread more efficiently through what
is essentially computer-aided document editing. This notion is supported by a recent study
in which Skantze and Edlund [135] demonstrate that human error-detection performance
improves when subjects are provided with a confidence ranking metric. In the study, this
metric was presented as a colour-coding on the words in the text based on the internal rank-
ing of the speech recognizer. A grey-scale representation ranging from dark grey, indicating
high recognizer confidence, to light grey, indicating low recognizer confidence, was used to
communicate these confidences to the user.
With this in mind, the goal of this dissertation can be formally stated:
Objective To develop a mapping from the individual words of a radiological free-text
report to a confidence ranking or error-tag set.
To achieve this mapping, an error-detection system must have some means for identifying
recognition errors within a text. This requires the identification of features whose values
will differentiate correct versus incorrect words. By relying on these features it is possible
to define different error-detection algorithms that may rely on different feature subsets, or
may differ in their feature handling.
Mapping the words of a text to an error-tag set provides a discrete indication of potential
errors or areas of low confidence. Essentially, words with confidence scores below a certain
threshold are flagged, while those above it are not. From the perspective of medicine,
however, all errors may be considered significant, so flagging every below-threshold word
equally, irrespective of error type, may be the most desirable behaviour.
Observation 1 The features of out-of-place words will be inconsistent with the expected
features of a word in that location.
For example, the probability of a word occurring given a particular context can be consid-
ered a feature of that word. A probability below a certain threshold, for instance, is not
consistent with the expected probability of a word in that location. Similarly, a word that
is syntactically out-of-place will not have the expected syntactic feature values. A hybrid
error detection approach utilizes as much information (i.e. features) from the text as pos-
sible through multiple detection algorithms to identify the maximum number of errors (of
varying type).
When working in radiology, we can take advantage of the constrained domain by defin-
ing features specific to radiology reports. For example, examining the various sections of
a standard radiology report reveals certain attributes that represent the expected features
of words occurring in those sections. Thus, the features of those words within the “Proce-
dures” section will relate to radiological procedures. Similarly, if a report is discussing an
examination of the knee, those concepts relating to the other parts of the body will have a
lower probability since the expected features will relate to the knee. When these and other
heuristics are combined, a characterization of each word or phrase in the report is
generated that can be used to calculate the degree of confidence in that word or phrase.
4.2 Introducing A Hybrid Approach to Error Detection
Observation 2 No single error-detection technique is sufficient to detect all potential errors
in a radiology report.
The goal in any post-recognition, error-detection system is 100% coverage of all error types
and 100% accuracy in identifying errors. The discussion in Chapter 3 shows that while the
various probabilistic and non-probabilistic methods of error-detection are each sensitive to
a particular subset of error types, none provide complete coverage over all types, nor do
any implementations achieve 100% or near-100% accuracy. In some cases, such as the use
of stop lists to omit stop words in statistical techniques, complete coverage is impossible.
Observation 3 By combining those methods of error detection that are complementary
in their coverage of error types, it is possible to achieve greater sensitivity to errors
within radiology reports.
Although individual error-detection techniques may be insufficient, if their coverage of error
types is shown to be complementary then the combination of multiple techniques via a
hybrid method will result in a higher coverage of error types. In addition, overlapping
areas of coverage will increase the reliability of each error mapping. Thus, the application
of complementary techniques in a hybrid approach will ensure maximum detection. In
this sense, the component error-detection algorithms can be considered as heuristics in the
hybrid system that improve the accuracy of the mapping of words into the error tag set.
Observation 4 Within the domain of radiology, black-box methods of error-detection are
the most viable.
Any solution to the problem of speech recognition in radiology must take into account the
potential variety of speech-recognition software currently in use. Not only is much of this
software proprietary, and therefore inaccessible to external interface, but variations in the
calculation of internal speech-utterance probabilities can affect any third-party software
designed to interface with these probabilities, creating inconsistent, unpredictable results.
Treating the ASR software as a black box and separating the error detection as a post-
recognition stage, however, avoids these problems and creates a second-level filter indepen-
dent from the speech recognizer. As mentioned in Chapter 1, this independence means that
a post-ASR error-detection system is not bound to a particular speech recognizer and there-
fore can be readily modified and updated. Furthermore, this avoids overspecializing, leaving
open the possibility of extending the methods to error detection beyond speech recognition.
Conclusion A black-box, hybrid approach to error detection is the best choice for an error-
detection methodology.
As an aside, in situations where the ASR software in question is non-proprietary or
otherwise accessible, or it is not necessary to have a system that is applicable to multiple
ASR implementations, the black-box restriction may be lifted in favour of a more general
hybrid approach.
On Tagging Errors – My Contribution
The ultimate goal of an error-detection system is a mapping function that when applied to a
text, such as a radiology report, will output a list of errors detected at the word level. This
list can be expressed superficially as a tag indicating that a word is “correct” or “incorrect”,
where “incorrect” means that a word can be described according to one of the error types
outlined in Chapter 3. Thus, all words are mapped to the error tag set {correct, incorrect}, irrespective of their error type.
In a hybrid method, however, this mapping function relies on the interaction of the
component error-detection heuristics. There are two possibilities for arriving at the word-
level tag map. In the direct method, the indication of an erroneous word in a text by at least
one heuristic is sufficient to trigger an “incorrect” tag on that word. In the indirect method,
the output from each error-detection heuristic is taken as input to a meta-level heuristic.
Each word in a text is provided with a score based upon the weighted aggregation of any
scores assigned to that word from the error-detection heuristics. Since the error coverage
differs with each error-detection method, not all words may have scores assigned from all
algorithms. If the output from each algorithm is a measure of confidence in the recognizer
output, then the combined result of applying all heuristics via the hybrid algorithm to the
text results in a complex confidence score for each word. These results can be combined
in any variety of ways, with the choice of meta-level heuristic affecting the final confidence
rankings and the overall performance of the system. Given a threshold that controls the
degree of filtering, these scores can be translated into “correct” and “incorrect” tags based
on a word’s proximity to this threshold. The threshold is chosen in order to maximize the
accuracy of the tag maps.
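The indirect method described above can be sketched in a few lines. This is a minimal illustration, not the thesis implementation: the heuristic interface (a function from word to a score in [0, 1], or None where the word falls outside that heuristic's coverage), the weights, and the threshold are all assumptions.

```python
# Sketch of the indirect method: per-word heuristic scores are aggregated
# by a weighted average, then thresholded into "correct"/"incorrect" tags.

def tag_words(words, heuristics, weights, threshold):
    tags = {}
    for w in words:
        scored = [(h(w), wt) for h, wt in zip(heuristics, weights)]
        scored = [(s, wt) for s, wt in scored if s is not None]
        if not scored:
            tags[w] = "correct"  # no heuristic covers this word
            continue
        total = sum(wt for _, wt in scored)
        confidence = sum(s * wt for s, wt in scored) / total
        tags[w] = "correct" if confidence >= threshold else "incorrect"
    return tags
```

Words covered by no heuristic default to "correct" here; a production system might instead mark them as low-confidence.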
The meta-level combination of the heuristics in a hybrid algorithm is a novel approach
to error detection in radiology reporting, and, to the author’s knowledge, ASR in gen-
eral. While some of the error-detection heuristics presented in this Chapter may take their
inspiration from previous research, the generation of a hybrid algorithm for radiology error-
detection based on the complementary strengths of the component algorithms is a completely
original contribution.
Creating a Single Confidence Score – Output Normalization
A naïve error-mapping function based on the direct method maps all words to “correct”
by default and “incorrect” if any error-detection heuristic indicates it as erroneous. Thus,
the error tag is based on the assignment of a single “incorrect” tag by any individual
heuristic, and not on the combined results of all heuristics returning an error value for that
word. While straightforward, such an approach does not take advantage of each heuristic’s
contribution to the assessment of the text and fails to exploit the differences in the nature
of the output from each. For example, a heuristic may have a high recall value, meaning
a high detection of actual errors in a text, but a low precision, meaning that it may also
return a high number of false positives. By using a cumulative value based on all heuristic
input, the results from the various algorithms may suppress the effect of a false positive
in the final output. This also makes it possible to represent the complicated relationships
between algorithms, such as the case of a heuristic that is particularly strong at detecting
one type of error. In this instance, an indication of an error of that type might be more
heavily weighted than an indication from another heuristic. In a similar fashion, multiple,
overlapping heuristics would act as backup measures, suppressing erroneous outliers and
increasing the reliability of the final mapping.
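For contrast, the naïve direct mapping discussed above is trivial to state: a word is "incorrect" as soon as any single heuristic flags it. The flagging functions in this sketch are invented for illustration.

```python
# Sketch of the direct method: any single flag suffices to mark a word
# "incorrect"; false positives from one heuristic propagate unchecked.

def direct_tag(words, heuristics):
    return {w: "incorrect" if any(h(w) for h in heuristics) else "correct"
            for w in words}
```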
An example of a more intelligent mapping function might create a single confidence
score via a meta-analysis of the component heuristics. Such a meta-analysis will take into
account the individual effect each heuristic brings to bear on the overall confidence score.
A simple, meta-level algorithm then normalizes the results from each heuristic and averages
them to produce a final confidence score, as shown in Equation 4.1, where hi(x) represents
the normalized value of heuristic h as applied to word x and there are n heuristics.
c(x) = (∑i hi(x)) / n    (4.1)
A more complicated aggregation algorithm will weigh the effect of a heuristic’s output
value on the final confidence score, reflecting individual differences among the heuristics.
This could be applied globally to all results from that heuristic, or, as mentioned above, only
affect the weight of the confidence of those words whose error types are a particular strength
or weakness of the heuristic. In Equation 4.2, a global weighting schema is shown that applies
to all output from a particular heuristic, where wi reflects the particular weighting of the
heuristic hi. Each heuristic may additionally harbour internal weighting schemes affecting
its own, interim output.
c(x) = ∑i hi(x) wi    (4.2)
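Equations 4.1 and 4.2 can be sketched together once raw heuristic outputs are rescaled to a common range. The min-max normalization and the example ranges are illustrative assumptions, not the thesis's chosen scheme.

```python
# Sketch of output normalization plus aggregation: raw heuristic scores on
# different scales are min-max rescaled to [0, 1], then combined either as
# a plain average (Equation 4.1) or a weighted sum (Equation 4.2).

def normalize(raw, lo, hi):
    # Assumes lo < hi; a degenerate range would need special handling.
    return (raw - lo) / (hi - lo)

def combined_confidence(raw_scores, ranges, weights=None):
    normed = [normalize(r, lo, hi) for r, (lo, hi) in zip(raw_scores, ranges)]
    if weights is None:                                   # Equation 4.1
        return sum(normed) / len(normed)
    return sum(h * w for h, w in zip(normed, weights))    # Equation 4.2
```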
It is not the case that an error-detection heuristic will necessarily map a word to a value
of a type that is compatible with the output types of the other heuristics. For instance,
one heuristic may map words to binary results, such as {correct, incorrect}, while another
may map from a more abstract level than the word level, such as the concept level, where
multiple words may represent a single concept and thus have a single error tag. In these
cases, we must normalize the output types into a common type suitable for aggregating
the individual heuristic results into a single confidence value for each word in the text
(or phrase, et cetera, depending on the chosen level of focus). In the concept-level error
detection example, this involves mapping concepts and their confidence scores back to the
individual words comprising those concepts, since the goal is error detection at the word
level. In the direct method, if all heuristic output translated to a word-level mapping
into {correct, incorrect}, then we can determine a final mapping for each word in the input
text. In the indirect method, the results from each heuristic are translated into a score. The
heuristics’ scores for each word are combined according to a weighting scheme to regulate
their effect to produce a final confidence score and/or tag.
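In the concept-level case mentioned above, normalization means projecting each concept's score back onto its component words. A minimal sketch, with invented concept spans over word indices:

```python
# Sketch of mapping concept-level confidence scores back to the word level.
# Words outside any concept receive None, meaning this heuristic offers no
# score for them.

def concept_to_word_scores(words, concept_spans):
    """concept_spans: list of ((start, end), score) over word indices,
    end exclusive."""
    scores = [None] * len(words)
    for (start, end), score in concept_spans:
        for i in range(start, end):
            scores[i] = score
    return scores
```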
Figure 4.1 provides an abstract representation of the hybrid approach. The filter repre-
sents the application of any weighting schemes applied to any particular heuristic output. It
is separate in the figure as it may have no effect on the input, which will be passed on to the
next stage of processing. From there, the output for each heuristic is normalized (converted
to a common type), and then combined to form an error tag (or confidence value) to word
mapping for each word in the report.

[Figure 4.1: The abstract hybrid system. A report is passed through the component heuristics (syntactic analysis, semantic analysis, word occurrence probabilities, and other heuristics); a filter applies any weighting to each heuristic's output, the error types are converted to a common form and combined, and the result is the final error mapping.]
4.3 A Note on the Measure of Correctness
This dissertation views the measure of correctness of a document as a direct consequence of
the word-error rate (WER) as shown in Equation 4.3, thus Cor(d) is the degree of correctness
of document d.
Cor(d) = 1 − WER    (4.3)
Alternative measures exist, such as the ratio of errors counted in a text to the number of
correct words counted; however, there is little evidence motivating the use of one measure
over another.
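Equation 4.3 can be made concrete with a standard word-level edit-distance computation. This is a generic WER sketch, not code from the thesis:

```python
# Word error rate via Levenshtein distance over word sequences, from which
# Cor(d) = 1 - WER (Equation 4.3) follows directly.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[-1][-1] / len(ref)

def correctness(reference, hypothesis):
    return 1 - wer(reference, hypothesis)
```

For the chapter's own example, "a tear at the crucial" against the reference "a tear at the cruciate ligament" involves one substitution and one deletion over six reference words.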
4.4 The Error-Detection Heuristics
A hybrid application of error-detection algorithms means increased sensitivity to errors and
error types. At the very least, the potential for 100% recall requires that the component
heuristics range over all error types as listed in Section 3.1.2. This means that regardless of
a hybrid method’s precision, it must at least be capable of detecting errors of all types.
The choice of heuristics for inclusion in the hybrid algorithm is an important question.
The motivation behind such a choice is twofold. First, choosing heuristics which are com-
plementary in their range of error types ensures that all types can be detected. Second, by
choosing algorithms with the greatest breadth of error type coverage at the relevant error
levels, the overlapping range of detection acts as a backup against false negatives. Together
these heuristics help to smooth out any weaknesses found within one approach and increase
the reliability of the output. Furthermore, since heuristics, by their very nature, are not
perfect methods, each heuristic aids in verifying and corroborating the results of the other
heuristics where their coverage overlaps.
Based on my review of the literature, presented in Chapter 3, I have selected the follow-
ing error-detection methods for their success in other applications (most notably dialogue
systems), their coverage of error report types, and their appropriateness for the radiology-
reporting domain. The intersection of the range of each heuristic’s output error types is
such that Structural, Syntactic, and Semantic errors are covered.
Together, the heuristics involve three levels of analysis:
Semantic analysis Semantic errors, generally covering all error types except stop words
and deletion errors.
Syntactic analysis Syntactic errors, generally covering all error types, including stop
words and deletions.
Word occurrence probabilities Semantic, syntactic and structural errors, generally cov-
ering all error types except stop words and deletion errors.
In addition to choosing a heuristic to cover each error type minimally, further heuristics
can be added to cover any error type in the interest of further insurance against system
errors.
4.4.1 Semantic Analysis
Since ASR works on the basis of the most probable translation of an audio signal to a word
in the lexicon, there is no restriction on the meaning of that word. Thus, in many cases even
if the word “giraffe” has exceptionally low probability of appearing in a radiology report, an
ASR system may still choose that word if it has the highest match probability with respect
to the audio input since it is a valid member of the lexicon. Such semantic errors can be
detected following recognition using techniques that rely on the meaning of words and how
those meanings constrain word use in a particular language or subset of language. In a
domain such as radiology, the concept base is more limited, making it possible to perform
more in-depth semantic analysis.
Observation 5 Concepts within a radiology report share a measurable degree of semantic
similarity.
Conceptual Similarity
Conceptual similarity is a high-level metric for determining the semantic relatedness (as
defined in Chapter 3) of two concepts. The aim is to exploit this similarity measure to
identify out-of-context or off-topic words or phrases that may indicate a recognition error.
Such measures are useful in constrained domains where language use is restricted to a subset
of that language. Given an ontological representation of the concepts within a domain, it
is possible to directly measure the distance between any two concepts, provided they occur
within that ontology. A more detailed discussion of this technique is available in 3.6.2. Such
a measure of distance is not intended as an exact measure, but rather an approximation of
semantic relatedness that is a consistent measure regardless of the concepts involved. Thus,
assumptions regarding the symmetrical nature of the distances between nodes within an
ontology should have little bearing on the overall result. When the distance between two
concepts lies beyond a certain threshold, this may be indicative of an error. To increase the
utility of this result, a weighting schema can be introduced to reflect the depth of concepts
within the tree. That is, a comparison of concepts at a relatively shallow depth should not
have as much impact on the confidence score as those at a deeper, more specified level.
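A toy sketch of the distance measure, assuming a tree-shaped ontology. The anatomy fragment, the parent table, and the depth convention are all invented for illustration; a real system would use a medical ontology such as the one discussed in Chapter 3.

```python
# Path distance between two concepts through their lowest common ancestor
# in a tiny, invented anatomy tree; depth can then weight comparisons so
# that shallow ones count for less.

PARENT = {                      # child -> parent
    "knee": "leg", "ligament": "knee", "meniscus": "knee",
    "leg": "body", "arm": "body", "elbow": "arm",
}

def path_to_root(concept):
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def distance(a, b):
    """Number of edges from a to b via their lowest common ancestor."""
    pa, pb = path_to_root(a), path_to_root(b)
    ancestors = {c: i for i, c in enumerate(pa)}
    for j, c in enumerate(pb):
        if c in ancestors:
            return ancestors[c] + j
    return len(pa) + len(pb)    # disjoint trees: maximal distance

def depth(concept):
    return len(path_to_root(concept)) - 1
```

A distance above some threshold would flag a concept as a possible error, with the flag weighted by depth.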
The primary challenge in a conceptual-similarity metric as an error-detection heuristic
is the choice of concepts from a text to compare. Within radiology, at the report level, a
topic marker can be generated that reflects the overall topic of the report. For instance,
each radiology report focuses on a particular anatomical region, such as an MRI of the
knee. This can be used to set the topic marker to “knee”, priming the system to expect
input relevant to that topic. Thus, concepts within the body of the report can be compared
to the topic marker to check for relevance. At a lower level, within the body of the report,
concepts occurring within the same context window can be compared. This context window
can be restricted to consider only those concepts within a certain radius, those within the
same section in the report, or even the set of all concepts within the report. If the ontology
supporting the similarity analysis can be shown to be complete with respect to the lexicon,
or contain a very high percentage of the lexical items most likely to appear in an actual
report, the inability to find a concept within that ontology can also be used as an indicator
of errors in the form of medically irrelevant concepts. The usefulness of such ontological
outliers can be improved if the concepts considered from the report are restricted to those
belonging to a more limited subset of the domain. The ontology is more likely to contain
these concepts, making the absence of a concept an informative measure.
Semantic Grammar
Observation 6 Semantic relationships between entities within a radiology report can be
exploited to identify likely error candidates.
A semantic grammar defines the rules of language based on the major semantic classes
within the domain of discourse [81]. Thus, instead of constraining words on the basis of the
syntactic, or structural, role they play, they are constrained on the basis of meaning, where
meaning is defined in terms of the semantic classes. For example, a semantic grammar for
a flight scheduling dialogue system might include the following query rule [81]:
InfoRequest → when does Flight arrive in City
Based on such a query, the parser is able to predict the semantic category of upcoming
words in a text. Therefore, when a word does not fit the expected category, it can be
flagged as a potential error. The result is a grammar of rules that are highly dependent on
a particular domain. Within radiology, however, this is appropriate, though it may hinder
expanding the error-detection system beyond the radiological domain later on as the rules
will be radiology-specific.
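A minimal sketch of category checking against a semantic-grammar rule, in the style of the InfoRequest example. The categories, the rule, and the convention that a capitalized slot names a semantic class are all invented:

```python
# A rule mixes literal words and semantic-class slots; a word whose
# category does not match its slot is flagged as a potential error.

CATEGORIES = {
    "tear": "Finding", "effusion": "Finding",
    "ligament": "Anatomy", "meniscus": "Anatomy",
    "crucial": "Modifier",
}

RULE = ["Finding", "at", "Anatomy"]     # e.g. "tear at ligament"

def check_rule(words, rule):
    """Return indices of words violating the rule's expected categories."""
    errors = []
    for i, (word, slot) in enumerate(zip(words, rule)):
        if slot[0].isupper():                    # semantic-class slot
            if CATEGORIES.get(word) != slot:
                errors.append(i)
        elif word != slot:                       # literal slot
            errors.append(i)
    return errors
```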
While the rules themselves are typically not hard to express in a semantic grammar, there
must exist a rule for each possible semantic pattern, and each possible syntactic form (for
example, the active and passive voice differ in their structure and arrangement of concepts
within). As a result the development of the grammar is a time-consuming process.
In addition to the general semantic rules, within all coherent texts it is possible to identify
semantic relationships existing between the concepts within the text. These relationships
can exist at multiple levels:
• Syntactic, or physical placement of one concept relative to another (or relative to
functional words such as prepositions).
• Semantic, such as thematic roles describing the expectations one word has of its ar-
guments1.
• Discourse levels, such as the relationships existing at the scale beyond the sentence.
Beyond these levels, relationships at levels of abstraction specific to the domain, such
as anatomy or causality, also exist. These describe domain specific constraints on the
relationships between concepts.
All of these levels of analysis provide information about the expected relationships be-
tween concepts and how those are expressed. For example, two concepts may be linked via a
limited selection of prepositions that define the nature of that relationship, such as a person
puts clothes on themselves, not in themselves. Identifying domain-specific archetypes can
be challenging, such as the thematic roles that help characterize radiology texts. A full
analysis of these relationships is beyond the scope of this thesis; however, their applicability
to future enhancements is discussed in Chapter 6.
Unlike the conceptual-similarity analysis, which determines a quantitative measure of
similarity on the basis of the distance between two (or more) concepts in an ontology, a
semantic grammar constrains the meaning of words or concepts based upon the categories
to which they belong. While related, the former assigns a numerical value measuring relat-
edness, and the latter identifies those words or concepts whose semantic categories do not
match the expected categories.
1For example, a transitive verb requires one or more complements describing the objects on which it acts. These complements may be semantically restricted to animate objects, for example, in a non-fictional setting. Examples include “has qualifier”, “has role”, “pertains to”.
4.4.2 Syntactic Analysis
Parsing
Observation 7 The points of failure in a syntactic parse can be used to identify likely error
candidates.
Although statistical methods have dominated error detection in ASR, their use of stop
lists and surface-level analysis prevents such systems from achieving 100% accuracy. To fill
this gap, non-statistical methods can offer a more in-depth analysis of the features within a
text.
Syntactic recognition errors include words or phrases that are out of place with respect
to their syntactic placement. In a misrecognized text, for instance, a verb may occur in
the text where the syntactic analysis would predict a noun. It is possible to identify these
syntax errors and apply a weight to determine a confidence score for words within the text.
For example, the phrase “a tear at the cruciate ligament” may be misrecognized as “*a
tear at the crucial”. In the misrecognized sentence the word “crucial” is located where a
noun phrase is expected, in contrast with the correct sentence, which contains the noun
phrase “cruciate ligament” in this location. The lack of a noun phrase in the incorrect
sentence identifies a potential deletion (i.e. the correct word was deleted), a misrecognition
of “crucial”, or both.
Thus, as a component of the hybrid approach to error detection, a syntactic parser can
be used to identify syntactic errors, including those which involve stop words and deletions.
In addition, while native English speakers are unlikely to make grammatical errors, those
who have learned English as a non-native language may have some problems with grammar.
These mistakes can also be detected via syntactic analysis. The sensitivity to syntactic errors
can be adjusted using a meta-level heuristic, which controls the effect each error-detection
heuristic brings to bear on the final analysis as mentioned in Section 4.2.
In radiology, information is often recorded using incomplete sentences or “bulleted” form.
In such cases the system must be sufficiently flexible to allow for these looser constructions,
while of sufficient granularity to detect when a sentence fragment is likely ungrammatical.
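A toy illustration of the syntactic check behind the "cruciate ligament" example: after a determiner and any adjectives, a noun must appear before the phrase ends. The POS lexicon and the phrase rule are invented, and a real implementation would use a full parser tolerant of bulleted fragments.

```python
# Flag determiners whose noun phrase never reaches a noun, as in the
# misrecognized "*a tear at the crucial".

POS = {
    "a": "DET", "the": "DET", "tear": "NOUN", "at": "PREP",
    "cruciate": "ADJ", "crucial": "ADJ", "ligament": "NOUN",
}

def flag_unfinished_noun_phrases(words):
    """Return indices of determiners starting a noun phrase with no noun."""
    errors = []
    i = 0
    while i < len(words):
        if POS.get(words[i]) == "DET":
            j = i + 1
            while j < len(words) and POS.get(words[j]) == "ADJ":
                j += 1
            if j >= len(words) or POS.get(words[j]) != "NOUN":
                errors.append(i)     # phrase started but no noun followed
            i = j
        else:
            i += 1
    return errors
```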
4.4.3 Word Occurrence Probabilities and “N-gram” Models
“You shall know a word by the company it keeps”
–J. R. Firth, English linguist, A Synopsis of Linguistic Theory, 1957
For the purposes of this initial experiment, there are two probabilistic techniques that
have been developed for the proposed hybrid method, namely co-occurrence relations and
Pointwise Mutual Information (PMI). Underlying these approaches is the key notion that by
identifying patterns common to error-free reports, we can automatically detect inaccuracies
within novel reports.
The first technique is based on my earlier work described in Voll et al [154] in which
co-occurrence relations [81, 96, 131] were found to have a high recall in detecting errors in
radiology reports. Given a sufficiently representative training corpus, words are associated
with particular contexts based on that corpus. These word-context statistics are then applied
to determine the probability of a word occurring in a given context in a report. This
probability represents a measure of the confidence in that word; if it falls below a certain
threshold the word will be flagged as a possible error.
The second technique is based on the work of Inkpen and Desilets, who report similar
results using PMI. They also discuss other techniques previously employed, but conclude
that PMI performs the best, in part because of the potential to scale up well to larger
databases (which is ultimately desired for better characterization of radiology reports) [75,
149].
By choosing two statistical algorithms, the results can be combined via the indirect
method2 to smooth out any anomalies within the calculations themselves to produce more
reliable results. The results can also be used towards a comparative evaluation of the two
techniques.
The Context Window
Statistical techniques to error detection rely on the properties of the environment in which
a word occurs. This environment can be defined in a variety of ways, from simply any word
in the neighbourhood of the target word, to only those words of similar type (such as all
nouns or verbs), to words meeting other criteria in common with the target word. A word’s
2See Section 4.2.
neighbourhood is called the “context window” and refers to those words that co-occur with
the target word in the text. The context window can be any size measured as the n words to
the left and right of the target word, where n can be the size of the entire text or as small as
a single word and does not necessarily refer to consecutive surrounding words. The choice
of size can have an impact on the accuracy and generality of certain statistics. If the
window is too small to provide a sufficient sample, the feature will be inaccurately
represented. Similarly, if the window is too large there is risk of a “cross-pollination” of
features that interfere with the statistics. From an efficiency perspective, a large window size
can introduce issues with respect to tractability. Methods relying on the context in which a
word occurs are often referred to as N-gram methods, essentially corpus-based probabilistic
models of a text. Recall that a “unigram” refers to the word itself, a “bigram” to a two-word
pairing, and so on. Based on these models, it is possible to build the statistical estimators
to determine probability estimates for words or features in a text [96].
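Extracting a symmetric context window of size n around a target word, the basic operation behind these statistics, can be sketched directly:

```python
# The n words to the left and right of index t; near the edges of the
# text, fewer than 2n words are returned.

def context_window(words, t, n):
    left = words[max(0, t - n):t]
    right = words[t + 1:t + 1 + n]
    return left + right
```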
The Training Corpus
A training corpus is a corpus or body of text intended as a representative sample of a
language. This language can be as broad as an entire natural language, such as English,
or restricted to particular domains of discourse within that language, such as medicine. If
such a corpus is representative of the language, then the statistical properties of this limited
sample set can be generalized to the entire domain.
The immediate challenge facing corpus-based linguistics is the notion of what constitutes
a “representative” sampling of the domain. In short, within a truly representative corpus
any properties observed must be extensible to the entire domain. Unfortunately, it is often
the case that researchers relying on such corpora are unable to choose the sample set, or
cannot identify it as truly representative. Thus, the adage “more is better” applies here,
with the idea that the larger the sample set the more likely it is to represent the domain. The
tradeoff is tractability. In addition, a sample set will be more representative of a smaller
domain than a larger one. In general, however, we must be aware that any statistical
analyses based on a limited corpus may introduce errors when extended to the full domain
[96]. A discussion of smoothing to avoid problems of data sparseness in corpora is provided
in Section 6.8.
In error detection, the statistical property most useful is the probability of a word
occurring in a text. This can be estimated from a training corpus by counting the number
of occurrences within that corpus and dividing by the total number of words in the corpus.
Similarly, given a context window of size n, it is also possible to calculate the probability of
a word occurring with any word from that context window. This results in a database of
two-word probabilities; the probability of any two words occurring together within a given
context-window size.
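The construction of this two-word probability database can be sketched as follows; the corpus format (a list of tokenized reports) is an assumption, and no smoothing is applied here.

```python
# Build unigram probabilities and the probability of each ordered word
# pair co-occurring within a context window of size win, estimated by
# simple counting over a training corpus.

from collections import Counter

def build_statistics(reports, win):
    unigrams, pairs = Counter(), Counter()
    total = 0
    for words in reports:
        total += len(words)
        unigrams.update(words)
        for t, w in enumerate(words):
            for c in words[max(0, t - win):t] + words[t + 1:t + 1 + win]:
                pairs[(w, c)] += 1
    p_word = {w: n / total for w, n in unigrams.items()}
    npairs = sum(pairs.values())
    p_pair = {wc: n / npairs for wc, n in pairs.items()}
    return p_word, p_pair
```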
Co-occurrence Relations
Observation 8 The probability of a misrecognized term occurring in a radiology report is
lower than the probability of a correctly recognized term.
As discussed in Chapter 3, co-occurrence relations can be a reliable indicator of the
probability of a word in the context of a report. Conceptually, the probability of a word
occurring independently is combined with the probability of it occurring in a given context.
Using the two-word statistics generated from the training corpus it is possible to combine
these results to generate a single probability score for the target word.
One means for determining the probability of a word given its context words is Bayes’
Theorem. Bayes’ Theorem evaluates the most probable hypothesis based upon the observed
information so far [102]. This can be applied to error detection by considering a word’s
occurrence in a report as a hypothetical statement about the world of radiology reports.
Similarly, the context of that word can be viewed as the observed data so far. Thus, given
a word, x, and a list of context words for x, C, Bayes’ Theorem is defined as follows [102]:
P(x|C) = P(x) ∗ P(C|x) / P(C)    (4.4)
In Equation 4.4 P (x) is referred to as the prior probability, that is the probability of
the word x occurring regardless of its context. This is the probability of x occurring in the
training corpus. In contrast, the desired quantity, P (x|C) is the posterior probability, the
probability that x will occur given its context is C. To arrive at a value for P (x|C), the
prior probability as well as the independent probability of the context P (C) and finally the
probability of the context occurring given that x does occur in that context, P (C|x), are
combined [102]. The denominator, P (C), is a normalization factor. Since it is a constant
and assumed to be independent of the target word x for the purposes of this calculation it
can be dropped.
While the probability of any individual word is simply its occurrence in the training
corpus divided by the number of words in that corpus, the probability of a given context
(a set of words) is more difficult. Thus, the calculation of P (C|x) is also complex. This is
handled by the principle of Joint Probability shown in Equation 4.5.
P(x) ∗ P(C|x) = P(x) ∗ P(C1, ..., Cn|x)            (by joint probability)
              = P(x) ∗ P(C1|x) ∗ ... ∗ P(Cn|x)     (assuming the Ci are conditionally independent given x)    (4.5)
              = P(x) ∗ ∏i=1..n P(Ci|x)
Bayes’ Theorem is a straightforward approach to determining the probability of a word
given its context, although other methods could be easily applied here.
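The score of Equations 4.4 and 4.5 can be sketched directly from the two-word statistics. The estimate P(c|x) = P(x, c)/P(x) and the toy probabilities in the test are illustrative assumptions:

```python
# Unnormalized naive-Bayes confidence: P(x) * prod_i P(c_i | x), with the
# constant denominator P(C) dropped. A zero score flags a word never seen
# with its context in the training corpus.

def bayes_score(x, context, p_word, p_pair):
    px = p_word.get(x, 0.0)
    if px == 0.0:
        return 0.0                        # word unseen in training corpus
    score = px
    for c in context:
        score *= p_pair.get((x, c), 0.0) / px    # P(c|x) = P(x, c) / P(x)
    return score
```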
Pointwise Mutual Information
Like co-occurrence relations, the PMI value of a word and its surrounding context can be
a useful measure of its likelihood of being correct. Again, by determining a context for
each word, it is possible to use the probability-statistics generated for the training corpus to
calculate the probabilities necessary for the PMI calculation. For the complex calculation
of a word and its context window P (x,C), any number of aggregation techniques can be
applied. The simplest of these is an average over the individual probability of x occurring
with each word in C. Inkpen and Desilets [75] looked at three aggregation techniques, for
which they found that averaging performed slightly better. Thus, for simplicity, this will be
the method of choice for application of PMI to radiology-text error detection.
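A sketch of the PMI confidence with averaging as the aggregation technique; the handling of unseen pairs (negative infinity) and the toy probabilities are illustrative choices, not from Inkpen and Desilets.

```python
# Pointwise mutual information of the target word with each context word,
# aggregated by averaging over the context window.

import math

def pmi(x, c, p_word, p_pair):
    joint = p_pair.get((x, c), 0.0)
    if joint == 0.0:
        return float("-inf")     # pair never observed in training
    return math.log2(joint / (p_word[x] * p_word[c]))

def avg_pmi(x, context, p_word, p_pair):
    return sum(pmi(x, c, p_word, p_pair) for c in context) / len(context)
```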
4.5 A Formalization of the Hybrid Approach to Error
Detection in Radiology
Given the discussion in the first half of this chapter, it is now possible to state a formalized
theory of hybrid error detection in radiology. To help the reader, the following formalizations
are each augmented with an English gloss and, wherever possible, examples from the domain.
4.5.1 General Definitions
Let Z be the set of integers.
Let N be the set of natural numbers.
Let L be a lexicon of words in the English language. Since theoretically a report can
contain any English word, L is not restricted to any subset of English.
Let R be a tuple s.t. R ∈ L~. Here L~ is defined as follows:

L~ = {(x1, ..., xn) | n ≥ 1 ∧ xi ∈ L ∧ 1 ≤ i ≤ n}    (4.6)
In other words, L~ is the set of all possible tuples created over the lexicon L, while R
is a tuple from this set representing a natural language, free-text, radiology report. Note
that the remainder of this chapter adopts the notation convention where a report of size n
is denoted Rn.
Let win be an integer representing the context-window size, s.t. win ∈ N.
Let th be an integer representing a threshold number, s.t. th ∈ Z. The threshold is
a constant that controls the degree of filtering. Recall that the error-detection algorithm
returns a value representing the confidence assigned each word in a report. If, for any word,
that confidence value exceeds the threshold, it will be tagged as an error.
Let EDA be a set of functions representing the error-detection heuristics (each of these
will be subsequently defined):
EDA = {parser, cor, pmir, sdr} (4.7)
In the usual notion of a set, any duplicate members within that set are not considered
distinct and therefore cannot be counted (as is needed in our probability calculations). A
multiset or mset, however, is defined as a “set” in which repeated elements are allowed [12].
Within set theory, we can more precisely define an mset M as a pair (A,m) where A is
some set and m : A → N is the multiplicity function. The set A is called the underlying set
and is defined as U(M), while m(M) defines a mapping from elements in A, to the number
of times they occur in M .
An mset is frequently written as a set of ordered pairs where the first element is the
underlying set, and the second element is the definition of the multiplicity function. For
example, the mset {a, a, b, c, c, c} is the “set” containing 2 a’s, 1 b, and 3 c’s, which is defined
as ({a, b, c}, {(a, 2), (b, 1), (c, 3)}) and where U(M) = {a, b, c}.
Let TC be a set of radiology reports that represents the training corpus, s.t. TC ⊆ L~.
Let TW be an mset where U(TW ) = {w1, · · · , wn} and wi ∈ TC (s.t. 1 ≤ i ≤ n).
Calculating co-occurrences:
The co-occurrences in a given report Rn represent a symmetric context window: the set
of tuples defined by pairing a word x in Rn with each of the win words occurring to the left
of x, and the win words occurring to the right of x.
The following preliminary functions are needed to define co-occurrences formally. Con-
sider a report Rn. Let t represent the index of a given word in Rn, called the target word
(this convention will be continued through the remainder of this chapter). The functions
before and after determine the win words occurring before and after the target word t,
respectively. Taking the target word wt, report Rn, and the window size win, each function
returns either the list of win words before wt in report Rn, or the list of win words after wt
in Rn.
Let before : L~ × N × N → 2^L be defined as follows:
before(Rn, t, win) = {xi | xi ∈ Rn ∧ max(1, (t − win)) ≤ i < t}    (4.8)
Let after : L~ × N × N → 2^L be defined as follows:
after(Rn, t, win) = {xi | xi ∈ Rn ∧ t < i ≤ min((t + win), n)}    (4.9)
Restricting the boundaries of i by max and min in before and after, respectively,
accounts for target words occurring near the beginning or end of a report (which may cause
the total number of words in the context window to be less than 2win since there will be
fewer words returned by before or after).
For example, given the following trivial report (the indices are added for clarity):
R test5 = (the0, xray1, shows2, nothing3, abnormal4)
From R test5 it is possible to calculate the following for the target word xt = shows2 (t = 2):
before(R test5, 2, 2) = {the0, xray1}
after(R test5, 2, 2) = {nothing3, abnormal4}
The definitions of before and after can be used to determine the co-occurrence relations.
The function co will take a report Rn (of size n), a target-word indexed by t, and a window
size win, and return all co-occurrence pairs that occur in Rn. That is, all pairs where xt is
the first element, and the second element is from the set of words that occur win words to
the left and win words to the right of xt.
Let co : L~ × N × N → 2^(L×L) be defined as follows:
co(Rn, t, win) = {(xt, xi) | xi, xt ∈ Rn ∧ xi ∈ before(Rn, t, win)} ∪
                 {(xt, xi) | xi, xt ∈ Rn ∧ xi ∈ after(Rn, t, win)}    (4.10)
For example, given R test5 above (where x2 = shows2):
co(R test5, 2, 2) = {(shows2, the0), (shows2, xray1), (shows2, nothing3), (shows2, abnormal4)}
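The definitions of before, after, and co (Equations 4.8–4.10) translate directly into code. The sketch below is illustrative only (the thesis tooling was written in Perl, not Python) and uses 0-based indexing, matching the example indices in R_test5:

```python
def before(report, t, win):
    """The up-to-win words immediately preceding index t (cf. Equation 4.8)."""
    return report[max(0, t - win):t]

def after(report, t, win):
    """The up-to-win words immediately following index t (cf. Equation 4.9)."""
    return report[t + 1:t + 1 + win]

def co(report, t, win):
    """All (target word, context word) pairs for index t (cf. Equation 4.10)."""
    return [(report[t], w) for w in before(report, t, win) + after(report, t, win)]

r_test5 = ("the", "xray", "shows", "nothing", "abnormal")
```

For the target word at t = 2, co(r_test5, 2, 2) yields the four pairs listed above; near the report boundaries the window truncates, exactly as the max/min bounds require.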
Next, a function is needed to define the co-occurrences over the training corpus, TC.
Let tcs be the number of reports in TC.
Let ni be the number of words in a report TCi.
Let trainingCOs : 2^(L~) × N → 2^(L×L) × N be a function generating an mset of
co-occurrences based upon the training corpus, TC:
trainingCOs(TC, win) = ⋃(i=1..tcs) ⋃(t=0..ni−1) co(TCi, t, win)    (4.11)
For example, consider the following trivial case:
TCtest = {(the0, xray1, shows2, nothing3, abnormal4), (the0,mri1, is2, unremarkable3)}
It is possible to determine the following, based upon the definitions so far (where TCi refers
to the ith report in TCtest) with a sample window size of 2 (win = 2):
co(TC1, 0, 2) = {(the0, xray1), (the0, shows2)}
co(TC1, 1, 2) = {(xray1, the0), (xray1, shows2), (xray1, nothing3)}
co(TC1, 2, 2) = {(shows2, the0), (shows2, xray1), (shows2, nothing3), (shows2, abnormal4)}
co(TC1, 3, 2) = {(nothing3, xray1), (nothing3, shows2), (nothing3, abnormal4)}
co(TC1, 4, 2) = {(abnormal4, shows2), (abnormal4, nothing3)}
A similar result is obtained for TC2. Given this, trainingCOs(TCtest, 2) produces the
following mset as the combination of the results from TC1 and TC2 as per our definition of
trainingCOs:
{co(TC1, 0, 2) ∪ co(TC1, 1, 2) ∪ co(TC1, 2, 2)∪
co(TC1, 3, 2) ∪ co(TC1, 4, 2) ∪ co(TC2, 0, 2)∪
co(TC2, 1, 2) ∪ co(TC2, 2, 2) ∪ co(TC2, 3, 2)}
The above mset is defined by the set of ordered pairs consisting of a tuple, representing the
co-occurrence, and a cardinal number, representing the count of the number of times that
co-occurrence occurs in TC (over a window size win).
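Representing the co-occurrence mset with an explicit multiplicity function maps naturally onto a counting dictionary. A minimal sketch (illustrative names only; collections.Counter plays the role of the pair (A, m)):

```python
from collections import Counter

def co(report, t, win):
    """Co-occurrence pairs for the word at 0-based index t (cf. Equation 4.10)."""
    left = report[max(0, t - win):t]
    right = report[t + 1:t + 1 + win]
    return [(report[t], w) for w in left + right]

def training_cos(corpus, win):
    """The mset of all co-occurrence pairs over the corpus (cf. Equation 4.11)."""
    pairs = Counter()
    for report in corpus:
        for t in range(len(report)):
            pairs.update(co(report, t, win))
    return pairs

tc_test = [("the", "xray", "shows", "nothing", "abnormal"),
           ("the", "mri", "is", "unremarkable")]
```

The Counter's keys form the underlying set U(M) and its values the multiplicities m; summing the values counts every pair with repetition, which is exactly what the probability calculations below require.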
Calculating Probability
Recall that the probability of an element with respect to a corpus is the number of times
that element occurs in that corpus divided by the size of the corpus.
Let countPair : (L × L) × 2^(L~) × N → N be the number of times that the pair (xi, xj)
occurs in the mset (trainingCOs(TC, win), m) (recall m from our definition of mset above).
countPair((xi, xj), TC, win) = {n | ((xi, xj), n) ∈ m(trainingCOs(TC, win))}    (4.12)
Similarly, let countWord : L × 2^(L×N) → N be the number of times that a word, xi, occurs
in the mset TW.
countWord(xi, TW) = {n | (xi, n) ∈ m(TW)}    (4.13)
Let p1 : L → R be a function representing the probability of an element xi occurring in
the training corpus words TW .
p1(xi) = { 0                             if countWord(xi, TW) = 0
         { countWord(xi, TW) / |TW|     if countWord(xi, TW) > 0    (4.14)
Note that in the first case, when xi ∉ TW, the probability of that word is zero.
Let p2 : (L × L) × N → R be a function representing the probability of a pair (xi, xj) occurring
in the training corpus co-occurrences, as defined by the window size win.
p2((xi, xj), win) = { 0                                                          if countPair((xi, xj), TC, win) = 0
                    { countPair((xi, xj), TC, win) / |trainingCOs(TC, win)|      if countPair((xi, xj), TC, win) > 0    (4.15)
As with p1, the first case of p2 captures when the co-occurrence does not appear in the mset
defined by trainingCOs(TC, win), and thus the probability of that co-occurrence is zero.
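The two probability functions reduce to relative frequencies over the counts. A sketch with toy counts (loosely drawn from the "quadriceps" numbers in Table 5.1, not the real corpus statistics); note that a Counter returns 0 for unseen keys, which realizes the zero cases of Equations 4.14 and 4.15:

```python
from collections import Counter

# Toy counts standing in for the training words TW and the co-occurrence mset.
word_counts = Counter({"quadriceps": 123, "tendon": 500})
pair_counts = Counter({("quadriceps", "tendon"): 38, ("quadriceps", "patellar"): 32})

def p1(word):
    """Relative frequency of word in TW; 0 when unseen (cf. Equation 4.14)."""
    return word_counts[word] / sum(word_counts.values())

def p2(pair):
    """Relative frequency of a co-occurrence pair; 0 when unseen (cf. Equation 4.15)."""
    return pair_counts[pair] / sum(pair_counts.values())
```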
Calculating PMI
The function pmi calculates all co-occurrence pairs for Rn (a report of size n) given
an index t for a word within that report, and a window size win. It then applies the PMI
calculation3 to those pairs, and returns a real number ri for each such pair. Note that
all pairs will have as their first element the word at index t based upon the definition of
co-occurrence.
As before, let Rn be a report of size n, let t be the index of a word in Rn, and win the
window size such that 0 ≤ win ≤ n. Then, let pmi : L~ × N × N → 2^R be defined as the
following.
3 From Section 3.6.1.
pmi(Rn, t, win) = {ri | (xt, xi) ∈ co(Rn, t, win) ∧
                        ri = p2((xt, xi), win) / (p1(xt) × p1(xi)) ∧
                        1 ≤ i ≤ n}    (4.16)
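The score in Equation 4.16 is the ratio of the pair probability to the product of the word probabilities (classic PMI takes the logarithm of this ratio; Section 3.6.1 gives the exact form used, so only the bare ratio is sketched here, with illustrative names):

```python
def pmi_score(p_pair, p_x, p_y):
    """p2((x, y), win) / (p1(x) * p1(y)); taken as 0 when a marginal is unseen."""
    if p_x == 0.0 or p_y == 0.0:
        return 0.0
    return p_pair / (p_x * p_y)
```

A ratio above 1 means the pair co-occurs more often than independence would predict; rarely associated pairs score near 0, which is what pushes misrecognized words toward the threshold.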
Applying Bayes’ Theorem
Given the definitions of Bayes’ Theorem previously in this chapter (see Equations 4.4 and
4.5), the following formal definitions are possible.
The function bt takes a word x, and a list of words {y1, · · · , yn}, representing all words
with which x co-occurs in some report (that is, x’s context), and returns a real number
representing the probability of that word occurring in that context. Note that there is no
denominator (compare to Equation 4.4). This is because the denominator would represent
the probability of observing the context of x, namely p({y1, · · · , yn}), which in this case is
1 since the context has already been observed. Thus it has been omitted.
Let x be a word and let {y1, · · · , yn} be the set of context words for x. Then, let bt :
L × L^n × N → R be a function for applying Bayes' Theorem to a word and its context, given
a particular window size.
bt(x, {y1, · · · , yn}, win) = p1(x) × ∏(i=1..n) p2((x, yi), win)    (4.17)
The function context returns the words of the context window surrounding a word, xt,
in a report of size n, Rn.
Let context : L~ × N × N → 2^L be defined as the following.
context(Rn, t, win) = {xi | (xt, xi) ∈ co(Rn, t, win) ∧ 1 ≤ t ≤ n}    (4.18)
For example, recall the test report, R test5 = (the0, xray1, shows2, nothing3, abnormal4):
context(R test5, 2, 2) = {the0, xray1, nothing3, abnormal4}
4.5.2 The Error-Detection Algorithm
With the above definitions, the individual error-detection heuristics can now be formalized.
Co-Occurrence Report Analyser
The function cor applies a co-occurrence analysis to a report Rn, producing a confidence
value for each word within that report, based on its occurrence in TC. Here Bayes' Theorem
is used to aggregate the results of the co-occurrence analysis (co(Rn, i, win)) on the context
window of each word xi into a single value. This value is then compared to a threshold th,
and only those words for which the value falls at or below the threshold are returned as errors.
Let cor : L~ × N → 2^L be defined as the following.
cor(Rn, win) = {xi | xi ∈ Rn ∧
                     ri = bt(xi, context(Rn, i, win), win) ∧
                     ri ≤ th}    (4.19)
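The aggregation in cor is a product over the context (Equations 4.17 and 4.19). The sketch below supplies the probabilities as toy dictionaries in place of the trained statistics; all names and values are illustrative, not the thesis implementation:

```python
from math import prod

def bt(word, context, p1, p2):
    """p1(x) times the product of p2((x, y)) over context words y (cf. Equation 4.17)."""
    return p1.get(word, 0.0) * prod(p2.get((word, y), 0.0) for y in context)

def cor(report, win, th, p1, p2):
    """Flag words whose Bayes-aggregated confidence falls at or below th (cf. Equation 4.19)."""
    errors = []
    for i, word in enumerate(report):
        context = report[max(0, i - win):i] + report[i + 1:i + 1 + win]
        if bt(word, context, p1, p2) <= th:
            errors.append(word)
    return errors

# Toy probabilities standing in for the training-corpus statistics.
P1 = {"the": 0.3, "shows": 0.5, "xray": 0.2}
P2 = {("the", "shows"): 0.5, ("shows", "the"): 0.5,
      ("shows", "xray"): 0.5, ("xray", "shows"): 0.5}
```

An out-of-vocabulary word such as "eye" gets probability zero, drags its whole Bayes product to zero, and is flagged along with its disrupted context.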
Pointwise-Mutual-Information Report Analyser
Let aggregate : 2^R → R be a function which collects the results from applying pmi to a
report into one value. As there are many approaches to such aggregation, the specifics are
not defined here4.
The function pmir determines an aggregated PMI score for each word xi in the report
Rn and xi’s context. The PMI score is then compared to a threshold, and only those results
whose value falls below the threshold are returned as an error.
Let pmir : L~ × N → 2^L be defined as follows.
pmir(Rn, win) = {xi | xi ∈ Rn ∧
                      zi = aggregate(pmi(Rn, i, win)) ∧
                      zi ≤ th}    (4.20)
4 In Inkpen and Desilets, the authors discussed several options for aggregating PMI results; in this dissertation, for instance, the results are simply summed and averaged.
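A sketch of pmir (Equation 4.20) using sum-and-average aggregation, the option the footnote mentions. The PMI scores per pair are supplied as a toy table; names and values are illustrative:

```python
def aggregate(scores):
    """Sum-and-average aggregation of a word's PMI scores."""
    return sum(scores) / len(scores) if scores else 0.0

def pmir(report, win, th, pmi_table):
    """Flag words whose aggregated PMI falls at or below th (cf. Equation 4.20)."""
    errors = []
    for i, word in enumerate(report):
        context = report[max(0, i - win):i] + report[i + 1:i + 1 + win]
        scores = [pmi_table.get((word, y), 0.0) for y in context]
        if aggregate(scores) <= th:
            errors.append(word)
    return errors

# Toy PMI scores for the "eye laterally" fragment of Chapter 5.
PMI_TABLE = {("possible", "spondylolysis"): 3.0, ("spondylolysis", "possible"): 3.0,
             ("spondylolysis", "eye"): 0.0, ("eye", "spondylolysis"): 0.0}
```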
Syntactic Parser
The actual implementation specifics of the parser are not important here, as any syntactic
parser implementation would be considered functionally equivalent, provided the following
(more general) definitions still hold. An example implementation is provided in Chapter 5.
Let Sn = (x1, · · · , xn), xi ∈ L, be a tuple of words of size n representing an English
sentence.
Let parse be some relation between a sentence S and those subsequences of S which cor-
respond to constructions or constituents within the grammar defined by the natural language
being used (that is, those captured by ValidEnglishConstituents, where a "constituent" is
a functional unit of one or more words in a language5).
Let sent : L~ → 2^(L~) be a function that returns all of the valid English sentences within
a report, where a sentence is a subsequence of a report (tuple).
sent(Rn) = {Si | Si ⊆ Rn}    (4.21)
Let getErrors be a function mapping a parse relation to a set of words representing
errors. Again, this is only hollowly defined, as it may vary depending on one's method of
parsing, or desired method of error collection based upon the parser.
Let parser : L~ → 2^L be a function that, given a report Rn, returns those words that
are considered errors (based on some function getErrors above). Here s is defined as the
number of sentences in Rn.
parser(Rn) = ⋃(j=1..s) {xi | xi ∈ getErrors(parse(Sj)) ∧ Sj ∈ sent(Rn)}    (4.22)
The function parser takes a radiology report Rn and collects the union of all errors
returned for every sentence Sj found within that report.
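The parser heuristic (Equation 4.22) only requires that parse and getErrors exist; both are deliberately hollow here, as in the text. The stand-ins below accept a sentence only if a toy "grammar" lists it, and flag every word of a failed sentence — one of many possible error-collection policies, not the one implemented in Chapter 5:

```python
def parse(sentence, grammar):
    """Stand-in parse relation: succeeds iff the toy grammar lists the sentence."""
    return sentence in grammar

def get_errors(sentence, parsed_ok):
    """Stand-in error-collection policy: flag every word of a failed sentence."""
    return set() if parsed_ok else set(sentence)

def parser(sentences, grammar):
    """The union of the errors collected from each sentence (cf. Equation 4.22)."""
    errors = set()
    for s in sentences:
        errors |= get_errors(s, parse(s, grammar))
    return errors

GRAMMAR = {("the", "xray", "shows", "nothing", "abnormal")}
```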
Semantic Distancer
The semantic distancer is a conceptual formalization of the semantic-similarity measure. If
the ontology being used is considered as a directed graph, the following definitions hold (as
in the case of the UMLS):
5The reader is directed to Appendix A for more information on constituents.
Let V ⊆ L~ be a set of vertices (concepts).
Let E be a set of edges (tuples) of the form (x, y) where x is directed to y, s.t. x, y ∈ V .
Let G be a graph s.t. G = (V,E).
Let Distance : L~ × L~ → N be a function that returns the length of a path between
any two vertices, x ∈ V and y ∈ V, in a graph G.
Let C ∈ L~ be a tuple of words representing a concept. Note that a concept can be
comprised of more than one word. For example, “radiology report” is two words, but may
be represented by a single concept (i.e. tuple) containing both words.
Let Concepts be the set of all concepts C in the domain as defined over L~.
The reportConcepts function maps the words within a report (a tuple of words) to those
subsequences of that tuple which correspond to concepts. That is, those tuples which are
contained in the set Concepts.
Let reportConcepts : L~ → 2^(L~) be a function defined as follows.
reportConcepts(Rn) = {(xi, · · · , xj) | Rn = (x1, · · · , xn) ∧
                                        (xi, · · · , xj) ∈ Concepts ∧
                                        1 ≤ i ≤ j ≤ n}    (4.23)
It is also possible to calculate the co-occurrences of one concept with respect to another.
Thus, the following functions are modifications of the functions before, after, and co so
that they now apply to concepts:
Let CSc represent a set of concepts of size c, obtained via reportConcepts(Rn) for some
report Rn.
Let concept before : 2^(L~) × N × N → 2^(L~) be defined as follows.
concept before(CSc, t, win) = {ci | ci ∈ CSc ∧ max(1, (t − win)) ≤ i < t}    (4.24)
Let concept after : 2^(L~) × N × N → 2^(L~) be defined as follows.
concept after(CSc, t, win) = {ci | ci ∈ CSc ∧ t < i ≤ min((t + win), c)}    (4.25)
Let concept co : 2^(L~) × N × N → 2^(L~ × L~) be defined as follows.
concept co(CSc, t, win) = {(ct, ci) | ci ∈ CSc ∧ ci ∈ concept before(CSc, t, win)} ∪
                          {(ct, ci) | ci ∈ CSc ∧ ci ∈ concept after(CSc, t, win)}    (4.26)
The function sd determines the semantic distance of all concepts within a report, Rn,
that are up to win concepts away from the concept indexed at t. The value weight represents
an optional weight factor that may be applied to reflect the varying strength of certain edges
(as discussed in Section 3.6.2).
Let sd : 2^(L~) × N × N → 2^R be defined as follows.
sd(CSc, t, win) = {zi | (ct, ci) ∈ concept co(CSc, t, win) ∧
                        zi = Distance(ct, ci) × weight}    (4.27)
The function sdr takes a report, Rn and a window size, win, and determines the set
of semantic distance values for all concepts within the report. It returns the set of those
concepts whose semantic distance values are equal to or below the threshold value.
Let sdr : L~ × N → 2^(L~) be defined as follows.
sdr(Rn, win) = {ci | ci ∈ reportConcepts(Rn) ∧
                     zi ∈ sd(reportConcepts(Rn), i, win) ∧
                     zi ≤ th}    (4.28)
Lastly, let wordmap : (L~)^n → (2^L)^n be a function that maps a list of concepts CS
(represented as word tuples, recall) to a list of the sets of individual words within each
concept, {w1, · · · , wn}, where n = |CS|. Since a concept can be comprised of more than one
word, this mapping is necessary to identify the individual word errors (since the system in
question reports errors at the word level). For example:
wordmap({(radiology, report), (lesion)}) = {{radiology, report}, {lesion}}    (4.29)
Figure 4.2: A Venn diagram showing the similarities between ER (the set of errors detected) and AE (the set of actual errors); the overlap corresponds to TP, and the non-overlapping regions to FP and FN.
Errors Detected
ER is defined as the set of errors detected:
ER = pmir(Rn, win) ∪ parser(Rn) ∪ cor(Rn, win) ∪ wordmap(sdr(Rn, win))    (4.30)
C is defined as the set of correct words, i.e. the words of the report not detected as errors:
C = Rn \ ER    (4.31)
AE ∈ 2^L is defined as a set of words representing the set of actual erroneous words in a
report. The following definitions are then possible:
• FP is defined as a set of words representing the false positives, s.t. FP = ER \ AE
and FP ∈ 2^L.
• FN is defined as a set of words representing the false negatives, s.t. FN = AE \ ER
and FN ∈ 2^L.
• TP is defined as a set of words representing the true positives, s.t. TP = AE ∩ ER
and TP ∈ 2^L.
Figure 4.2 shows a Venn diagram highlighting the relationship between the two sets AE and
ER, and a visual representation of FP , TP , and FN .
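In code these definitions are plain set algebra: a false positive is detected but not actual, a false negative actual but not detected, and a true positive both. A minimal sketch with toy sets:

```python
ER = {"eye", "laterally", "tendons"}   # toy set of detected errors
AE = {"eye", "laterally", "vastus"}    # toy set of actual errors

FP = ER - AE    # flagged, but not actually an error
FN = AE - ER    # an actual error the system missed
TP = AE & ER    # correctly flagged errors
```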
4.6 Summary
This chapter has presented a hybrid, black-box-based error-detection method for ASR in
radiology. The observations provided in this chapter and the error-detection classification
laid out in Chapter 3 demonstrate a robust system that will capitalize on the strengths
of the heuristics when applied together on the same document. In the following chapter, a
series of experiments will be presented as proof of concept showing the system’s viability and
performance, including an increase in detection accuracy over any independent heuristic.
Chapter 5
Experimental Evidence
In this chapter the problem of error detection in radiology is viewed from an experimental
perspective. The heuristics outlined in the previous chapter are implemented to demonstrate
their efficacy as independent error-detection methods, and finally combined as proof of
concept of the hybrid approach.
5.1 Introduction to Proposed System
To demonstrate the viability of the methodology proposed in Chapter 4, the implementation
in this chapter has been designed as a proof of concept. The combined performance of the
error-detection heuristics is sufficient to support the thesis that error-detection is capable
of improving the performance of ASR in radiology, and likewise, the conclusion in Chapter
4 that a hybrid method will outperform any single method. On a larger scale, and beyond
the scope of this dissertation, the full error-detection system will provide an interface for an
interactive review of the report summary as well as the confidence scores. The radiologist
will be able to efficiently correct the erroneous input from this interface by concentrating on
words tagged with confidence scores below a certain threshold, while skimming those above
this threshold. This can be further extended by intelligently suggested corrections. The
interface will also present the option of searching the database of existing reports. This will
be set up to facilitate extension beyond the local database to intranet and Internet searches
as well. These extensions and others are explored in Chapter 6.
CHAPTER 5. EXPERIMENTAL EVIDENCE 76
5.1.1 Materials
Corpora
The Training Corpus This proof of concept relies on the availability of radiology
reports, collected via speech recognition, to design, train, and test the system. The Canada
Diagnostic Centre (CDC) in Vancouver, BC, has provided 2751 corrected and de-identified
radiology (MRI) reports obtained using the Dragon NaturallySpeaking speech-recognition
system, version 7.3. The co-occurrence statistics of varying window sizes have also been
compiled for these reports.
Note that in these reports the “Techniques” section was provided as a template that the
user selected at the time of dictation. As a result, this section is not susceptible to errors
introduced as a result of ASR and is not used in any of the upcoming studies.
The Test Corpus Since the 2751 radiology reports from the CDC have been corrected,
they cannot be used to test the error-detection system. In response to this, Dr. Bruce Forster
of the CDC has suggested an arrangement whereby raw, uncorrected reports can be obtained
from the CDC along with their corrected counterparts. Thus, an additional corpus of 30
raw, uncorrected radiology reports paired with their corrected versions was collected. Since
these reports were part of an ongoing collection by the CDC, they were produced only when
time was available, and include all scan types, unlike the training data, which is limited to
MRI. The presence of other scan types (such as CT) in the test data will influence the final
results via the system’s ability to successfully generalize beyond MRI reports. Arguably,
the resulting vocabulary variation between modalities should be minimal since much of the
radiological parlance overlaps. This is discussed in Section 5.3. Out of these test reports,
there is an average of 11.9 errors per report, with an average report length of 80.8 words.
This represents an average word-error rate (WER) of 15%.
In developing an adequate database of test-report pairs (that is, raw and corrected
report pairs), an initial attempt was made to re-dictate a series of corrected exams for which
the raw report was no longer available. Dr. Forster assisted in this process by reading
from a print-out of the report in question. Interestingly, it was found that the ASR system
performed surprisingly well on these reports. The speculation is that the cadence when
reading from a printed report is significantly different from that of dictating "on the fly".
Consequently, dictation is smoother, with fewer false starts or filler words such as "um" or
"er". As a result, such a method is not viable for creating a realistic test corpus.
As with the training corpus, the “Techniques” section of a report is ignored.
Ontological Knowledge Source
The semantic analysis portion of this work requires access to an ontological knowledge
source. Based on the discussion in Appendix B, the Unified Medical Language System
(UMLS) has been chosen for this purpose1. Briefly, the UMLS is developed by the Na-
tional Library of Medicine (NLM) with the intent to facilitate automated natural language
understanding in medicine. It comprises three knowledge sources, the Metathesaurus, the
Semantic Network, and the SPECIALIST lexicon. The Metathesaurus is an ontological
source of knowledge built upon many source vocabularies that have been combined into a
single database. Within this database concepts are organized by their relationship to other
concepts, such as, for example, the “is-a” relation. The Semantic Network provides a general
and overriding conceptualization of all concepts and their relationships within the Metathe-
saurus, regardless of their source vocabulary. Each concept within the Metathesaurus is
linked to one of the abstract concepts, called “semantic types”, within the Semantic Net-
work. These semantic types represent major groupings, split at the most general level into
event and entity. Finally, the SPECIALIST lexicon is a database of lexical information
useful for natural language processing. The terms found within the Metathesaurus and
Semantic Network, for example, are found within the SPECIALIST lexicon.
5.2 Methods
5.2.1 Modular Design
As a software-engineering methodology, the modular design of the hybrid algorithm has a
number of advantages over single techniques. First, it is possible to develop and evaluate
each heuristic incrementally and individually. In addition, many of the drawbacks applicable
to particular heuristics can be overcome by the combination of multiple results. For instance,
if it is not possible to obtain a sufficient training corpus for the purposes of co-occurrence
analysis, it is still possible to derive confidence rankings using the other heuristics. In the
1http://www.nlm.nih.gov/research/umls/ Accessed: February 2006; Updated: February 2006.
CHAPTER 5. EXPERIMENTAL EVIDENCE 78
long term, modularization lends itself to software reusability and the possibility of multiple
software developers, resulting in a more robust, usable system.
5.2.2 Calculating Results
To find the actual errors in our test reports, the corrected and uncorrected reports are
aligned and any differences are identified and tagged as errors. These are then compared
to the flagged errors from the program output to obtain the results: a match is considered
a correct detection, or true positive; a flagged error that does not correspond to an actual
error is considered a false positive; an error not flagged is considered a false negative.
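Word-level alignment of a raw report against its corrected counterpart can be sketched with difflib.SequenceMatcher. This is illustrative only — the thesis alignment involved human interpretation in the ambiguous cases — and the "corrected" wording below is invented for the example:

```python
from difflib import SequenceMatcher

def tag_errors(raw, corrected):
    """Return the raw-report words that differ from the corrected report."""
    matcher = SequenceMatcher(a=raw, b=corrected, autojunk=False)
    errors = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":          # replace/delete blocks in the raw report
            errors.extend(raw[i1:i2])
    return errors

raw = "possible spondylolysis eye laterally of L5".split()
corrected = "possible spondylolysis bilaterally of L5".split()  # invented correction
```

Matched words count as true negatives; the differing spans are the "actual errors" against which the detector's flags are scored.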
In calculating the results for this experiment, Recall is a measure of the number of errors
correctly detected over the total number of errors actually present (how many actual errors
are found); Precision is a measure of the number of errors correctly detected over the total
number detected (how many of the errors found are actual errors).
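These two definitions, in code, reduce to ratios over the true-positive, false-positive, and false-negative counts (a minimal sketch):

```python
def recall(tp, fn):
    """Errors correctly detected over errors actually present."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def precision(tp, fp):
    """Errors correctly detected over errors flagged."""
    return tp / (tp + fp) if (tp + fp) else 0.0
```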
5.2.3 Aligning the Source and Output: Recognition Errors
For the purposes of proof of concept, all error tags were manually collected and recorded
based upon the output of the error-detection heuristics. While generally an objective
process, on occasion situations arose that required a choice between which words to count in
error, and how many errors to record. In many cases, such as split or merge errors, what
might appear as one or more errors in the output document, corresponded to a different
number of words in the source document. In striving for consistency between all error
determinations, the following conventions were adhered to: given a split error, the error
count remains one, regardless of the number of words the source word was erroneously split
into; given a merge error, the same process applies, regardless of the number of consecutive
words erroneously compressed from the source document. For example, “recognize” may be
misrecognized as “wreck a nice”, or vice versa. In either instance the error count is one.
In some cases, multiple, consecutive errors may be identified by the error-detection
system; occasionally these are the effect of cascading errors. In these situations it can be a
complicated task to align the source document with the output from the detection system,
resulting in some interpretation on the part of the human analyser. Where the errors
involved content and stop words it was often difficult to determine whether such errors
constituted more than one. In these cases, tagged errors extending over six words (stop
words included) were counted as two errors, and similarly for every three errors occurring
consecutively beyond that.
As a final note, all tools were designed and run on a Mac G4, 1.5 GHz, OS X 10.3.9.
5.2.4 Calculating Co-Occurrences
The generation of a word’s or concept’s context in terms of pairs of co-occurring words or
concepts is necessary throughout this research. Given a word, w, occurring in a document,
d, a context window, C(w, d, n), is defined as the n words occurring to either side of w in d.
This technique is used to generate the training corpora for various window sizes, as well
as to analyze the test cases and compare them to the training database. A sample selection
of the co-occurrence relations for the word "quadriceps" from the training corpus is provided
in Table 5.1. For example, "quadriceps" occurred in the training corpus 123 times, and
co-occurred with the term "patellar" 32 times, for a frequency of 32/123 = 0.26.
Table 5.1: Co-occurrence statistics for “quadriceps”.
term          context      count    word count    term freq.
quadriceps    included         1           123          0.01
quadriceps    mechanism        1           123          0.01
quadriceps    patellar        32           123          0.26
quadriceps    tendon          38           123          0.31
quadriceps    tendons         50           123          0.41
quadriceps    vastus           1           123          0.01
As an example of the co-occurrences for a particular sentence, consider the incorrect
sentence fragment in Sentence 1:
...possible spondylolysis eye laterally of L5... (Sentence 1)
We can generate the following co-occurrences for the target word, “eye”, with a context
window of two (up to two words to either side of “eye”):
eye possible
eye spondylolysis
eye laterally
eye L5
Note that a stop list, as discussed in Section 3.7, is employed in all statistical calculations
described here.
In the next sections, the individual heuristic implementations and their results are ex-
amined.
5.2.5 The Error-Detection Algorithms
The error-detection algorithms in the proof of concept were chosen to cover all recognition
error types, as per Section 3.1.2. Based on the study of error-detection methods in Chapter
3, these were inspired by algorithms shown to have some success in other domains, as well
as original algorithms based on unique adaptations of other natural language processing
techniques.
There are a number of potential ways in which information about the likelihood of an
error can be determined. The aim is to explore, develop, and evaluate as many error-detection
heuristics as possible for use in this system. These include the following:
• Conceptual/semantic similarity.
• Semantic relationships, such as thematic roles and levels of abstraction.
• Syntactic analysis.
• Word occurrence probabilities.
5.2.6 Conceptual Similarity
The Semantic Distancer
Overview The method of conceptual similarity developed here was inspired by the
work of Rada and Blettner [113] and Caviedes and Cimino [23]. It is a simple system that
identifies the concepts within a radiology report by applying the NLM’s MMTx software.
The average distance each concept is from its context, and from the general topic of the
report (for example, anatomical region of study) is determined using the UMLS. The final
result is a confidence ranking of the concept itself. If a concept differs too drastically from
the surrounding concepts or the topic marker (i.e. the distance exceeds a certain threshold),
it is considered a recognition-error candidate.
Materials Central to the functioning of the semantic distancer is an ontological knowl-
edge base and a means for extracting and mapping concepts within radiology reports to this
ontology. The UMLS, v2005AB2 (see Section 5.1.1), and its corresponding MMTx soft-
ware, a program that maps biomedical text into UMLS concepts, have been chosen for this
purpose.
The MMTx is a Java implementation of the original MetaMap software intended for
public access3. Based on the same algorithms as MetaMap, in general, MMTx produces
equivalent output. Known differences stem from the use of a third-party tagger in MMTx,
but are not considered relevant to this work.
The MetaMap algorithm applies a shallow parse to an input text, and uses the resulting
phrases to determine all variants of the terms within each phrase from the SPECIALIST
lexicon. The Metathesaurus is then consulted to generate a candidate list of all concepts
that match those variants. The candidate list is ranked based upon the weighted average
of four metrics, including the degree of variation between the variant and the original term,
and the degree of match between the candidate and the text [4, 3]. The output is a list of
the top candidates ordered by match strength.
The UMLS was obtained via DVD directly from the NLM. Those files maintaining the
inter-conceptual relationships (e.g. parent and sibling relations) were transferred into a
local-access database using MySQL, v5.0.16. Note that all text manipulation and reformat-
ting of the UMLS and of the radiology reports was done in Perl, v5.8.1-RC3.
The entire corpus of 30 test reports is used in this experiment. The training corpus was
not used.
Method In order to determine the relevant concepts for analysis, each report is first
run through the MMTx software to produce a Prolog-compatible output list of concept
candidates. To simplify this preliminary implementation, only the top concept candidates
are kept in all cases. Without re-ranking the candidate list, this seems the best course
of action in lieu of testing multiple candidates for each concept in the text, which would
quickly result in an exponential growth of the search space. This leaves open the possibility
of incorporating these candidate lists more fully into the analysis and is discussed in Section
6.9. Where no candidate list is produced, yet MMTx identifies a probable concept, the
associated text is tagged as "unknown". The candidates themselves are in the form of
Concept Unique Identifiers (CUIs), a UMLS-specific unique identifier that exists for each
concept within the ontology.
2 2005AC was released during the course of this research; however, it was decided that, due to changes in the lexicon, an upgrade would risk further inconsistency in the results and was not necessary.
3 http://mmtx.nlm.nih.gov/FAQ.shtml Accessed: February 2006; Updated: July 2004.
Once a candidate list for all concepts in the source report has been identified, the CUIs
for each are then extracted. The context of each target concept is determined based on
a context-window size, and a list of CUI pairs is produced based on the target concept
and the concepts with which it co-occurs. For each target concept an additional CUI pair
is added to represent the relation of the target concept to the overall topic concept of the
report: In all cases, the test reports contain a title sentence that identifies the anatomical
region of focus. This title sentence is used to manually create a topic identifier in the form
of a CUI from the UMLS, which is then paired with the target concept to create a final CUI
pair.
For each CUI pair a semantic distance value is calculated using a distance calculator I
have designed in Perl, called sem dist. Using a reverse breadth-first search, sem dist will
search the UMLS MySQL database for a common parent of the two CUIs, CUI1 and CUI2.
Starting at CUI1, the algorithm searches up one level in the tree for all parent concepts of
CUI1, while doing the same for CUI2 in parallel. This generates two parent lists, P1 and
P2, which are compared. If no common parent is found, then starting with P1, the parent of
each CUI ∈ P1 is systematically calculated and compared to a list of the nodes traversed so
far. If a match is encountered, then the steps from CUI1 and CUI2 to the common parent
are counted and totaled.
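The search just described can be sketched in Python as follows. This is an illustration only: the actual sem_dist was implemented in Perl against the UMLS MySQL database, whereas here the hierarchy is a hypothetical in-memory mapping from each CUI to its parent CUIs.

```python
from collections import deque

def sem_dist(parents, cui1, cui2):
    """Bidirectional upward BFS: count the steps from cui1 and cui2 to
    their nearest common ancestor. Returns None when the search space is
    exhausted without finding a common parent (the "NULL" case)."""
    if cui1 == cui2:
        return 0
    seen1, seen2 = {cui1: 0}, {cui2: 0}
    frontier1, frontier2 = deque([cui1]), deque([cui2])
    while frontier1 or frontier2:
        # Expand one level on each side in parallel, as in sem_dist.
        for frontier, seen, other in ((frontier1, seen1, seen2),
                                      (frontier2, seen2, seen1)):
            for _ in range(len(frontier)):
                node = frontier.popleft()
                for parent in parents.get(node, ()):
                    if parent in other:  # common parent found
                        return seen[node] + 1 + other[parent]
                    if parent not in seen:
                        seen[parent] = seen[node] + 1
                        frontier.append(parent)
    return None  # no link across vocabularies: label the pair "NULL"
```

Since the search only ever moves upward through parent links, it cannot revisit a cycle below it, which is the loop-avoidance property noted later in this section.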
The above calculation results in a distance score for each CUI pair that was created for
a target concept. Since the goal is a confidence ranking of the target concept, these scores
must be combined into a single ranking. As an initial aggregation result, the semantic-
distance results for each concept in a report were averaged to produce the final confidence
score. The topic concept can be used as an independent measure of error – the semantic
distance scores between the concepts in the body of the text and the topic marker can
indicate errors. Alternatively, the distance from the topic concept may be averaged with
the other semantic-distance results for a particular target concept in the body of the text
to create one measure of confidence. The topic distance may also be weighted to reflect
the theoretical difference between its effect on the final confidence score versus the score
produced via the semantic distance of the neighbouring words.
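The aggregation options just described might be sketched as follows. The topic_weight parameter and the treatment of "NULL" links are illustrative assumptions of this sketch, not values fixed by the thesis.

```python
def confidence_score(neighbour_dists, topic_dist, topic_weight=1.0):
    """Average the semantic distances to neighbouring concepts, folding in
    the (optionally weighted) distance to the report's topic concept.
    None entries model "NULL" links and are excluded from the average."""
    dists = [d for d in neighbour_dists if d is not None]
    if not dists and topic_dist is None:
        return None  # no usable evidence for this target concept
    total = sum(dists)
    count = len(dists)
    if topic_dist is not None:
        total += topic_weight * topic_dist
        count += 1
    return total / count
```

With topic_weight = 1.0 this reduces to the plain average described above; raising or lowering it models the weighted variant in which the topic distance contributes differently from the neighbouring-word distances.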
One of the benefits of this approach is that it avoids the problems associated with loops
in the ontology4. In this implementation, the similarity of two unique terms, or CUIs, is
determined by working up through the database, following any relevant parent links until
an intersection is found. Since this process is always going up, it is not possible to get
trapped in a loop, as all terms must have a parent. If the search space ends (i.e. a root, or
even a sub-root concept is detected), the search terminates and the linking CUI is labeled
as “NULL”.
Results and Discussion For each concept detected in a report, the semantic distancer
returned a distance measure from the surrounding words defined by the window size, and a
separate distance measure from the topic concept marker. These results were then manually
analysed to determine the errors indicated.
After running the experiment on the test cases, it was found that as many as 10% of the
concepts within some reports were tagged as “unknown”. Compounding this, when applied
to these reports the results of the semantic distancer were inaccurate due to unrecognized
concepts, which were scored as zero. As a first step to recover from this problem, such results
were excluded from the calculation of the final average. Unfortunately, further investigation
revealed a discrepancy between the MMTx and the current UMLS: As new versions of the
UMLS are released, concepts that are considered obsolete are removed, or replaced with
more accurate ones. It is often the case that the MMTx maps report concepts to these
“retired” CUIs. Thus, when such CUIs are referenced in the MySQL database there are no
longer entries for those concepts and therefore they cannot be used for the semantic distance
calculation. Due to time constraints, this remains an open problem for future work and is
discussed in Chapter 6.
Lastly, as an unfortunate consequence of the UMLS design (a compilation of source
vocabularies) a small number of concepts could not be linked via the parent link provided in
the UMLS relationship database. Such concepts were reserved by their source vocabularies
and thus could only be linked to other concepts within that source vocabulary. Occasionally,
the concept of interest lay in a different source vocabulary and could not be linked to across
vocabularies via existing relationships. While common links do exist for all concepts at
the Semantic Network level, the sensitivity to differences between concepts at that level
was insufficient. The broad granularity of the Semantic Network meant that all concepts were
4Caviedes and Cimino [23], among others, have observed such loops within the UMLS.
generally within a short distance of one another or had identical distances, and thus the
calculation was not useful. Also, the nature of the UMLS is such that the source vocabularies
are still governed by their own access rights. This project was limited to those source
vocabularies for which access was free; consequently, the resulting ontology is somewhat
fractured.
In an effort to minimize the impact of unusable concepts within a target concept's window,
the window size is limited to collocations (defined here as a concept and its immediate
neighbour). Two consecutive concepts, however, often differ locally in meaning, resulting
in unusable semantic distances: two words side by side may be conceptually distant from
one another, despite each being related to the surrounding sentence when it is considered
in its entirety.
In contrast, the inclusion of more concepts in a large context window can smooth out the
normal degree of variation among local concepts so that only those that are exceptionally
distant are actually able to trigger an error tag.
The combination of these factors has resulted in an incomplete implementation of the
semantic-distance heuristic. The sample of results untouched by any of the above issues
and compiled with a reasonable context-window size (at least three, to minimize the local
variations mentioned above) was too small to be of any value. Nonetheless, of that small
set there were examples of out-of-place words that were conceptually
unrelated to the report that showed very low confidence scores. This, combined with the
results in Caviedes and Cimino [23], provides support for the underlying semantic-distance
concept and indicates further work in this area will prove fruitful.
In conclusion, this remains an open problem due to implementation details, and not
issues related to the underlying concept.
5.2.7 Semantic Grammar
Since the needed analysis of the semantic roles of the concepts within the radiology domain
is an extensive project, an implementation of the semantic-relationship analysis falls out of
the scope of this thesis (and is not necessary to establish our proof of concept). Despite this,
the parser discussed in the following section, Section 5.2.8, has been designed to support
semantic constraints, such as thematic roles, and roles specific to radiology. Thus, once
the above analysis is complete, it will be a straightforward task to augment the existing
syntactic parser. This is discussed in more detail in Chapter 6.
5.2.8 Syntactic Analysis
Overview As discussed in Section 4.4.2, the use of stop lists and surface-level anal-
ysis prevents statistical-based methods from achieving 100% efficacy in error detection. A
syntactic parser, however, can be used to identify syntactic errors, including those which
involve stop words and deletions.
With this in mind, a parser was developed to analyse radiology reports. In the interest
of rapid prototyping sufficient for proof of concept, the parser was built upon a constraint-
handling-rules grammar, or CHRG [29] and inspired by Property Grammars [10]. Dahl and
Blache [35] demonstrate this combination of grammar formalisms to be a robust option, with
the ability to handle various levels of granularity, as well as incomplete and incorrect input.
As discussed in Section 4.4.2, such flexibility is necessary to handle incomplete sentences
and the note form often found in radiology reports. Furthermore, by characterizing the
grammar as a series of properties, the properties constraining the language within radiology
reports are easily captured.
The parser’s design has left open the possibility of extending the constraint base to
include semantic constraints. This involves interfacing with an ontological knowledge source,
such as the Unified Medical Language System (UMLS) [15], to obtain the semantic properties
of phrases which can be used to test semantic-based constraints, as mentioned in Section 5.2.7. For
example, a verb may be restricted to apply to only anatomical concepts.
Materials This experiment uses MMTx for an initial partial parse of the text (see
Section 5.2.6). The main parser was developed in SICStus Prolog, v3.12.3 under a temporary
student license, using SICStus's built-in constraint handling rules (CHR) implementation and
Henning Christiansen’s CHR grammar (CHRG) system, v0.1 [29]. For those unfamiliar,
a brief introduction to CHRs is provided in Appendix A, while a more in-depth introduc-
tion to CHRs and CHRG is provided in Fruhwirth 1994 [50] and Christiansen 2005 [29],
respectively.
All 30 test reports are used in this experiment.
Method As a preprocessing step, each test report was run through MMTx, a program
that maps biomedical text into UMLS concepts5. MMTx provides semantic information
for each report in the form of UMLS Concept Unique Identifiers (CUIs), part-of-speech
5Available at http://mmtx.nlm.nih.gov/ Accessed: February 2006; Updated: February 2006.
tagging, as well as basic phrasal information. The tagging was particularly important as
MMTx includes a tagger trained on medical texts. Since a tagged, training corpus was not
available to train a tagger, this was an invaluable resource. As an example, the phrase “of
the thoracic spine”, once passed through MMTx and a pre-processor (which modifies MMTx
output for input to the error-detection parser), is returned as the following:
rep_phrase(1, 'of the thoracic spine',
    [prep([tag(prep), tokens([of])]),
     det([tag(det), tokens([the])]),
     mod([tag(adj), tokens([thoracic])]),
     head([tag(noun), tokens([spine])])],
    ['C0581269', ..., 'C0024659'], 3, 4).
As a second level of analysis, the parser was created in SICStus Prolog using CHRG
[29] and Property Grammars [10], a means for representing the structure of language as
properties constraining the allowable constructions within that language. The modified
reports are input to the parser and analysed according to a grammar created atop the
CHRG formalism and inspired by property grammars6.
Based on each phrase identified via rep_phrase/6, the parser first performs a series of
property checks to determine the appropriate phrase type. Each phrase type has its own rule
set defining its specific properties. Unique to property grammars, the properties defining the
allowable constructs within the grammar can be tagged as “relaxable” [36]. While needing
to relax a property is likely to indicate an error (i.e. an incorrect term or an incomplete
phrase), the parse is able to continue and information regarding the nature of the error is
collected (i.e. those properties that were not satisfied). The result is a robust parser that
does not fail in the face of errors. This is an ideal solution for error detection making it
possible to detect and locate errors within the text.
When parsing, each “rep phrase” is compared to the properties within the grammar
to identify a phrase-type candidate. When identified, the phrase is added as a phrase
constituent to the constraint store. In some cases, the property check is pre-empted by
a keyword that triggers the automatic assignment of a phrase constraint. For example,
auxiliary verbs such as “is” are immediately tagged as phrases of type “is”. If no keyword
6The grammar developed was not intended as a linguistically robust representation of English, but rather as a functional implementation of the characteristics of radiology reports. Thus, there are some deviations from typical parse-tree constructions attributed to English sentences, in the interest of computational feasibility and speed of development. Future iterations of this parser will see a more in-depth analysis of the underlying linguistic properties, and a more careful eye to the elegance of the resulting formalism.
is detected, then the phrase is passed on to the property check. There are three possible
cases that result from the property checks.
In the first case, all of the requisite properties are observed and the phrase is successfully
created with the matching phrase type. Since Prolog works from top to bottom when
analysing rules, the properties are tried in the order they are presented in the grammar
formalism. Thus, it is important to differentiate the rules representing various phrase types
by a unique list of properties; where phrase rules are ambiguous, careful consideration must
be given to the order presented since the first phrase to match will be the one added to the
constraint store, even if a phrase later in the Prolog listing is also possible. This latter phrase
will only be tried should the parse fail and have to backtrack to the phrase assignment rules.
In the second case, none of the required properties for any of the phrase types are met
and the attempt fails. The phrase is then tagged and added to the constraint store as
“unknown”.
In the third and final case, one or more properties labeled as “relaxable” may not be
met. Being relaxed, these properties are added to a list of unsatisfied properties but do
not halt the parse. As a result the parse will continue until all properties for the current
phrase-type are met or are tagged as “relaxable”, or until a non-relaxable property is not
satisfied. In the latter case, the phrase-type rule will fail and the next phrase type will be
tried. If a property check succeeds, then, as mentioned above, a constituent is added to the
constraint store that represents the phrase, phrase type and a list of the relaxed properties
that were unsatisfied.
Beyond the phrase-type identification, the rules of the grammar are defined via con-
straint handling rules (CHRs). After each change in the constraint store, the CHRs are
consulted and, wherever applicable, constraints are modified according to these rules and
the constraint store is updated. In this way the parse is completed, conjoining sub-phrases
as permitted by the CHRs. When no further changes are possible the system has “settled”
and the current contents of the constraint store are output. During the parse, the system
maintains a list of all “unknown” constituents. These are also output at the end.
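As an illustration only, the settle-until-fixpoint behaviour of the constraint store can be mimicked in Python; the single toy rule below mirrors the simplified np/vp rule shown later in this section. The real implementation relies on SICStus CHR, not code of this form.

```python
def settle(store):
    """Repeatedly apply propagation rules until no new constraints are
    added (the store has "settled"). Constraints are tuples such as
    ('np', X, Y), a noun phrase spanning word positions X..Y."""
    rules = [
        # constit(np,X,Y), constit(vp,Y,Z) ==> constit(s,X,Z):
        # an np immediately followed by a vp yields a sentence.
        lambda store: {('s', x, z)
                       for (t1, x, y) in store if t1 == 'np'
                       for (t2, y2, z) in store if t2 == 'vp' and y2 == y},
    ]
    changed = True
    while changed:
        changed = False
        for rule in rules:
            new = rule(store) - store
            if new:           # store grew: consult the rules again
                store |= new
                changed = True
    return store

store = settle({('np', 0, 2), ('vp', 2, 5)})
# store now also contains ('s', 0, 5) alongside the original constituents
```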
The interpretation of the results for error detection is currently performed manually
for the purposes of this experiment. Errors can take three forms given the parser output:
phrases tagged as “unknown”, unsatisfied property lists, and incomplete parse segments.
“Unknown” tags represent words or phrases that went unrecognized by MMTx, or subse-
quently could not be assigned a phrase type by the parser.
The following is an example property check for a verb phrase:
vp_properties(CUI, L, L2, L3, UnsatX, S, F) :-
    Unsat = [],
    (   has_x(verb, L), append([], Unsat, Unsat2)
    ;   relax(has_verb), append([has_verb, S, F], Unsat, Unsat2)
    ),
    UnsatX = Unsat2.
This rule enforces the property of having a verb in order for a phrase to be considered a
verb phrase. If a verb phrase is expected but no verb is present, the parse can nonetheless
proceed by relaxing this property and adding it to the unsatisfied property list (represented
by Unsat). The information on the unsatisfied properties is then available at the end of the
parse. For the purposes of error detection, all properties were marked as relaxable.
Next is an example constraint handling rule:
constit(np,X,Y), constit(vp,Y,Z) ==> constit(s,X,Z).
The preceding rule activates when an np and a vp (noun phrase and verb phrase, respectively)
are present consecutively in the constraint store (that is, from X to Y, and Y to Z), and
adds a further constraint to the store that represents a sentence, or s, across those words
(that is, from X to Z). This is a simplified version of the actual rule for the purposes of
readability here.
Table 5.2: CHR parser results on all error types.

                              Accuracy              Test Corpus
    Error Subset       Recall  Precision  f-measure    Size
    All Errors          29%      34%        32%          30
    Syntactic Errors    71%      17%        27%          30
Results Table 5.2 shows the result of applying the CHR parser to the 30 test reports.
When restricted to syntactic errors, the recall improves considerably.
to a large drop in the precision, this is attributable in part to the measurement of the
precision over all possible errors. Essentially, the set of correctly-detected errors is reduced to
include only those that are syntactic, while maintaining the total number of errors detected
(which may include correct detections of non-syntactic errors).
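For reference, the f-measure reported here can be read as the balanced harmonic mean of precision and recall (an assumption consistent with the table: 71% recall and 17% precision yield roughly 27%):

```python
def f_measure(recall, precision):
    """Balanced f-measure: the harmonic mean of precision and recall."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

print(round(f_measure(0.71, 0.17), 2))  # 0.27
```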
Some of the undetected errors were attributable to errors introduced at the MMTx
level. In some instances, concepts were recognized and assigned a CUI yet still tagged as
“unknown”: from the MMTx perspective, this distinguishes terms that were found
in the Metathesaurus but not in the SPECIALIST lexicon (and are thus “unknown”). As a
result, given the sentence, “This examination extends from the T9 and T10 disc space to the
S2 and S3 level.” the terms “T9” and “T10” are assigned the correct CUI values, indicating
that they were correctly identified in the UMLS Metathesaurus, yet they are tagged with
“unknown”. Since the parser relies on “unknown” tags as an indication of an error, this
falsely indicates “T9” and “T10” both as errors.
Discussion While the parser performs poorly on the entire error set, the recall for
syntactic errors only is noteworthy. Developing the parser further will improve this result,
and refine the precision score. However, these results are of particular interest as they
show a strong affinity for syntactic errors, which will be useful in the hybrid approach.
Furthermore, by analyzing on the basis of syntax it is possible to identify stop word errors,
which are typically ignored by other methods (i.e. statistical-based methods, and semantic
analysis).
Though the parser takes longer than the following statistical techniques (up to three
minutes in the worst case on exceptionally long sentences), there is no overhead cost such as
that associated with generating the co-occurrence statistics. Also, in all cases the slow run time was
attributable to the preliminary nature of the parser and will improve with future iterations.
5.2.9 Word Occurrence Probabilities
Two probabilistic techniques for the proposed hybrid method have been developed, namely
co-occurrence relations and Pointwise Mutual Information (PMI). Underlying both is the
key notion that through identifying patterns common to error-free reports, inaccuracies in
novel reports can be automatically detected. The theory underlying both of these techniques
is discussed in Chapter 4.
As part of the setup for both co-occurrence analysis and PMI, co-occurrence statistics
for varying window sizes have been compiled from the 2751 anonymised MRI reports. Recall
that in co-occurrence analysis, stop words are usually omitted, since their overabundance in
a text can negatively affect the resulting probabilities and limit overall error detection.
Co-Occurrence Analysis
Overview As mentioned above, patterns within error-free reports can be used to
detect errors within novel reports. One means for identifying these patterns is via co-
occurrence relations [81, 96, 131], a statistical method for determining the number of times
a word occurs in a specific context window. Given a sufficiently representative training
corpus, we can associate words with particular contexts based on that corpus. We can then
apply these word-context statistics to determine the probability of a word occurring in a
given context in a report. If that probability falls below a certain threshold the word will
be flagged as a possible error.
Materials As an experiment on the effect of training data on statistical-based error
detection, a further test was run on the basis of splitting the training corpus into several
training sets: the full 2751 reports, as well as those obtained from dividing by section and
dividing by report type (i.e. anatomic region being studied). These divisions reflect the
observations that the type of words found in the “Findings” and “Impressions” sections
may differ from the “History” section, while the type of words found within a knee report,
for instance, are not as likely to occur in a report of the shoulder. Thus, by training and
testing these separately, there is no risk of dilution from other report types, increasing the
accuracy.
The final training sets include: all reports; reports separated into the “Findings” and
“Impressions” sections; and reports of the spine. To ensure adequate statistical represen-
tation, the training sets are restricted to those containing 800 or more reports. Of the
2751 reports divided by anatomic region, only “spine” had enough cases to meet the 800-
minimum requirement. Separate co-occurrence statistics are generated for each training set
based on the current context window size.
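The one-time compilation of co-occurrence statistics might be sketched as follows. The whitespace tokenization and the tiny stop-word list are illustrative placeholders, not the actual preprocessing used in this work.

```python
from collections import Counter

STOP_WORDS = {'the', 'of', 'a', 'is', 'and'}  # illustrative subset only

def cooccurrence_counts(reports, window=1):
    """Count, for each word in each training report, the words appearing
    within `window` positions of it, with stop words omitted."""
    word_counts = Counter()
    pair_counts = Counter()
    for report in reports:
        words = [w for w in report.lower().split() if w not in STOP_WORDS]
        for i, w in enumerate(words):
            word_counts[w] += 1
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[(w, words[j])] += 1
    return word_counts, pair_counts
```

Once generated for a given window size, these counts can simply be stored and referenced, matching the one-time overhead cost described in the results.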
Method In the testing phase, a corpus of 30 uncorrected/corrected, anonymised report
pairs was obtained from the CDC using Dragon NaturallySpeaking7. For each uncorrected
report the context of each word and the relevant co-occurrences are determined. The ap-
propriate collection of co-occurrence statistics from the training data is then applied to
7The experiment on the effect of the training data on system performance was done prior to obtaining the full test corpus, and was instead based on a 20-report corpus subset of the full test corpus. See the results section for more information.
determine the relevant probabilities of the co-occurrences in the test report8.
Using Bayes’ Theorem (Equation 4.4, repeated in Equation 5.1 for convenience), it is
possible to combine the probability of each word that occurs within the context window of
the target word, and the probability of the target word itself, where wt = target word, and
C = context words. Bayes’ Theorem is a formula that allows us to calculate conditional
probabilities: the probability of an event, A, given the knowledge that another event, B,
has already taken place. In simpler terms, this means that the probability of our “event”,
the target word wt, can be calculated in terms of the probability of another “event”, the
context C. Since the target word and the context are closely related, this is an informative
calculation.
P(wt|C) = P(wt) · P(C|wt) / P(C)                    (5.1)
The expression P (wt|C) is read “the probability of wt given C”. The probability of the
target word, P (wt), is equal to the probability of occurrence in the training corpus. Since
we have already observed the context of the target word, we know that its probability of
occurring is 100%, thus P (C) = 1. Finally, we can calculate P (C|wt), the probability of the
context C occurring given the target word wt, using the Principle of Joint Probability, as
discussed in Section 4.4.3:
P(C|wt) = P(wt) P(C1, . . . , Cn|wt)                by Joint Probability
        = P(wt) P(C1|wt) · . . . · P(Cn|wt)         (5.2)
        = P(wt) ∏_{i=1}^{n} P(Ci|wt)
With this information we can now calculate our desired probability, P (wt|C).
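A minimal Python sketch of this scoring, taking the naive-Bayes form P(wt) · ∏ P(Ci|wt) with P(C) = 1 as read from Equations 5.1 and 5.2; the relative-frequency estimates and the absence of smoothing are assumptions of the sketch (so unseen word-context pairs score zero, matching the data-sparseness behaviour discussed in the results):

```python
def bayes_score(target, context, word_counts, pair_counts, n):
    """Score a target word against its context window using training-corpus
    counts: P(wt) estimated as count(wt)/n, and each P(Ci|wt) estimated as
    count(wt, Ci)/count(wt). Unseen words or pairs yield a score of 0."""
    ct = word_counts.get(target, 0)
    if ct == 0:
        return 0.0
    p = ct / n  # P(wt): relative frequency in the training corpus
    for c in context:
        p *= pair_counts.get((target, c), 0) / ct  # P(Ci|wt)
    return p
```

A target word is then flagged as a possible error whenever its score falls below the threshold k.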
For example, applying Bayes’ theorem to the sentence fragment in Section 5.2.4, Sentence
8Note that the training corpus statistics must be calculated on the same context-window size. Thus, while it is possible to change window sizes, doing so requires a recalculation of the training corpus statistics.
[Figure: bar chart comparing recall, precision, and f-measure (percentage) for the four training sets: All Reports, Findings, Impressions, and “Spine”.]
Figure 5.1: CA results based upon report type.
1, yields the following:
P(eye | possible, spondylolysis, laterally, L5)
        = P(eye) · P(possible, spondylolysis, laterally, L5 | eye)        (5.3)
Once we have obtained the value of P (wt|C) via Bayes’ Theorem, it can be compared to
a threshold value, k, flagging those target words, wt, where P (wt|C) < k. Thus, those words
in a report are captured whose occurrence in their context window is highly improbable.
This improbability reflects the likelihood of a recognition error.
For example, after processing Equation 5.3 we have P(eye | possible, . . . , L5) = 4.37067E−07,
a correspondingly low value that reflects the unlikelihood of “eye” occurring in that context.
Assuming an appropriate threshold k, this word is flagged as an error.
Results All results were collected by a manual analysis of the co-occurrence analyser’s
output. The graphs presented in Figures 5.1, 5.2, 5.3 and 5.4 are based on the data tables
in Appendix C.
[Figure: recall (percentage) versus threshold (0, 5.00E−06, 5.00E−04) for collocation, window size 1, and window size 10.]
Figure 5.2: CA recall results for 3 window sizes.
Since splitting by report type seems to indicate a generally positive impact, following
the experimental results obtained in Figure 5.1, all subsequent experiments were run on
the “Findings” and “Impressions” sections simultaneously. Since these sections share
similar language usage, combining them compensates for the lack of text in using the
“Impressions” section alone. Without more training data, the “spine” category was deemed
too small at this stage for accurate analysis.
The system is able to identify error candidates in under a minute in all cases, under-
scoring its viability for real-time use. There is a one-time overhead cost associated with
generating the co-occurrence statistics for the training sets. Once generated, however, the
database is simply stored and referenced. Re-generation would only occur if new training
data were added.
Figure 5.1 demonstrates the effect of splitting the training and test corpus by report type
and section. Note that this experiment was performed prior to obtaining the full 30 reports
in the test corpus. Therefore, the results in Figure 5.1 (and in Table C.1 in Appendix C)
were run based on a 20-report corpus subset of the full test corpus.
[Figure: precision (percentage) versus threshold (0, 5.00E−06, 5.00E−04) for collocation, window size 1, and window size 10.]
Figure 5.3: CA precision results for 3 window sizes.
[Figure: f-measure (percentage) versus threshold (0, 5.00E−06, 5.00E−04) for collocation, window size 1, and window size 10.]
Figure 5.4: CA f-measure results for 3 window sizes.
Discussion The high recall in Figure 5.2 reflects a high sensitivity to errors and a low
rate of false negatives. This is especially important as errors missed could have serious ram-
ifications. In contrast, the precision is low, indicating a high rate of false positives, as seen
in Figure 5.3. Although still important overall, false positives are nonetheless identifiable
by the radiologist and do not affect report quality. In most cases these false positives are
generated by data sparseness, that is, word-context pairs that were not previously encountered
in the training data (cf. Section 6.8). Thus we have P(C|wt) = 0, which results in
P (wt|C) = 0 by Equation 5.1. Evidence for this is seen in the “Impressions” data set, which
typically held the smallest amount of text, and the smallest training set. Correspondingly,
it has the lowest precision rate shown in Table 5.1. Increasing the number of reports
in the training corpus, however, would ensure greater coverage of the terms that typically
occur in a radiology report. This would cause the rate of false positives to drop and improve
the precision. While the ideal training corpus would contain every possible context of every
possible word in a radiology report, radiology reports do not in practice exhibit wide
variation; a fairly accurate depiction of the possible patterns within a report is feasible
with a large enough training set. Interestingly, though, some false positives may be
advantageous, indicating rare occurrences that merit closer inspection by the radiologist to
ensure there are no mistakes.
Separating the training and testing data by section has a positive impact, shown in
Table 5.1, though further testing is needed. This result is encouraging as “Impressions”
is the section most likely to be read by the referring physician. As mentioned above, the
lower precision for “Impressions” is explained by the typically small amount of text in this
section. Thus, while separating by type improved recall, overall the training set was still
too small for a fully effective analysis and must be followed up with more data.
The rate of error detection, or filtering, is affected by the threshold value, k. Higher
values of k mean less filtering and a higher WER, while lower values of k mean greater
filtering and a lower WER. In this way it is possible to increase the recall level to near 100%;
however, there is a corresponding loss of precision. Nonetheless, this does allow for some
flexibility in balancing between the recall and precision measurements.
Unlike the syntactic grammar discussed in Section 5.2.8, this analysis, as with other
statistical methods, omits stop words, or low-information-bearing words. These words are
ignored because a mis-recognized stop word rarely entails a shift in the intended semantics.
Exceptions exist, however, such as a substitution of “and” for
“at the”, that may have more serious consequences in medicine, and may prove difficult for
human editors to detect.
The choice of threshold value was a matter of trial and error. In the end, a minimal or
zero threshold gave the best results in light of the already low precision scores. If a larger
training corpus improves the precision score, a potentially more appropriate threshold could
be chosen. Similarly, the size of the context window was also chosen by trial and error. The
best output was obtained with a window size of one, reflecting the highest recall balanced
with the highest precision. Future experimentation with a larger variety of window sizes
will determine if a better value is possible.
Still, these results are encouraging, and demonstrate the feasibility of post-processing
error detection as a means to recover from the low accuracy of ASR in radiology.
Pointwise Mutual Information
Overview As a comparative measure, the PMI heuristic was developed according
to the work in Inkpen and Desilets [75]. Like the co-occurrence method above, given a
sufficiently representative training corpus, it is possible to derive word probabilities based
on the probability of occurrence within that corpus. Similarly, the probability of a word
co-occurring with another word within a particular context window can be determined by
the frequency of such a co-occurrence within the training corpus. The probability of two
words occurring independently, versus the rate at which they occur together, provides a
measure of independence that can be used to determine the likelihood of a word occurring
in a given context in a report. If that measure falls below a certain threshold, the word
will be flagged as a possible error.
Materials The training corpus was based on the full corpus of 2751 reports. As
needed, separate co-occurrence statistics for varying context-window sizes were generated.
Method As described in Inkpen and Desilets [75], a semantic similarity score between
two words, w1 and w2, is based on the shared information load of both words. Here
“information load” refers to contextual predictivity, that is, the notion that a word can be
predicted by its preceding word. Equation 3.7 shows the calculation of PMI for two words
(and is repeated in Equation 5.4 for convenience): C(w1, w2), C(w1) and C(w2) represent
the frequency of occurrence (in the training corpus) while n is the total number of words
in the corpus [75]. Therefore, the PMI semantic similarity measure is a reflection of the
probability of two words occurring together and the individual probability of each word
occurring in the training corpus, where “together” is limited by the defined context-window
size [75].
PMI(w1, w2) = log( P(w1, w2) / (P(w1) · P(w2)) )
            = log( C(w1, w2) · n / (C(w1) · C(w2)) )   (5.4)
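Equation 5.4 can be sketched in a few lines of Python. This is an illustrative reimplementation from the formula, not the thesis's actual code; the tokenized training corpus and the helper names (build_counts, pmi) are assumptions.

```python
import math
from collections import Counter

def build_counts(corpus, window=1):
    """Count unigrams and word pairs co-occurring within `window` words."""
    unigrams, pairs = Counter(), Counter()
    for tokens in corpus:
        unigrams.update(tokens)
        for i, w in enumerate(tokens):
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                pairs[frozenset((w, tokens[j]))] += 1  # unordered pair
    return unigrams, pairs

def pmi(w1, w2, unigrams, pairs, n):
    """PMI(w1, w2) = log(C(w1, w2) * n / (C(w1) * C(w2))), as in Eq. 5.4."""
    c12 = pairs[frozenset((w1, w2))]
    c1, c2 = unigrams[w1], unigrams[w2]
    if 0 in (c12, c1, c2):
        return 0.0  # unseen word or pair: "no similarity" by default
    return math.log(c12 * n / (c1 * c2))
```

Unseen words or pairs yield a PMI of zero here, mirroring the system's default of "no similarity" for incalculable values discussed below.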
The basic PMI calculation in Equation 5.4 applies to two individual words. In the case
of a document, however, the desired outcome is the semantic similarity of an individual word
with respect to the context in which it occurs. Thus, PMI(w1, wordlist) is calculated
as follows: For each word, w, in an uncorrected report, d, the probability of that word
occurring in the training corpus, P(w), is determined. In addition, the co-occurrences
within C(w, d, n) are calculated for a given window size, n; that is, all tuples comprised
of w paired with each member of the context window of w. For each co-occurrence,
the probability of that pair occurring in the training corpus is calculated9. This yields the
individual probabilities of each word with respect to its context, in other
words P(w1, w2). Given this value and the individual probabilities P(w1) and P(w2), the
PMI calculation in Equation 5.4 is applied to determine the semantic similarity between w1
and w2. To arrive at a single measure of PMI for a word, w, within C(w, d, n),
the results are then aggregated by averaging their probabilities over the size of the
context window [75] (as was done in Section 5.2.9).
Once the cumulative PMI value is obtained for each word, the results are normalized
by adding 100 to each value (removing any negative numbers in the dataset). The
final, normalized results are compared to a threshold value, k, flagging those target words,
wt, where P(wt|C) < k. Thus, we capture those words in a report whose occurrence in
their context window is highly improbable; this improbability reflects the likelihood of a
recognition error.
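The per-word scoring and thresholding just described might look as follows. The count structures, the strict-inequality test and the helper names are illustrative assumptions; the +100 normalization and the default k = 100 are taken from the text.

```python
import math

def word_score(word, context, unigrams, pairs, n):
    """Average PMI (Eq. 5.4) of `word` with each neighbour in its context
    window, normalized by adding 100 to remove negative values."""
    def pmi(w1, w2):
        c12 = pairs.get(frozenset((w1, w2)), 0)
        c1, c2 = unigrams.get(w1, 0), unigrams.get(w2, 0)
        if 0 in (c12, c1, c2):
            return 0.0  # unseen in training: treated as "no similarity"
        return math.log(c12 * n / (c1 * c2))
    if not context:
        return 100.0
    return 100 + sum(pmi(word, c) for c in context) / len(context)

def flag_errors(report, unigrams, pairs, n, window=1, k=100):
    """Flag report words whose normalized, window-averaged PMI is below k."""
    flagged = []
    for i, w in enumerate(report):
        ctx = report[max(0, i - window):i] + report[i + 1:i + 1 + window]
        if word_score(w, ctx, unigrams, pairs, n) < k:
            flagged.append((i, w))
    return flagged
```

With k = 100, only words whose average PMI with their context is negative (i.e. co-occurring less often than chance) are flagged.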
As with the co-occurrence analysis, the corrected and uncorrected test reports are
aligned to identify the actual errors. These errors are compared
9 As in the co-occurrence analysis, the training corpus statistics must be calculated using the same context-window size.
Figure 5.5: PMI recall results for 3 window sizes (recall versus threshold; collocation, window size 1, window size 10).
to the flagged errors from the program output to obtain the results.
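The alignment of corrected and uncorrected reports can be approximated with Python's standard difflib. This stand-in classifies mismatches into the usual substitution, insertion and deletion categories; it is not the thesis's actual alignment procedure.

```python
import difflib

def find_errors(uncorrected, corrected):
    """Align tokenized machine and corrected reports; return mismatches as
    (error_type, uncorrected_tokens, corrected_tokens) triples."""
    sm = difflib.SequenceMatcher(a=uncorrected, b=corrected, autojunk=False)
    errors = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "replace":
            errors.append(("substitution", uncorrected[i1:i2], corrected[j1:j2]))
        elif op == "delete":   # extra tokens in the ASR output
            errors.append(("insertion", uncorrected[i1:i2], []))
        elif op == "insert":   # tokens the recognizer dropped
            errors.append(("deletion", [], corrected[j1:j2]))
    return errors
```

For example, aligning "normal alignment with cassettes intact" against the corrected "normal A-P alignment with facets intact" yields one deletion ("A-P") and one substitution ("cassettes" for "facets"), matching the error types discussed in this chapter.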
Results As with the co-occurrence analysis, all results were collected by a manual
analysis of the PMI analyser's output. The graphs in Figures 5.5, 5.6 and 5.7 are based
on the data tables in Appendix C, and show recall, precision and f-measure against
the chosen threshold value for three separate window sizes: collocation (word
pairs/bigrams), and 1 and 10 words to either side of the target word, respectively.
Again, as with the co-occurrence analysis, the system is able to identify error candidates
in under a minute in all cases. The same one-time overhead cost of generating the
co-occurrence statistics for the training sets applies.
Discussion The results shown here do not reflect the same degree of success that
was seen in Inkpen and Desilets. This reflects both the difference in domains (meeting
transcriptions versus radiology reports) and the significantly smaller training set used here.
If a word is not found in the training data, then its probability, and the probability of any
co-occurrence tuples containing it, will be zero, resulting in an incalculable PMI value. By
default, the system sets these values to zero, indicating no similarity.
As in Section 5.2.9, the rate of error detection, or filtering, is affected by the threshold
Figure 5.6: PMI precision results for 3 window sizes (precision versus threshold; collocation, window size 1, window size 10).
Figure 5.7: PMI f-measure results for 3 window sizes (f-measure versus threshold; collocation, window size 1, window size 10).
Figure 5.8: PMI versus Co-occurrence Analysis (COA): recall, precision and f-measure.
value, k, which was established via trial and error. Here the results were normalized (to
avoid negative values) and the best overall results were obtained with a threshold of k = 100.
Since this is a corpus-based technique, it could, as described in Section 5.2.9, easily be
extended to other areas of medicine that share the restricted vocabulary seen in radiology,
provided an adequate training corpus is available.
5.2.10 Comparing Co-occurrence Analysis and PMI
Figure 5.8 compares the performance of the co-occurrence analysis and the PMI analysis,
based upon the best results obtained within each (that is, the window size and threshold
that yield the highest f-measure). As mentioned previously, incorporating multiple
techniques with the same error-type coverage yields more reliable results and, consequently,
a more robust system.
Both the co-occurrence and the PMI analysis techniques could easily be extended to other
areas of medicine that share the restricted vocabulary seen in radiology, provided an
adequate training corpus is available.
Figure 5.9: Combined heuristics on all errors based upon top f-measure (overall performance): best co-occurrence, best PMI, parser, and hybrid.
5.3 A Hybrid Approach
As a proof of concept of the proposed hybrid error-detection method, the above heuristics
have been applied in combination to the test corpus. The results in Figure 5.9 show each
(completed) heuristic applied to the entire error set (regardless of the error subset on which
the heuristic is capable of performing), to reflect its actual performance in the radiology
report setting. "Combined" refers to the application of all heuristics together in a report
analysis via the direct method described in Section 4.2. The combined result shows a 24%,
8%, and 14% increase in recall, precision and the f-measure, respectively, over the best single
heuristic technique, co-occurrence analysis, when compared according to highest f-measure
performance. The high increase in recall is perhaps the most promising as it demonstrates
an increased sensitivity to actual errors, and consequently a lower rate of false negatives.
Clearly, these results favour the hybrid method over previous, independent applications of
error-detection methods in ASR when applied to radiology reports.
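As a sketch of how such a combined run can be scored: treating each heuristic's output as a set of flagged word positions and taking their union is one plausible reading of the direct method (an assumption, not a quote of the implementation); precision, recall and f-measure then follow from the aligned ground-truth error positions.

```python
def combine_flags(*heuristic_flags):
    """Union of flagged word positions: a position flagged by any heuristic
    is flagged overall (one plausible reading of the direct method)."""
    combined = set()
    for flags in heuristic_flags:
        combined |= set(flags)
    return combined

def score(flagged, actual):
    """Precision, recall and f-measure of flagged vs. actual error positions."""
    flagged, actual = set(flagged), set(actual)
    tp = len(flagged & actual)
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(actual) if actual else 0.0
    f = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f
```

Because the union can only grow the flag set, recall never decreases when heuristics are added, while precision depends on how many of the new flags are genuine errors.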
5.4 Summary
This chapter has successfully demonstrated that the conceptual model presented in Chapter
4 is viable and offers the final, concluding evidence that post-recognition error detection can
improve the quality of speech recognition output in radiology dictation. In addition, the
hybrid approach to error detection was shown to be an improvement over any single error-detection
heuristic. In light of these conclusions, the next chapter examines the consequences
and corollaries of the research presented so far.
Chapter 6
Observations and Corollaries
6.1 Introduction
Given the findings in the preceding chapters, it should now be clear that post-ASR, hybrid
error detection is an effective means to recover from low recognition rates in radiology report
dictation. In this chapter, these findings are summarized, and the research questions posed
in Chapter 1 are re-examined. Finally, a critique of the hybrid methodology is provided,
including a list of challenges currently being faced, as well as a look at the implications of
this study and its impact on future studies.
6.2 The Findings
6.2.1 The Hybrid Error-Detection Methodology
The preceding chapters have demonstrated a successful application of the hybrid, multi-heuristic
algorithm, which achieved a performance increase of as much as 24% (recall score)
over any single heuristic technique tested. This shows convincingly that post-ASR, hybrid
error detection is an effective means to recover from low recognition rates in radiology
report dictation. In addition, a series of error-detection heuristics was evaluated
and applied to the problem of error detection in speech-recognized radiology reports. Each
heuristic was evaluated as applied to the entire set of possible errors within a report, as well
as to a subset of errors for which the technique was determined to be the most suitable.
For instance, since the probabilistic methods all employ a stop list, any errors involving
such words cannot be detected (unless they cause an additional error of a type detectable
by that algorithm). Thus, while the system may perform reasonably well when restricted
to its detectable error set, the goal is a system capable of detecting any error in a report;
performance on the entire error set is therefore the primary concern.
The individual results of the probabilistic heuristics are examined using a variety of
context window sizes as well as varying threshold factors controlling the degree of filtering
(i.e. the percentage of words actually tagged as errors), including a study on the effect
of report type and section on the N-gram model. In general, the smaller the window size
used in the N-gram model (and in the subsequent test cases), the poorer the precision rate.
This reflects the inability of the model to generalize sufficiently about the characteristics
of errors, resulting in an oversensitivity and a tendency to overtag. The high recall further
reflects this, as an exceptionally low precision is little different from tagging every word in
a report as an error: in that case the recall is 100% while the precision approaches 0%.
Adjusting the threshold value reflects a tradeoff between recall and precision. With a low
threshold few words are flagged, so the recall is low while the precision is high. As the
threshold increases and more words are flagged, the recall increases while the precision
drops. Nonetheless, the best ratio, calculated via the f-measure (a combined measure of
precision and recall), was found when the threshold was set to zero (or 100 in the case of
the normalized PMI data).
With respect to the co-occurrence analysis, a further step tests the effect on the N-gram
model by splitting the corpora by report type (i.e. anatomical region) and by report section
(limited to “Impressions” or “Findings”, the two largest sections of free text). This test
was performed in the early stages of this research, and therefore on a smaller test corpus
than the eventual 30 reports. Nonetheless, while the “Impressions” dataset proved to have
too little training data (due to the typically small amount of summative text found in the
“Impressions” section), dividing by the “Findings” section and by anatomical region (recall
that “spine” is the only corpus in this study with sufficient reports to support this division)
showed an overall increase in f-measure of at least 6% (see Table C.1), suggesting that
restriction by type or section does have a positive impact and is worth further investigation.
Although such divisions require multiple training corpora, this is again a one-time, up-front
cost.
While the results of the PMI heuristic are lower than those obtained by Inkpen
and Desilets, this is not necessarily indicative of performance failure but rather reflects
the differing domains to which the technique was applied. Further study is needed with a
comparable training data set. From the perspective of the hybrid technique, however, the
performance of the PMI heuristic is sufficient for proof of concept.
Not surprisingly, the parser was found to perform reasonably well on syntactic errors
alone, and more poorly on the entire error set. Nonetheless, the design is such that the
rule set can be readily expanded to account for a wider variety of errors, as well as to
incorporate greater sensitivity to syntactic errors, which will in turn improve the parser’s
individual performance.
6.2.2 On the Nature of Report Errors
After extensive analysis of the test corpus, coupled with further discussions with radiologists,
the following observations on the nature of the errors as found in radiology reports have
been compiled.
Recognition Errors
Error Bias In the CDC-compiled test corpus, the repetition of recognition errors
within an individual report was frequently noted. That is, once a recognition error of a
particular kind was made, the recognizer seemed to show a bias towards
that same error wherever the corresponding sequence of words was repeated. For example,
in one test report the substitution “cassettes” for “facets” was made three times. While on
the surface this may seem reflective of vocal variations among radiologists, in several cases
such error repetition was found to occur only in some, but not all, reports dictated by the
same person. This may suggest transient vocal or ambient influences on the radiologist oc-
curring between reports, such as having a cold or a temporary change in background noise;
to eliminate this possibility a larger sampling of erroneous reports is needed, as well as a
record of the conditions under which the person is dictating. If speaker variation can be
eliminated then the root cause of the repetition may be linked to the recognizer itself.
Insidious Errors Many of the recognition errors within the test corpus were partic-
ularly inconspicuous, such as the substitution of “is” for “as”. When skimming a report
for errors these mistakes can easily be overlooked due to their similarity. Furthermore, it
is often the case that the proofreader may subconsciously correct for the error, especially if
he has dictated the report, as his own expectations can introduce bias. Although in some
cases the intended sentence or phrase may seem clear, when relying on computer-generated
summaries these errors will nonetheless affect the final summarization and any subsequent
reasoning based upon this summarization.
Particularly insidious errors for both humans and computers include deletion errors;
while many deletion errors are detectable by the syntactic parser when a word’s omission
results in an ungrammatical sentence, when the deletion results in an acceptable sentence
such errors are virtually undetectable. Examples include the omission of “A-P” in the
fragment “normal A-P alignment”, the omission of “and” in “central and canal”, or the
omission of “no” in “no evidence of”, where the resulting sentence is still parseable.
Such errors are doubly challenging: the deletion is often a serious one, as in the case
of a missing "no", and current NLP technology is virtually unable to detect it. Detection
requires a deep understanding of the text from a semantic, discourse, and even pragmatic
point of view, to determine whether the surrounding sentence makes sense in the context of
the report.
Words that are particularly susceptible to such insidious errors may need to be replaced
by less problematic words until error analysis reaches a stage where detection is possible.
As an example, “no” might be corrected for by using the words “negative” and “positive”
instead. An immediate challenge to such a solution, however, is the need to convince the
radiologists to modify the way that they dictate.
Post-Recognition Errors
In some cases there were post-ASR errors introduced when the reports were manually cor-
rected by the radiologist (none of these were “strong” errors as per Section 3.1.2). These
errors were detected by the hybrid error-detection system, underscoring the value of a sys-
tem that can provide a second set of “eyes” for the radiologist, beyond ASR, much in the
same way computer-aided diagnosis (CAD) can assist human diagnosis.
6.2.3 General Observations
One of the reasons errors in reports can be difficult to detect by human eyes is that expec-
tations override the actual words present. There is evidence that not every letter in a word
must be read in order to understand it. For example, it is still possible to read a word even
when its interior letters have been permuted, provided the first and last letters remain1. Likewise, this effect can be expanded
beyond the word level to the sentence level where the brain completes the sentence not
based on visual perception but rather on the expectation of its content; anyone who has
proofread their own work is likely to have experienced this effect. Thus, when a radiologist
reviews a report, his expectations of what the report should say can have a negative impact
on proofreading. What the error-detection system does is draw the radiologist’s attention
back to certain areas, forcing a closer look. Recognizing this tendency lends the technology
to other medical tasks and, beyond error detection, to the general problem of
computer-assisted proofreading. For example, by collecting statistics on the errors
overlooked during manual proofreading, it is possible to characterize the nature of these
missed errors. This can help in understanding the mechanisms that allow our expectations
to obscure the actual word, such as the features of particularly problematic words, which
might include a similar orthography, phonology or even features of the surrounding context
words.
6.3 From a Radiologist’s Perspective
There are many issues with ASR in the reading room beyond the immediate problems
with accuracy. An interview with Dr. Forster revealed a long list of problems with the
software and its integration into the radiology environment. Many of these
complaints are echoed in the literature by radiologists working with, or considering, ASR
versus traditional dictation methods (see Chapter 2). The following is a list of the most
common complaints. These do not directly pertain to ASR as it has been covered so far in
this dissertation, yet they directly relate to future extensions of the hybrid error-detection
system as discussed in Section 6.9.
Interface Perhaps the greatest complaint aside from accuracy is the interface between the
radiologist and ASR software. Issues include:
Speed There is often a noticeable delay before dictated commands are implemented,
or before dictated text appears on the screen.
1Tihs is an emxlpae of the atilbiy to raed txet beasd on the frsit and lsat lterts aonle.
Navigation In the ideal user interface, complete verbal navigation is not only possible
but painless. In reality, navigation commands are often printed as text instead
of interpreted directly, or ignored completely. Placing the cursor to select and
correct words in the text is complicated, error-prone and time-consuming via
voice commands alone.
Workspace and Workflow The design and setup of the ASR console should result in a
smooth integration with the workstation. From a software-engineering perspective,
conflicts with existing hospital or clinic software arise frequently. Physically the radi-
ologist often deals with poorly adapted equipment, such as corded headsets, and the
challenge of switching from the image to the dictation screen or between modalities,
such as between the mouse and keyboard when verbal navigation fails.
Inconsistent Performance In some cases, ASR performance seems to degrade after pro-
longed dictation sessions, while certain verbal commands result in seemingly random
responses at times.
Inadequate Training The steep learning curve is exacerbated by poor training on the part
of the vendor, and scheduling conflicts among the radiologists or within the hospital
[97].
Chronic Misrecognition: Poor Handling of Special Words or Phrases Due to very
specific cadence expectations, speech recognizers often misinterpret special words or
phrases, such as the following:
Proper names These include patient or clinician names.
Jargon and Acronyms Many highly specialized medical terms are acronyms, such
as “FSE T2” or “C4/5” and are a frequent source of recognition errors.
Postal codes While not frequently dictated in radiology (and less so with systems
that integrate well with the existing patient information system), Dr. Forster
observes that a very particular cadence is required to successfully do so.
As emphasized throughout this dissertation, the utility of ASR in the reading room is
contingent on its accuracy. Consequently, many of the problems listed directly above may
be reduced to inconveniences once the problem of accuracy is solved. Nonetheless, in the
interest of smoothly integrating ASR and ensuring that radiologists remain as productive as
possible, these issues are highly relevant and will help direct the course of future endeavours
as discussed in Section 6.9.
6.4 A Critical Look at the Hybrid Error-Detection
Methodology
Having established post-ASR, hybrid error detection as an effective means to recover from
low recognition rates, it is now possible to turn a critical eye to the methodology in the
hope of future improvement. As a new theory and application in error detection, the hybrid
methodology is not without challenges. This section examines open problems and weaknesses,
both those facing the methodology in general and those specific to the current
implementation. Where such challenges overlap, they are presented in the section on
methodology challenges.
6.4.1 Challenges Facing the Hybrid Methodology
The following is a list of the current challenges and open problems with respect to the
hybrid error-detection methodology. These will help lay the groundwork for future study
and improvement.
Subtlety in Errors As mentioned above, certain errors are particularly hard to detect.
Deletion errors are especially challenging as omitted words rarely leave a record of their
absence. As a result, the now incorrect sentence remains parseable. Theoretically, an N-
gram model of the domain may detect errors where the omission results in an N-gram with
very low probability. That is, two words that are always separated by some word(s) may
now find themselves adjacent as a result of the deletion error. Unfortunately, this only works
well in the case where the N-gram model is built upon collocations. If the context window,
n, is any larger, the combined result of the co-occurrence probabilities will smooth out the
effect of adjacency.
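The collocation case described above can be sketched as follows; the counts, threshold and sample report are illustrative assumptions.

```python
from collections import Counter

def bigram_counts(corpus):
    """Count adjacent word pairs (collocations) in a tokenized training corpus."""
    counts = Counter()
    for tokens in corpus:
        counts.update(zip(tokens, tokens[1:]))
    return counts

def suspect_deletions(report, counts, min_count=1):
    """Flag adjacent pairs whose training frequency falls below min_count --
    a possible sign that an intervening word was dropped."""
    return [(i, report[i], report[i + 1])
            for i in range(len(report) - 1)
            if counts[(report[i], report[i + 1])] < min_count]
```

If "normal" and "alignment" are always separated by "A-P" in training, then the pair ("normal", "alignment") has a count of zero, and its adjacency in a report flags a possible deletion.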
In addition, while parsing is effective at detecting grammatical errors or concepts that
are in disagreement with the surrounding words in the text, recognition errors do arise
that are not caught within the current defined constraints of the syntactic and semantic
grammar. The hybrid approach means that statistical methods, which characterize reports
by the frequency with which words co-occur with other words in the domain, may detect
recognition errors that the parser failed to detect, on the basis of their infrequency2. Still,
errors do arise that may not be caught by any heuristic, such as contextual errors that
would make sense in another report. For instance, a knee report describes facts relevant
to the knee; a recognition error within such a report that is grammatically correct and
relatively frequent in the training database may nonetheless go undetected.
Thus, the nature of errors merits further investigation, including a detailed analysis of
why certain errors go undetected. Implementation details aside, this can only be done once
the problem of insufficient training data has been controlled for (if there is a statistical
component).
Meta-Level Heuristic Interaction It is possible that an error from one heuristic can
be exacerbated when combined at the meta-level with the results from the other heuristics.
This problem of system reliability can be helped with the inclusion of more heuristics with
overlapping error coverage; in this way no one error is determined by the output of a single
heuristic, and thus if an error should be introduced from one heuristic, the overlapping
output will smooth over the erroneous data. Still, careful study of the meta-level interactions
is needed.
Ambiguity As with any NLP application, the problem of ambiguity is ever-present.
Ambiguity arises when there exists more than one interpretation for a text or segment.
This can happen at any level in the analysis, from multiple syntactic parses to multiple
conceptual analyses, such as those introduced by the MMTx software in deciding between UMLS
concept candidates. It may be the case that despite the semantic, syntactic and N-gram
model restrictions on a text or segment, more than one interpretation may still remain.
Depending on the implementation, the system may simply fail at this point, or choose the
wrong interpretation, resulting in either a false positive or a false negative.
Assessing Implementations of the Methodology Beyond the hybrid methodology,
there is currently no metric for comparing existing error-detection systems and
their performance, making comparative analysis difficult. Even matters as "straightforward"
as the word error rate vary within the literature. This is compounded by a lack of ASR
2Presumably an error should be infrequent or non-occurring in the training corpus if the corpus is built upon correct reports.
error-detection research in radiology. As a result, although the hybrid method outperforms
the individual heuristics in the local domain, it is difficult to compare its performance to the
problem of error detection at large. Still, the hope is that this work will provide a starting
point for comparison of problems in error detection in radiology, as well as inspiration for
expansion beyond the problem of radiology.
What is more, in order to assess the performance of this implementation, and the actual
effect it has in the radiology reading room, a clinic must be found that is willing to have
the system integrated within their current ASR setup.
Data Standardization There is a clear need for standardization in the representation
of medical knowledge, which will affect eventual extensions of this methodology to automated
correction and summarization (discussed in Section 6.9). Furthermore, the field
must see a standardization in the vocabularies and their interfaces, such as the UMLS,
required by many applications of MLP, including the hybrid error-detection methodology
(see Appendix B). By building a successful foundation now, it will be possible to fully
integrate systems hospital-wide, from radiology to paediatrics, while making information
available across the country and beyond via the Internet. Accurate statistics on past cases
can then easily be collected and used for research, patient care and decision support.
Adequate Domain Coverage Any implementation relying on a semantic or ontolog-
ical component faces the challenge of limited domain knowledge; a system that is too broad
is over-general and suffers a loss of accuracy [45], while a system that is insufficiently gen-
eral may not provide enough coverage for the domain at hand. Thus, a conceptual distance
metric risks being mired in an overly detailed ontology, or failing as a result of insufficient
distinction between the terms (that is, all distance measures will be too small to be of any
use).
Accuracy Despite the degree of error detection achieved in the implementation provided, it is
still far from the goal of 100% accuracy. If a system is to be deployed in a medical setting
where it is responsible for handling sensitive data, it must have extremely high accuracy. If a
report is returned to a requesting physician mistakenly identifying a disease or lack thereof,
the consequences could be fatal. As an extension of this, the system must have a strong
integration with the existing PACS3 and hospital information system (and potentially any
3Picture Archiving and Communication Systems.
ASR system already in place) so as to avoid additional errors being introduced throughout
the reporting process.
Data Sparseness As underscored by the existing implementation, data sparseness
and the overall quality of the training corpus is always a potential problem and must be
kept in mind for all statistical analyses. In Section 6.8 this problem is discussed in more
detail.
Choosing the Right Heuristics The choice of heuristics implemented in the hybrid
method is influenced by a number of factors. As mentioned previously, having multiple
error-detection methods with overlapping range of coverage can help increase overall system
reliability. In some cases, it may be known beforehand that only certain error levels or
types are relevant, which may limit the choices of heuristics or influence the choice of one
method over another. For instance, in ASR applications lexical or orthographic errors are
not relevant and have no bearing on the system. In contrast, web-based analysis, such as
the study of weblogs, is likely to encounter colloquial spellings and other variants, all of
which the system must take into account.
6.4.2 Challenges Facing the Current Implementation
The following is a list of the weaknesses within the current implementation of the hybrid,
error-detection algorithm.
Reliance on External Knowledge Sources Relying on the UMLS for the ontolog-
ical component means the implementation is susceptible to the weaknesses of that ontology.
For example, incomplete coverage means that occasionally a valid medical term is found in
a report that is not found in the UMLS. When a NULL value is returned on a legitimate
word, this disrupts the ability of the system to accurately detect errors. The problem was
exacerbated by MMTx's inconsistent handling of legitimate terms that are unknown to the
ontology or that have an entry only in the Metathesaurus, for example.
Data Sparseness Common to both probabilistic approaches is the insufficiency of the
training data. While sufficient for proof of concept, a larger corpus is needed to improve
the accuracy and reliability of the statistical heuristics. Many of the problems with respect
to false positives (and consequently the low precision rate) were attributable to a legitimate
medical word’s absence in the training data.
Incomplete Information As an initial attempt at the problem of low recognition
rates in radiology reporting, the goal was recovery from recognizer-induced errors. These
are errors that occur despite correct input. This discounts any input where the user may
have had a speech impediment (such as a cold), or where unexpected ambient noise was
present. Accounting for these problems is not possible without access to the corresponding
audio tracks; therefore, a more in-depth analysis of the recognition errors was not possible
at this stage.
Assessing System Performance The process for identifying recognition errors de-
scribed in Section 5.2.3 is susceptible to inconsistent interpretations and does not represent
the best way to identify errors. A deeper analysis of the causes underlying consecutive errors,
in particular, is needed before an automated error collection system can be developed.
Corpus Bias As mentioned in Section 5.1.1, the training corpus comprised MRI
reports alone, yet the system was tested on a corpus that included both MRI and CT reports. While the
radiological parlance is similar in both MRI and CT dictations, it is important to recognize
the potential for bias this discrepancy introduces. For example, reference to images specific
to a particular imaging technique may be found in the “Findings” section4. Nonetheless,
when the current test corpus was split and the results separately tabulated for those reports
which were MRI-based and those which were CT-based, no difference was noted in the
performance of the statistical algorithms on either report type. Still, a larger test corpus is
needed to confirm this finding.
Also, since the training corpus was obtained from one clinic only, there is a risk of further
bias in the data. Therefore, it is important that in developing or expanding the training
corpus a greater variety of reports be obtained. This includes a mix of MRI and CT reports
(as well as any other imaging report to which the error-detection algorithm may be applied),
along with input from other clinics.
4The greatest potential disparity, however, is within the "Techniques" section of the report, which is excluded in this research since it is a template selected by the user and not likely to contain any errors (as discussed in Section 5.1.1).
6.5 Corollaries
There are a number of implications stemming from the conclusion that post-ASR, hybrid
error-detection is an effective means to recover from low recognition rates in radiology report
dictation. These are divided into immediate and longer-reaching consequences.
6.5.1 Immediate Implications
The classification of error-detection methods presented in Chapter 3 enables the objective
discussion of existing and future error-detection techniques. This will assist in developing
gold standards both within and outside medicine, making it easier to develop and assess
new error-detection technology.
What is more, the proof of concept from Chapter 5 provides an immediate roadmap for
the development of a system for actual use in the radiology reading room as discussed in
Section 6.6. Combined with the conceptualization in Chapter 4, this will allow improvements
and extensions over the current implementation. As an immediate consequence of a viable
application, radiologists will have another weapon against the problems currently plaguing
ASR in the reading room. Improving the experience with ASR will encourage other radiology
clinics to upgrade without worry of a reduced net performance either in efficiency or in report
quality. A highly reliable ASR system will remove the need for transcriptionists, while an
automated error-detection system will allow the radiologist to proofread and correct his
own reports efficiently. The result is improved report handling and turnaround time (TAT),
improved report quality, and, finally, improved patient handling.
The strength of the hybrid error-detection method over the reliance on any single heuris-
tic also has implications in the development of the meta-level analysis of the component
heuristics and their interactions. The nature of this interaction is important in further-
ing our understanding of how various levels of linguistic knowledge, both probabilistic and
non-probabilistic, work together to form a coherent analysis.
6.5.2 Implications for Future Study
Beyond the immediate implications of this thesis, there are also farther reaching conse-
quences. On a larger scale, the error-detection system (and subsequent advancements) will
help mitigate the difficulties of the transition from traditional dictation methods to ASR-
based systems, a transition that some are now citing as inevitable (see Chapter 2).
The processing in the error-detection system lends itself quite naturally to the problem
of report summarization. What is more, the ability to detect errors in such cases is especially
important since not only must the summarizations be correct for the current patient, but
as electronic records they are likely to find use in subsequent research.
The ability to quickly create an electronic record of a report helps streamline the re-
porting process, resulting in radiology reports that are available throughout the hospital
(via the hospital information system), and remotely to clinics. Doctors waiting for results
will receive them as soon as they are complete, radically improving the TAT. This leaves
open the possibility for efficient tele-radiology operations, or remote consultations between
radiologists, that otherwise might not be possible with multi-day TATs. In addition, pro-
viding well-structured reports will allow clinicians to easily search past cases and perform
statistical analyses, making these reports accessible to both further research and decision
support.
6.6 A Standalone Application for the Radiology Workstation
On its own, the hybrid system from Chapter 5 is nothing more than a promising idea for
post-ASR error-detection. To show the true value, the system must be integrated and tested
within an actual radiology reading room. This section examines what is required to turn
the current software into a program for practical application.
Figure 6.1 shows the error-detection process.
6.6.1 Steps to an Independent System
As it exists now, the hybrid error-detection system is a juxtaposition of various heuristics
that are manually applied to the test corpus. If the system is to become a standalone
application, a front end must be designed that will handle running the various heuristics
in parallel. Since the system runs as a post-processing stage, the output from the speech
recognizer can be provided as external input to the error-detection analysis. The analysis is
then performed and the results automatically output in a format which the radiologist can
modify or correct.
[Figure: interactive front end. Report dictation produces the SR output, which passes through error detection to yield an interim report; the user's corrections then produce the final report.]

Figure 6.1: The error detection process.
Output
Currently, the results of the error-detection process are collected by hand. In order to achieve
application independence, these results must be automatically collected and displayed in a
user-friendly manner. The nature of this display has been given considerable thought and
depends upon the final choice of mappings (the error tag-set {correct,incorrect} or raw
confidence scores).
In the current, manual collection of results, the error-tag-set mappings, as opposed to a
confidence score or percentage, are provided as the final output. This reflects the error-tag
mappings assigned by the individual heuristics. For example, if at least one heuristic maps
a word in a report to incorrect, then it is mapped to incorrect in the final output (as
per the discussion in Section 4.2). These results are compiled by hand and collected in a
text file as a list of word-tag pairs. Such a format offers poor readability; what is needed is
a script that applies these tags to the actual report in an easily readable format.
For instance, all words with an "incorrect" tag may be coloured red within the body of the
report to draw the radiologist's attention to these problem areas.
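The aggregation rule and the marking step might be sketched as follows. This is an illustrative reconstruction, not the thesis implementation; the `<<…>>` marker stands in for whatever highlighting (e.g. red text) the front end applies.

```python
# Sketch: combine per-heuristic error tags using the rule from Section 4.2 --
# a word is "incorrect" in the final output if at least one heuristic says so.

def combine_tags(per_heuristic_tags):
    """per_heuristic_tags: list of lists of (word, tag) pairs, one list per
    heuristic, all aligned on the same word sequence."""
    words = [w for w, _ in per_heuristic_tags[0]]
    combined = []
    for i, word in enumerate(words):
        tags = {tags_i[i][1] for tags_i in per_heuristic_tags}
        combined.append((word, "incorrect" if "incorrect" in tags else "correct"))
    return combined

def mark_report(tagged_words):
    """Render the report, wrapping flagged words in a visible marker (a word
    processor front end could map this marker to red text instead)."""
    return " ".join(f"<<{w}>>" if t == "incorrect" else w
                    for w, t in tagged_words)

# Toy heuristic outputs (invented for illustration):
ngram  = [("no", "correct"), ("spondylolysi", "incorrect"), ("seen", "correct")]
parser = [("no", "correct"), ("spondylolysi", "correct"),   ("seen", "correct")]
final = combine_tags([ngram, parser])
print(mark_report(final))  # no <<spondylolysi>> seen
```

Because the rule is a simple disjunction, a word escapes flagging only if every heuristic agrees it is correct.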
Given a more complicated aggregation of heuristic results that returns a confidence or
percentage score, the combined confidence value of a word may be displayed as a grey- (or
colour-) scale representation of the report, as in Skantze’s work on error detection in spoken-
dialogue systems [135]. This allows a radiologist to immediately and visually characterize the
“Possible spondylolysis eye laterally of L5.
If clinically indicated, CT scan could be
performed for further assessment, but no
spondylolysi cysts is seen. Advanced
degenerative disease at the L-2/3 level.”
1. Possible spondylolysis eye laterally of L5.
2. If clinically indicated, CT scan could be performed
for further assessment, but no spondylolysi cysts is seen.
3. Advanced degenerative disease a the L-2/3 level.
Figure 6.2: Sample output using a grey-scale confidence indication.
state of the report. As discussed in Section 4.1, from the perspective of medicine some might
suggest that all errors should be considered significant; thus, mapping directly to
{correct,incorrect} with a corresponding binary colour scheme may be more desirable than
a gradient representation that admits degrees (that is, a word is either misrecognized or it
is not).
Figure 6.2 is an example of how the error confidence information may be conveyed in the
final output; the sentence “*Possible spondylolysis eye laterally of L5” is a misrecognition of
the sentence “Possible spondylolysis bilaterally of L5”, “*spondylolysi” is a misrecognition
of “spondylolysis”, and finally “a the” is an insertion error.
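A grey-scale display of this kind might be sketched as follows. The bin boundaries and grey labels are illustrative assumptions, not values from the thesis or from Skantze [135].

```python
# Sketch: map a combined per-word confidence score in [0, 1] to one of a few
# grey levels, in the spirit of the grey-scale display of Figure 6.2.

GREYS = ["black", "dark-grey", "light-grey", "white"]  # low -> high confidence

def grey_level(confidence, bins=4):
    """Return a discrete grey level index: 0 (lowest confidence, most
    visually prominent) up to bins - 1 (highest confidence)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return min(int(confidence * bins), bins - 1)

def shade_report(scored_words):
    return [(word, GREYS[grey_level(score)]) for word, score in scored_words]

# Invented scores for the Figure 6.2 example; "eye" (0.1) renders black.
report = [("Possible", 0.95), ("spondylolysis", 0.9), ("eye", 0.1),
          ("laterally", 0.4), ("of", 0.97), ("L5", 0.85)]
print(shade_report(report))
```

A binary colour scheme, as suggested above, is just the special case `bins=2`.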
6.6.2 User Interface for the Hybrid Error-Detection System
The user interface of any practical application is the face by which we judge its overall
quality. If a program is cumbersome, difficult to learn, or difficult to operate, it will not be
accepted by the radiology community, which has already shown an understandable resistance
to poorly integrated software [97]. From a purely functional perspective, a system that does
not interface effectively with its user will not run efficiently, irrespective of the computational
efficiency of the system itself.
The application of the error-detection system on the recognizer output should be con-
trollable via the main dictation window (for example, as a Microsoft Word macro). Once
a radiologist has dictated the report, he has the option of choosing to run the resulting
dictation through the error-detection system, or setting the system to run automatically
following report dictation (some speech-recognition systems allow user-defined commands
that could be linked to the error-detection system and called at the end of dictation). On
completion of the analysis, the report, complete with error mapping, is then available via a
word-processing interface (with possibilities for later expansion via suggested correction
candidates, as discussed below in Section 6.9).
Though the exact nature of the word-processing interface is open to speculation, it
must include facilities for correcting errors via the keyboard/mouse or through further voice
commands. As a future extension, a facility for suggesting correction candidates will allow
the radiologist to simply click the appropriate correction and immediately replace it without
further typing, or switching of modalities (i.e. from mouse navigation to the keyboard, or
vice versa).
6.6.3 Miscellaneous Requirements
Since the performance of the system relies on the presence of threshold values that determine
the degree of error filtering, a useful extension is the presence of a “slider”-based interface
that allows radiologists to control the extent of filtering, depending on their preference (and
the task at hand).
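The slider mechanism amounts to a movable confidence cutoff. A minimal sketch, with invented scores:

```python
# Sketch of the "slider" idea: a radiologist-controlled threshold decides how
# aggressively low-confidence words are flagged for review.

def flag_below_threshold(scored_words, threshold):
    """Return the words whose confidence falls below the slider setting.
    A higher threshold flags more words (stricter filtering)."""
    return [word for word, score in scored_words if score < threshold]

scored = [("no", 0.98), ("spondylolysi", 0.2), ("cysts", 0.55), ("seen", 0.9)]
print(flag_below_threshold(scored, 0.3))  # ['spondylolysi']
print(flag_below_threshold(scored, 0.6))  # ['spondylolysi', 'cysts']
```

Lowering the slider trades recall for fewer interruptions; raising it does the reverse, which is why the preference belongs with the user and the task at hand.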
6.7 Measuring the Real-World Success of the System
While assessing the accuracy of the hybrid error-detection system is a useful indication of
the quality of the software itself, it does not reflect the system’s performance with respect
to the actual radiology environment. This performance is a product of not only the software
calibre, but the integration with the existing ASR software and user interface, and addresses
the question, does error-detection augmentation equal or surpass the TAT efficiency of
traditional methods? Thus, any standalone error-detection system must be assessed in the
radiology suite and the report TATs measured and compared against traditional dictation
methods, as well as ASR without error detection. Although a positive effect on the TAT
is expected, it is impossible to assert this as fact without evidence from studies within a
radiology reading-room. Furthermore, the magnitude of improvement over standard ASR
(and across vendor systems) must be measured, as well as the differences between ASR-
augmented-with-error-detection versus non-ASR, traditional methods.
6.8 Data Sparseness: Smoothing
As mentioned in the discussion of the co-occurrence results in Chapter 5, Section 5.2.9,
a zero probability can indicate a failure of the training corpus to provide an adequate
representation of the words within the domain, otherwise known as the problem of data
sparseness. With this in mind, it makes little sense to treat a zero probability as actually
zero, but rather as an inaccurate assessment. Since the training corpus is at best a subset
of the domain, it is impossible to generalize beyond the training corpus to conclude that
any string is impossible, especially if that string is considered “correct” by the domain’s
standards, such as in the case of a false positive. Given that creating a corpus that contains
every single possible word and its environments is an impossible task, it is not possible to
know the “true” meaning of a zero probability relative to the domain. That is, we must
decide whether a zero probability means that the word or N-gram is so rare that it is not
in the corpus, or that the word or N-gram does not occur in the domain.
Although a larger context window size can provide more information to better characterize
a word and its features, data sparseness means that within this larger window the
probability of encountering a word pair that has not occurred in the training corpus is
increased (and therefore the likelihood of having to handle a zero probability) [96]. Even an
exceptionally rare word, which would have a minimal probability of occurrence within the
training corpus, may never have occurred in the relevant N-gram in that corpus, rendering
the results unreliable [96].
The answer to this problem is a technique called “smoothing” [96, 81], which modifies
all probabilities within the training corpus to reduce the effect of data sparseness. A simple
example considers only the k most common N-grams, and discards all other words as “out of
vocabulary” (OOV) words [96]. Manning and Schutze observe that this serves two purposes:
to smooth the resulting probability distribution and reduce or eliminate the presence of zero-
probability words or N-grams; and, to reduce the memory requirements by reducing the
parameter space (i.e. the smaller training corpus) [96]. For the purposes of error detection,
however, reducing the training corpus risks increasing the false-positive rate unacceptably.
In another example of smoothing, zero- or low-probability words and N-grams are re-
assessed and their probabilities modified to better reflect the domain. One naïve method,
add-one (Laplace) smoothing, adds 1 to every frequency count and renormalizes. While a
straightforward solution, this shifts the distribution of probabilities at all levels, not simply
the low-frequency ones, and consequently produces poor estimates that are at times a few
orders of magnitude off [53].
Alternatively, a technique called "Witten-Bell smoothing" uses the probability of extremely
rare occurrences ("things seen once") to estimate those never seen, on the assumption
that a zero-probability occurrence simply has not happened yet [96]. That is, the probability
of a new event is estimated from how often genuinely new events (word types seen for the
first time) appeared as the training corpus was read [96]. It is also important to note that
some words are more likely than others to precede a previously unseen word. By calculating
how many distinct word types follow each word in the training corpus, it is possible to
estimate the likelihood that a new, unseen word will follow that word [96].
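A simplified Witten-Bell-style bigram estimate can be sketched as follows. This is an illustration of the idea rather than the full scheme in Manning and Schutze [96]: the value returned for an unseen continuation here is the *total* mass reserved for new events after that history, which a complete implementation would further divide among the unseen words.

```python
# Illustrative Witten-Bell-style bigram estimate: the probability mass given
# to unseen continuations of a history h is T(h) / (c(h) + T(h)), where T(h)
# is the number of distinct word types ever seen after h in training.

from collections import Counter, defaultdict

def train(tokens):
    followers = defaultdict(Counter)
    for h, w in zip(tokens, tokens[1:]):
        followers[h][w] += 1
    return followers

def wb_prob(followers, h, w):
    seen = followers[h]
    c_h, types = sum(seen.values()), len(seen)
    if c_h == 0:
        return 0.0                       # history itself never observed
    if w in seen:
        return seen[w] / (c_h + types)   # discounted seen-event probability
    return types / (c_h + types)         # total mass reserved for new events

# Toy "corpus" invented for illustration:
tokens = "no evidence of fracture no evidence of effusion".split()
f = train(tokens)
print(wb_prob(f, "evidence", "of"))      # 2/3: seen event, slightly discounted
print(wb_prob(f, "of", "hemorrhage"))    # 0.5: unseen continuation, yet non-zero
```

The key property for error detection is the last line: a legitimate but unseen N-gram no longer receives probability zero, so it is less likely to be flagged as a false positive.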
In conclusion, N-gram, corpus-based probabilistic methods, such as those used for the
error-detection analysis, are susceptible to problems of data sparseness. Any training corpus
is only an approximation of the distribution of words within that domain. Once a larger
(or improved) training corpus has been obtained, experiments with N-gram size can be
conducted with specific attention to word distribution: where distribution can support
larger multi-gram analyses, it should be used [96, pp. 199-202]. Furthermore, a consideration
of smoothing techniques, such as Witten-Bell, may provide relief from the problem of data
sparseness (especially if a larger data corpus does not improve the results).
6.9 Future Work
Perhaps the greatest contribution stemming from the hybrid, error-detection methodology
exists as a function of a larger system. As mentioned above, although the research so
far is sufficient for proof of concept, further development is necessary for this software to
be of actual use in the radiology setting. The error-detection system nonetheless leaves
open the possibility for many future developments to improve the system with respect to
error detection and beyond. This section takes a look at some of the immediate extensions
possible, followed by a look at possibilities in the more distant future.
6.9.1 The Full System
Beyond error detection, the work here can be expanded into a full report analysis system,
the report analyser. As discussed in Chapter 2, such a system involves natural language
processing of reports to produce a computer-accessible (i.e. searchable and updateable)
summarization of a report, a subset of which involves error detection and correction. Since
much of the analysis required for a successful error-detection system overlaps that required
for a full report summarization system, including in-depth syntactic and semantic analysis,
expanding to automated summarization is a natural step. Combined with the value of
summarized reports, it makes sense for the report analyser to be the eventual goal of post-
processing in radiology.
Figure 6.3 shows the full system as envisioned. The user has three possible actions within
the system: dictate report; correct existing report; or query the report database. When dic-
tating a report, the text is collected via ASR and then run through the report analyser. The
report analyser performs an in-depth linguistic analysis and creates a computer-accessible
XML representation of the knowledge within the report, which will support future natural
language user queries. During this analysis, the report analyser applies the error-detection
algorithm to tag the summarized report, displaying the results as a copy of the original
dictated text with error tags. The user is able to make any corrections (or in the case of
an automated correction system to review the corrections made), and finally to “sign off”
on the report as correct and complete. This summarized and signed report is added to the
report database, which the user has the option to query. Figure 6.3 shows the database
query engine as a dotted line since this avenue of research has not yet begun.
The reliance on a XML-based report representation is beneficial for two reasons: it
ensures the adherence to standards in representation that will make integration with other
systems uniform; and, the final reports are in a web-ready format that will allow transmission
throughout the hospital, and even remotely to doctors at external clinics.
[Figure: interactive front end. The user may dictate a report, correct an existing report, or issue a database query; dictation feeds the analysis and error-correction stages, and reports are stored in and retrieved from the report database (DB), which the query engine accesses.]

Figure 6.3: The full system as envisioned.
6.9.2 Immediate Extensions: Improving the Current Heuristics
Improving the Statistical Heuristics
As mentioned in the discussions of both probabilistic heuristics, the lack of training data
has a detrimental impact on the results of N-gram-based models due to data sparseness.
Therefore, the immediate task is to expand the training corpus beyond the current 2751
reports. While the “more is better” approach is generally adequate in the face of corpus
design, the creation of a balanced corpus that better reflects the expected distribution of
radiological text may be a worthwhile approach, given the limited domain and in the interest
of tractability. Although a large domain, radiology nonetheless does not exhibit a wide variety
of expressions within its reports. This could make it possible to intelligently select a training
corpus that would best represent the domain. The problem of rare words may still remain,
however, and may call for smoothing to be applied.
Currently, there is no pre-processing on the text used in the statistical approaches. As
a result, all orthographic variants of words, regardless of tense, et cetera, are treated as
independent terms; that is, "examine", "examined" and "examines", for instance, are all
considered separate, independent terms within the analysis. An interesting experiment is
to stem the words in the text to see whether this affects performance. Hypothetically,
stemming should reduce the variety within the training corpus, thereby increasing its ability
to generalize.
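The conflation effect of stemming can be illustrated with a deliberately crude suffix stripper; a real experiment would use an established stemmer such as the Porter algorithm instead.

```python
# Minimal suffix-stripping sketch (purely illustrative, not a real stemmer):
# conflating orthographic variants before training should reduce the
# apparent variety in the corpus.

def crude_stem(word):
    """Strip a few common English suffixes, longest first, keeping at
    least three characters of stem."""
    for suffix in ("ations", "ation", "ings", "ing", "es", "ed", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

variants = ["examine", "examined", "examines"]
print({crude_stem(w) for w in variants})  # {'examin'}: one term, not three
```

With all three variants mapping to a single stem, their counts pool together, which is exactly the reduction in sparseness the experiment would test.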
By expanding the definition of stop word, low-content words, which have a higher
frequency in a text, can be added to the stop list. The information content of a word is
closely related to its frequency in the training corpus: the lower the probability of a word
occurring, the more information it brings to bear on the surrounding context. Likewise, the
higher the probability, the less information it has to offer (in the extreme case, consider a text
containing only one word repeated multiple times; that word would have zero information
content). Thus, expanding the stop list to include very high frequency words can reduce
the reliance on these words, theoretically improving error-detection performance.
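A frequency-driven expansion of the stop list might look like the following sketch; the cutoff `k` is a tunable assumption, not a value from the thesis.

```python
# Sketch: extend the stop list with the highest-frequency (lowest
# information) words found in the training corpus.

from collections import Counter

def expand_stop_list(tokens, base_stop_words, k):
    """Add the k most frequent non-stop words to the existing stop list."""
    counts = Counter(t for t in tokens if t not in base_stop_words)
    high_frequency = [w for w, _ in counts.most_common(k)]
    return set(base_stop_words) | set(high_frequency)

# Toy training text invented for illustration:
tokens = "is seen is noted is seen no acute fracture is seen".split()
stops = expand_stop_list(tokens, {"no"}, k=2)
print(stops)  # the very frequent "is" and "seen" join the stop list
```

In practice the cutoff would be chosen by inspecting the frequency distribution rather than fixed in advance, since the boundary between function words and low-content domain words is corpus-dependent.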
Some previous work in N-gram models includes sentence boundaries and punctuation
when analysing a training corpus. This may be a useful extension as many sentences within
radiology reports are short, assertive sentences of a limited form. Constraining the words
further based upon sentence boundaries might help characterize this property of a report
within the training corpus.
Lastly, further work on the division of the training corpus by report type and/or section
may be helpful, based upon the results in Table C.1 and the discussion in Section 5.2.9. A
larger test set, also divided by anatomical region, is needed.
Improving the Syntactic Analysis
The parsing heuristic (discussed in Chapter 4) is based upon a basic CHR property grammar
(see Appendix A). Immediate extensions include a finer characterization of the domain
with a more extensive property base, including tense, person, voice, number and mood. By
increasing the variety of sentences recognized by the parser, the presence of false positives
caused by parser failure is reduced (as opposed to an actual error in the text triggering an
incomplete parse).
Improving the Semantic Analysis
The semantic distancer measures the conceptual similarity between two concepts as a func-
tion of their edge distance within a semantic network, as described in Chapter 4. Although
the present implementation is faced with a number of difficulties, with careful consideration
these are not insurmountable. One of the inherent difficulties in the current approach is
the reliance on the MMTx software (see Section 5.2.6). MMTx was chosen as a readily
available program that would allow the error-detection system to interface with the UMLS.
Unfortunately, since MMTx was not designed specifically for this purpose, there were several
difficulties encountered that have delayed the success of the semantic distancer as discussed
in Section 5.2.6. Additionally, it is not possible to supply a stop list, which ultimately could
have an effect on the efficiency of the system given the size of the UMLS. The design and
development of a specialized program to interface with the UMLS dedicated to the task of
error detection will help with these difficulties, and is left to future work.
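The edge-distance measure itself is straightforward to sketch. The toy network below is invented for illustration; a dedicated interface would instead walk UMLS Metathesaurus relations.

```python
# Sketch of conceptual distance as shortest edge distance in a semantic
# network (Chapter 4), computed by breadth-first search.

from collections import deque

def edge_distance(graph, source, target):
    """Return the number of edges on the shortest path between two
    concepts, or None if they are unconnected."""
    if source == target:
        return 0
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, d = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return d + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, d + 1))
    return None

# Toy semantic network (invented concepts and links):
toy_network = {
    "fracture":  ["injury"],
    "injury":    ["fracture", "pathology"],
    "pathology": ["injury", "effusion"],
    "effusion":  ["pathology"],
}
print(edge_distance(toy_network, "fracture", "effusion"))  # 3
```

A word whose distance to its report context exceeds some threshold would then be a candidate for flagging, independent of any MMTx-specific behaviour.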
Due to time constraints and the need to keep the implementation within a reasonable
developmental timeframe, the semantic grammar as laid out in Chapter 4 remains to be developed. To
this end, an in-depth analysis of the radiological archetypes that can be used to develop
semantic-grammar rules is required. As a first step, a concordance analysis of the training
corpus will help reveal those constructions most common to a radiology report. From these
it is possible to abstract semantic rules for use in a semantic grammar. Recall that the
syntactic grammar developed has been created with these extensions in mind (see Section
5.2.7) and can be easily extended to accommodate semantic rules. Given a careful analysis,
once a selection of the semantic properties of radiology reports have been extracted, a
translation by hand into semantic rules or constraints will provide a proof of concept, with
an eye to automated rule induction in later stages.
6.9.3 Miscellaneous Improvements
The dictation of individual letters is a known problem within ASR, with 20 of the 26 letters
of the English alphabet, for instance, causing difficulty [127]. Rolandi notes that not only
are English letters monosyllabic for the most part (the exception being “W”), but many of
the letters also sound similar and can be grouped into what he calls “confusion classes”5 [127].
Naturally, this results in difficulties dictating acronyms, or single-letter codes in medicine,
such as “C4”. These difficulties may be partially addressed by the expansion of acronyms;
for example, the expansion of “C4” into “cervical 4”. Zahariev developed an automated
acronym-expansion algorithm designed to function as part of a larger NLP system [164].
This may offer some help in the existing error-detection infrastructure proposed here, po-
tentially helping to solve a subset of the problems surrounding acronyms and abbreviations.
Perhaps the most potential for success with acronyms, however, comes with a change in ra-
diologist dictation habits; teaching radiologists to dictate expanded acronyms. The tradeoff
5For example, {“F”,“S”,“X”}, or {“M”,“N”}.
will depend on the extra time needed to dictate a longer phrase, versus the frequency of
acronym errors and the time spent on corrections.
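A simple dictionary-based expansion of single-letter spinal codes can be sketched as follows. The pattern table is an illustrative assumption, not Zahariev's algorithm [164], which induces expansions automatically; note too that a naïve rule like this would wrongly expand non-anatomical codes such as the MRI sequence "T2", so a real system needs contextual safeguards.

```python
# Sketch: expand single-letter spinal codes such as "C4" -> "cervical 4".

import re

SPINE_PREFIXES = {"C": "cervical", "T": "thoracic", "L": "lumbar", "S": "sacral"}

def expand_codes(text):
    """Replace letter-number spinal codes with their expanded forms."""
    def repl(match):
        letter, number = match.group(1), match.group(2)
        return f"{SPINE_PREFIXES[letter]} {number}"
    return re.sub(r"\b([CTLS])(\d+)\b", repl, text)

print(expand_codes("Possible spondylolysis bilaterally of L5."))
# Possible spondylolysis bilaterally of lumbar 5.
```

Multi-letter acronyms such as "CT" pass through unchanged, since the pattern requires a single letter immediately followed by digits.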
As laid out, the hybrid error-detection algorithm provides feedback to the user as a post-
processing stage following dictation. It may prove useful to adjust the system so that
detection occurs in a more "on the fly" manner, whereby radiologists receive feedback
on the error status of their dictation as it is being dictated. This might involve tagging the
text as it is dictated. Without the benefit of the text from the entire report, however,
as-you-go tagging may allow only certain heuristics to function to their full efficacy (a dual-
pass mechanism might be devised whereby some errors are detected on the fly, with a full
error check following dictation).
From Detection to Correction
Once the errors are identified in a report it falls on the radiologist to review the report and
implement any needed corrections. This process can be made more efficient by the presence
of a well-thought-out user interface. Currently, following ASR dictation, the radiologist can
switch to the dictation screen and, via the mouse and keyboard, implement the necessary
changes. In many cases making changes using the voice interface is challenging and time-
consuming, while the speech recognizer’s suggested “corrections”, intended to streamline
the correction process, are often far from relevant.
Once a system is in place for detecting errors, however, it is possible to use the infor-
mation from the error analysis towards intelligently suggested corrections. For example,
a semantic-based analysis can reveal clues as to the semantic type of the expected word
or phrase, while a syntactic analysis can reveal the expected part of speech. In addition,
the probabilities determined based upon the N-gram model of the domain can be added
to this information to help narrow down the list of possible candidates. When augmented
with the N-best list from the recognizer, this can help create a more intelligent suggestion
list. While a black-box method does offer a “fresh perspective” on confidence ranking of
recognized terms, independent of any recognizer influence, it is still useful to have access to
the internal workings of the recognizer when possible.
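One piece of this combination, re-ranking the recognizer's N-best list by the domain N-gram model, can be sketched as follows. All names and probabilities are invented for illustration; a fuller version would also filter candidates by expected part of speech and semantic type.

```python
# Sketch: re-rank recognizer N-best candidates for a flagged word by the
# bigram probability of each candidate given its left context.

def rank_candidates(nbest, left_word, bigram_probs):
    """nbest: candidate words from the recognizer for the flagged position.
    bigram_probs: mapping (left_word, word) -> domain-model probability.
    Returns candidates ordered from most to least probable."""
    scored = [(bigram_probs.get((left_word, c), 0.0), c) for c in nbest]
    return [c for _, c in sorted(scored, reverse=True)]

# Invented probabilities for the Figure 6.2 example:
bigram_probs = {
    ("spondylolysis", "bilaterally"): 0.02,
    ("spondylolysis", "eye"): 0.0,
    ("spondylolysis", "laterally"): 0.005,
}
nbest = ["eye", "bilaterally", "laterally"]
print(rank_candidates(nbest, "spondylolysis", bigram_probs))
# ['bilaterally', 'laterally', 'eye']
```

Here the domain model promotes "bilaterally" over the recognizer's original choice "eye", which is exactly the kind of suggestion the radiologist could accept with a single click.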
While arguably the ultimate goal of a “full-service” summarization system is not only au-
tomated error detection, but automated error correction as well, any procedure in medicine
that removes human eyes from a process introduces the risk of undetected errors. The pur-
pose of the radiology application of the error-detection methodology is to add reliability and
reduce the risk of user error. A fully automated system may risk the introduction of new
errors altogether, thus its development, design and integration will be difficult and further
study is required on the expected impact and reliability.
6.10 Summary
This chapter explores the consequences arising from the theoretical and experimental work
from previous chapters, both within radiology, and within error detection and report sum-
marization in general. An investigation of the challenges faced by the current approach,
and suggestions for improvement is presented. These provide a natural segue into a vari-
ety of avenues for future work that include improving the current hybrid system, adding
further heuristics, and the development of a full report analysis software system that takes
advantage of and extends the processing already done towards error detection. In the next
chapter, the wider application of the hybrid error-detection methodology will be explored.
Chapter 7
Beyond Radiology
7.1 Error Detection in the Greater Context
While post-ASR, hybrid error detection is an effective means to recover from low recogni-
tion rates in radiology report dictation, there exist valuable applications beyond the medical
domain. This chapter explores two such applications in the fields of cognitive science and
general natural language processing. These discussions are not intended as a proof of con-
cept, but rather to demonstrate that the hybrid, error-detection methodology is applicable
to a larger context, and to inspire future work.
7.1.1 The Methodology in Other Domains
Although the application of the error-detection methodology has focused on radiological
functions alone, the methodology itself is sufficiently general to be extended to other prob-
lems of error detection from other areas of medicine, the World Wide Web, and beyond. In
general, this requires sufficient domain information in the form of an ontology to support
the semantic analysis (the UMLS would suffice for any medical applications, for example)
as well as an update of the semantic rules, currently representing radiological archetypes,
to reflect a new domain. Lastly, a database of text samples in the new domain on which to
train the statistical algorithms is also needed. Still, despite these requirements the general
challenge of adapting to another domain is not unreasonable, and the methodology may
find favour in other fields.
The black-box error-detection techniques discussed so far are not dependent on input
CHAPTER 7. BEYOND RADIOLOGY 128
from ASR and therefore can be extended to error-detection tasks from different sources, such
as the general problem of computer-assisted editing. This can include improved utilities for
word processing, the World Wide Web, and more.
Within some domains, the choice of heuristic may be influenced by constraints on the
domain itself. For example, it may not be possible to collect an adequate training corpus
to develop an N-gram model of the domain. While this may be especially true in highly
constrained domains, it may be offset by the ability to develop highly accurate parsers based
on the limitations of expressions occurring within the domain. Similarly, some domains may
not enjoy the same degree of constraint in the possible range of concepts, or likewise may
have fewer grammatical restrictions on the words in the language. Both of these conditions
would preclude the use of conceptual distance (with too many varied concepts the measure
of distance loses predictive power), and certain highly-specified semantic analyses, such as
domain-specific verb complements.
An automated error-detection system allows for text quality assessment that is not
susceptible to errors or bias in human detection, and provides opportunities for higher level
statistical analysis that is not possible by humans alone.
As a final general note, a large domain necessitates a larger information base (training
corpus, ontology, lexicon, et cetera), which increases processing time and in turn
may limit the usefulness in some domains.
The following sections take a closer look at two particular applications of the hybrid
error-detection methodology outside of medicine: cognitive science and machine translation.
7.2 Cognitive Science Perspectives on Error Detection
Cognitive science is the scientific study of the mind and intelligence from a multi-disciplinary
perspective; it sits at the intersection of the broad areas of neuroscience, linguistics, psy-
chology, philosophy, and computing science. Specifically, cognitive scientists are interested
in cross-applying the methodologies and theories from these fields in an effort to understand
cognition: the mental operations and structures relating to the brain. This section examines
an application of the hybrid error-detection methodology to cognitive science, specifically
the sub-area of psycho- and neuro-linguistics.
7.2.1 Error Detection: Applications in Neuro- and Psycholinguistics
Neuro- and psycholinguistics are two approaches to linguistics that fall under the purview
of cognitive science1. Although overlapping, in general, neurolinguistics sits at the inter-
section of linguistics and neurology, while psycholinguistics focuses on linguistics and the
psychological aspects of language. In particular, neurolinguists are interested in the struc-
ture underlying language in the brain, as well as the relation of the various components of
language (lexicon, syntax, semantics, phonology) among themselves, and correspondingly
to the structure of the brain itself. Of primary interest are the neuropsychological linguistic
mechanisms driving language and grammar. Although closely related, within psycholinguis-
tics, researchers are focused on the psychology of linguistic behaviour, including first and
second language acquisition, and the mental representation of language.
Toward furthering the research in these fields, computing science often finds its niche
in the creation of applications modeling current theories. In general, there are two (over-
lapping) goals within computer applications in cognitive science: to model, replicate, and
improve on human mental capabilities; and, to further the understanding of the human mind
through computer models. While the former concentrates on artificial intelligence with the
ultimate goal of meeting or exceeding human intelligence, within the latter, the goal is to
recreate human intelligence complete with human inadequacies in an effort to learn how
the mind works. When focusing on understanding actual human intelligence, creating an
adequate model to test one’s theory is a challenging and open problem that plagues many
subareas within the field.
The multi-heuristic method of the hybrid approach to error detection is useful from
the perspective of modeling human language representation and processing in children and
adults. Although the nature of the representation and handling of language remains open
to debate, several promising theories have been put forth. These theories exist at various
levels of representation, including the conceptual, rule-based, logical and image-based mod-
els; as well as a hybrid, multi-representational account of the mind and language processing
that spans these models [146, 64, 151]. The performance of the hybrid language analysis
present in the error-detection system can offer some insight into these arguments with the
1 This information was compiled in part from information provided at http://www.nytud.hu/depts/neuro/index.html (last accessed February 16, 2006).
demonstrated increase in performance from the multi-heuristic approach. If semantic, syntactic, and statistical methods in combination can be shown to outperform each method individually, this
may offer some support to the hybrid-representation theories and further the theory of the
modularity of language.
Furthermore, capturing errors via a multi-level technique allows for error analysis at
each of these levels, leading to a more in-depth characterization of the nature of errors.
This includes understanding the similarities and differences in error detection at each level,
which will in turn help future error-detection software as well as contribute to a deeper
understanding of the nature of these errors.
Non-Invasive Techniques for Speech Pathology
As an extension of this discussion, studying acquired cognitive deficits, such as speech
pathology, requires an understanding of the representation and handling of language within
the brain. While studying such pathologies can lead to better understanding of natural
language processing in computers, the reverse is also true – the modeling of such deficiencies
through natural language processing and knowledge representation can lead to a better
characterization of the nature of the pathology. This may be useful in the diagnosis of such
disorders and in the development of assistive technology for speech-impaired individuals.
The hybrid error-detection methodology is an example of a tool that can be applied to
cognitive language deficiencies. Acting as a model of the errors present within a patient’s
speech, the methodology can be applied to help characterize and diagnose such errors, and
could have ultimate extensions in an error correction utility for sufferers of such afflictions
(such as the case of high-functioning individuals).
Depending on the extent of injury, the language impairment may be selective (referred to
as aphasia), such as the inability to form syntactically appropriate sentences, or to correctly
interpret the meaning of words of a sentence, resulting in errors when speaking [151]. The
selective hybrid error-detection mechanism allows a more precise measurement of the injury
based on the distinctive and discriminating errors made when the patient is speaking, which
can include errors involving the lexicon, syntax, semantics, or even morphology.
The hybrid approach to error detection readily lends itself to division by error type. Such
knowledge of a speaker’s language and the corresponding errors can help in broadening the
understanding of the related processes within the brain (for instance, if statistical analysis,
as suggested by theories of “global lexical co-occurrence”2 [92], occurs in a fashion separate
from syntactic or semantic processing within the brain).
In addition to speech pathology, the error-detection algorithm is useful in developmen-
tal studies, such as formal and informal assessment procedures for syntactic, semantic, and
pragmatic aspects of oral and written language (including pathology diagnosis, as mentioned
above). By providing a computerized analysis of language samples, language efficacy can
be tested and qualitatively evaluated. This includes development aspects such as the order
of acquisition of various syntactic and morphological processes, as well as the nature of
errors made by children in spontaneous speech. This gives rise to comparison studies be-
tween developmentally delayed and normal children to help understand and diagnose specific
language impairment.
Finally, the modular design lends itself to expansion by further heuristics, which may
be used to test other cognitive theories that pertain to language and language pathology.
Within neurolinguistics, the error-detection results can be used to validate existing theories,
or postulate new ones by comparing the actual patient errors with the ones predicted by
current aphasic grammar models and corresponding theories.
Other Applications
Aligning the output of the error-detection analysis of speaker output with a magnetoen-
cephalography (MEG) or electroencephalography (EEG) study, if time stamped, would
allow errors and their type to be correlated with specific brain events or activation bursts.
This may in turn offer insights as to the processes occurring alongside each type of error. Al-
ternative forms of analysis, such as PET (and SPECT) and functional MRI, which monitor
glucose metabolism and changing blood flow to show patterns of activity within the brain,
can also be correlated with error occurrence. Such knowledge may further understanding
of the interaction between the multiple linguistic levels of processing in the brain and allow
for more in-depth functional mapping.
7.2.2 Error Detection and Language Acquisition
Despite an extensive history in the literature, many aspects of first-language acquisition in
children remain open problems [92]. This section takes a brief survey of areas in which the
2 Briefly, this refers to a word’s co-occurrence statistics.
hybrid error-detection methodology may find application.
Meaning Acquisition
A particularly difficult issue in language acquisition is that of meaning acquisition, a topic
which has generated two major hypotheses: the semantic bootstrapping hypothesis, and
the syntactic bootstrapping hypothesis [57, 109, 110, 88, 55]. As Li et al introduce, the
semantic bootstrapping hypothesis postulates that children learn syntax based on the un-
derlying semantics of language. This is driven by the ontological mappings of the world,
which constrain valid sentence construction [92]. In the syntactic bootstrapping hypothesis
the reverse is true: children glean semantic knowledge based on the grammatical context
of words and their corresponding semantic classes. Li et al explain that the underlying
assumption is that such classes are partly constrained by the syntax, meaning only certain
classes occur in certain syntactic scenarios [92].
Pinker [111] has suggested that syntactic bootstrapping can only give rise to knowledge of
categories of meaning, as opposed to actual “semantic content”. If one expands the context
of a word to include that word’s “total experience in the context of all other words with
which it co-occurs” [92, page 168], Li et al show that this criticism is no longer valid. This
“global context” can be likened to the co-occurrence analyses discussed in Chapters 4 and
5 and is not restricted to the immediate grammatical environment. This is in contrast with
the “local context”, which refers to only the immediate grammatical constraints acting on
a word (such as complement clauses) [92].
The Critical Period Hypothesis
The critical period hypothesis postulates a period of optimum language acquisition from
birth to puberty, after which the ability to learn a language sharply diminishes [138]. Sup-
porters of this theory suggest that such a change is due to the ways in which the brain
processes information past puberty. Evidence for this theory is in part shown in children’s
remarkable ability to develop a full grammar despite insufficient input (e.g. a lack of negative
grammar examples)3 [138]. Still, the nature of language within the brain is not completely
understood. Evidence of varying rates of second-language acquisition beyond puberty, as
3 This is referred to as the argument from the poverty of stimulus.
well as changes in attitudes towards learning and in continuity of learning (i.e. one may
actually fall out of practice, as opposed to suffering a deterioration of brain capacity), raises
serious questions that the critical period hypothesis must address.
With this in mind, understanding the nature of errors made at different points in lan-
guage development may help delineate any relevant age markers surrounding the so-called
critical period. Error-detection analysis may reveal evidence for a “language threshold”
after which the nature and number of errors made may drastically change. Alternatively,
it may instead reveal a gradual degradation in learning that is not marked by a sudden
decrease in ability, which may offer evidence against the critical period hypothesis.
7.3 Quality Control in NLP Applications
Within natural language processing (NLP)4 many of the major tasks involve summariza-
tion and/or translation, including document summarization, query-answering and machine
translation. In order for any of the applications within these sub-areas to be considered a
success, there must first be some means of evaluating performance to ensure accuracy and
establish the potential margin of error. While human evaluations are often considered the
“gold standard”, such studies can take months to complete [108]. Further, there is some
question as to their reliability and ultimate limitations [34]. Consequently, much recent work
has gone into evaluating and creating automated evaluation metrics for this very purpose,
such as IBM’s BLEU [108] and ISI/USC’s ROUGE [94]. Such automated techniques are
based upon the “N-gram” metric and have found recent favour within machine translation,
among other areas, for evaluating translation quality. This section explores the basic formu-
lation of such metrics, and suggests how the hybrid error-detection methodology may help
advance the field of automated evaluation.
“N-gram” Metrics for Machine Translation
Although originally introduced to machine translation (MT) within NLP, the so-called “N-
gram” metrics can be applied to a wide range of NLP tasks, united in the goal to establish
document quality on the basis of one or more “correct” reference document(s).
4 A more thorough introduction to NLP and subsequently to Medical Language Processing, or MLP, is provided in Chapter 2.
BLEU and ROUGE are two popular examples of N-gram-based models of language.
Like the context windows discussed in Chapter 4, N-gram models rely on words’ contexts
to determine an individual word’s “N-gram count” in a text, a useful feature for many
calculations. Here “N-gram” refers to the context size (the words preceding or following the
target word): a “unigram” is the word itself, while “bigram” and “trigram” refer to two-
and three-word tuples, respectively. The “N-gram count” is how often a particular N-gram
occurs in the training corpus.
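To make the counting step concrete, the following is a minimal sketch in Python; the toy corpus and the function name are illustrative only, and are not part of BLEU or ROUGE themselves:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram (tuple of n consecutive words) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

corpus = "my favourite flavour is vanilla and my favourite colour is blue".split()
unigrams = ngram_counts(corpus, 1)
bigrams = ngram_counts(corpus, 2)

print(unigrams[("favourite",)])      # unigram count of "favourite": 2
print(bigrams[("my", "favourite")])  # bigram count of "my favourite": 2
```

The same function yields trigram counts with n = 3, mirroring the context-window sizes discussed in Chapter 4.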
Essentially, the N-gram metric relies on the notion of “N-gram precision”, computed by
aligning all N-gram counts in the source document with those in the reference documents
[108]. If one considers the word-error rate (WER) calculation used in speech recognition
as a measure of the distance between a document and the underlying, “true” document,
then this measure can be adapted to measure the degree of alignment between a source text
and the reference translation(s) [34, 108]. Like the work in Chapter 4, N-gram evaluation
methods such as BLEU calculate the WER based on the probability of a word occurring in a
given context by conditioning on the preceding words, and are dependent on the context-window
size. For example, consider the following:
"My favourite flavour is vanilla."
Instead of calculating the individual probability of each word in its context, the probability
of the entire sentence above is approximated by examining a limited context window for
each word within the sentence, and combining the resulting probabilities [34]. This gives us
the following, where “<s>” represents the sentence boundary:
P(my, favourite, flavour, is, vanilla) = P(my | <s>) · P(favourite | my) · · · P(vanilla | is).    (7.1)
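Equation 7.1 can be computed directly from such counts. The sketch below uses maximum-likelihood bigram estimates; it is a simplified illustration with an invented training text, not the implementation used by the heuristics in Chapter 4:

```python
from collections import Counter

def bigram_probability(sentence, training_text):
    """Approximate P(sentence) as a product of P(w_i | w_{i-1}) terms,
    estimated by maximum likelihood from bigram and unigram counts."""
    tokens = ["<s>"] + training_text.split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    words = ["<s>"] + sentence.split()
    p = 1.0
    for prev, word in zip(words, words[1:]):
        if unigrams[prev] == 0:
            return 0.0  # unseen history: the estimate collapses to zero
        p *= bigrams[(prev, word)] / unigrams[prev]
    return p

# Training on the sentence itself makes every factor 1, so P = 1.0:
print(bigram_probability("my favourite flavour is vanilla",
                         "my favourite flavour is vanilla"))  # 1.0
```

A practical model would smooth these estimates; here any unseen bigram simply drives the sentence probability to zero.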
While a unigram measure can be used, Papineni et al observe that a translation of a text
that uses the same words as the reference translation, but in random order, will still have
a high unigram overlap yet poor fluency (i.e. document coherence). Thus, a
measure of the length of consecutive matches, which is achieved through the longer N-gram
matches, is needed to account for overall document fluency.
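This fluency argument is easy to demonstrate. The sketch below computes a clipped n-gram precision in the spirit of BLEU (a single-n simplification, not the full metric with its brevity penalty and geometric mean over n): a scrambled candidate retains perfect unigram precision against the reference but loses its bigram precision entirely.

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Fraction of the candidate's n-grams that also occur in the reference,
    clipping each n-gram's credit at its count in the reference."""
    def counts(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = counts(candidate), counts(reference)
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(cand.values()), 1)

reference = "my favourite flavour is vanilla"
scrambled = "vanilla is my flavour favourite"  # same words, random order

print(ngram_precision(scrambled, reference, 1))  # 1.0
print(ngram_precision(scrambled, reference, 2))  # 0.0
```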
Lately the N-gram methods have met with criticism for their lack of restriction on
translation coherence or grammaticality. As Culy and Riehemann note, the above N-gram
metrics are not “measures of translation goodness” but rather of document similarity, and
rely on the assumption that “a good translation of a text will be similar to other good
translations of the same text” [34, page 71]. In their experiment, however, they noted
that incomprehensible (low fluency) machine translations could still score higher than a
fluent human translation using these metrics. Furthermore, the quality of the translation
assessment was directly dependent on the number of reference translations available. Culy
and Riehemann conclude that while not terrible, the results of N-gram metrics are not great,
either.
Improving on “N-gram” Metrics using Error Detection
In response to the above weaknesses of existing N-gram-based metrics for evaluation, the
hybrid error-detection algorithm can offer an alternative measure of machine translation
quality.
Observation 8 Machine translation errors often result in grammatical errors.
Observation 9 The word-error rate is an indication of translation quality.
Observations 8 and 9 are based on the assumption that a poor translation will have a high
number of errors, where those errors can be syntactic or semantic (as in speech recognition,
a fixed lexicon precludes errors at the lexical level). Consider the following translation
candidates from Papineni et al [108]:
It is a guide to action which ensures that the military always
obeys the commands of the party. (Sentence 2)
It is to insure the troops forever hearing the activity guidebook
that party direct. (Sentence 3)
It is clear that Sentence 3 is a poor translation candidate based upon its ungrammat-
icality. The hybrid error-detection method is able to detect such errors, among others.
Instead of relying on the similarity between a machine-translated document and a series of
reference translations (which can be erroneous themselves), this algorithm requires only the
machine translation output being tested. This reduces complexity and yields a direct
qualitative assessment, rather than a similarity score that presumes the reference
documents are themselves correct.
From a different perspective, the hybrid algorithm might be considered an alternative
similarity measure, where the document being analysed is compared to the features of what
would be the grammatically correct text.
The hybrid nature of the error-detection algorithm also allows for discovery of errors
based on type, giving rise to a qualitative evaluation of a document as being “semantically”
sound, despite syntactic errors. This indicates that the concepts within the translation
are correct, meaning the document may, in fact, provide an adequate gist of the original.
Semantic errors, on the other hand, may indicate more fatal flaws in the resulting translation.
As with the other applications of the hybrid error-detection methodology, a sufficient
training corpus characterizing the relevant domain is required for the statistical analysis
portion. However, depending on one’s goals, and the quality of the syntactic and seman-
tic rule-based component, the choice may be made to omit statistical analyses from the
heuristics used.
Related Areas of Application in NLP
The issue of the evaluation of automatically generated text extends beyond machine trans-
lation to other research areas within NLP. It is often the case that a poor translation or
output in one domain of NLP will have syntactic and semantic errors at the very least.
Thus, the error-detection methodology is useful in any evaluation of an NLP technique
where the output is susceptible to grammatical errors introduced as a result of the process
in question. This includes machine translation, document summarization, question answer-
ing, natural-language generation, information extraction and computer-assisted document
proofing. Errors or grammatical inconsistencies present in the output text can indicate
flaws in the underlying generation algorithm, much as such errors indicate translation
flaws in the machine-translation case above.
7.4 Summary
This chapter has surveyed a range of applications of the hybrid error-detection methodology,
beyond recovery from low recognition rates in radiology report dictation. This establishes
the methodology within the broader context of error detection, and demonstrates the wider
extent of the contributions presented so far. Quantitative analysis of these theories is left
as future work with the hope that the ideas presented within this chapter will serve as
inspiration for further research, including other unique applications of the methodology not
covered here. In the following, final chapter, the conclusions, contributions and consequences
resulting from the research presented in this dissertation are summarized.
Chapter 8
Conclusions
Lured by efficient services with respect to time and money, as well as improved patient
care, medicine continues to incorporate artificial-intelligence technologies more fully into
the existing armamentarium. This includes the gradual replacement of transcriptionists
with ASR systems, and the addition of automated summarization systems in the radiology
department.
Despite the trend towards automation in the reading room, ASR remains a weak al-
ternative to traditional transcription. This is attributable to poor accuracy rates and the
wasted resources spent on proofreading erroneous reports.
The work presented here was motivated by poor integration and low accuracy rates
of ASR in radiology, and the frustration of radiologists with the technology. This had
introduced many delays and wasted resources, including the need to extensively proofread
reports to search for recognition errors, any of which could have serious consequences for
the patient.
In addressing these issues, this dissertation has made several contributions to the field
of error detection. Early in the research a lack of a comprehensive theory of error detection
in ASR was noted, leading to the development of a classification of error-detection methods
to account for this absence and providing an objective measure of existing and future ASR
error-detection endeavours. In addition, the nature of recognition errors was extensively
investigated, providing the foundation for the hybrid methodology in Chapter 3.
A hybrid methodology was postulated as a multi-heuristic, modular approach to ASR
error detection in radiology report dictation (see Chapter 4). This methodology was built
upon the notion of the complementary coverage of error types in different error-detection
heuristics. Four AI-inspired heuristics were developed and analysed based upon their varying
strengths with respect to relevant error types and overlapping coverage (to help ensure over-
all system reliability). These included two probabilistic, N-gram-based heuristics: pointwise
mutual information (inspired by the work of Inkpen and Desilets [75]) and co-occurrence
analysis (inspired by previous work by Voll et al [154]). The remaining two heuristics are
non-probabilistic methods: a parser inspired by constraint handling rules, chosen for their ease of
development and suitability for the purposes of proof of concept, and a conceptual distance
metric, inspired by previous work [23].
When the heuristics are combined, the result is a high-coverage, high-accuracy error-
detection system. This was demonstrated with a proof of concept applying the hybrid
methodology to detection of errors in actual radiology reports (presented in Chapter 5).
Most notably, the hybrid approach achieves a 24% recall improvement over any individual
heuristic. This technique shows promise as an effective means to recover from the unaccept-
able accuracy rates of ASR. Flagging potential errors enhances the proofreading process,
restoring the benefits of ASR in resources saved. The result is a more efficient reading room
and an improved experience with ASR.
The implications of the hybrid error-detection methodology were presented, both within
radiology and within error detection and report summarization in general. In addition, a
wide range of avenues for future work within radiology report dictation
and ASR were put forth, including the challenges currently faced by the methodology and
ASR in medicine.
The hybrid methodology is not limited to the domain of radiology. To illustrate the
applicability to the general context of error detection, two non-medical domains, namely
speech pathology and machine translation, were offered as theoretical applications of the
methodology and inspiration for further work (see Chapter 7).
With this in mind, the research questions first posed in Chapter 1 can now be revisited.
Foremost, the question of improving the accuracy of speech recognition in radiology
has been answered with the novel solution of a post-recognition, hybrid error-detection
methodology. This methodology not only demonstrates how error detection is applicable to
radiology report dictation, but it advances current methods of error detection in general.
Previous attempts at error detection, along with other relevant work, were first examined
to characterize the current state of error detection. The nature of recognition errors was
explored, with a breakdown of error type categories, along with the linguistic levels at which
they occur in Chapter 3. The hybrid error-detection methodology and its implementation
combined the extant relevant work; the error-type analysis; and the investigation
of the properties of ASR, ASR-related errors, and radiology reporting. The implications of
this research were explored not only in the ramifications relevant to radiology, but also in
applications outside of medicine.
I lastly reiterate the contributions of this research to the field of error detection in ASR:
• A classification of error-detection methods for speech recognition.
• A hybrid error-detection methodology.
• A successful proof of concept applying the hybrid methodology to radiology report
dictation.
• Two theoretical applications of the technology beyond the domain of radiology.
In conclusion, I submit that the research within this dissertation supports each of the
hypotheses originally presented in Chapter 1:
• As a post-processing stage, methods in medical language processing can effectively de-
tect recognition errors in radiology reports dictated via automatic speech recognition.
• Combining complementary methods of error detection results in improved sensitivity
to report errors.
• Tagging erroneous reports based on the quality of their output can avoid the need for
an in-depth re-read of the report.
• Post-recognition error detection is a viable means to improve ASR in radiology re-
porting.
• Post-recognition error detection has applications beyond radiology reporting.
Therefore, it is concluded that post-speech-recognition, hybrid error detection is an ef-
fective means to recover from low recognition rates in radiology report dictation.
Appendix A
Glossary of Medical and
Non-Medical Terms
A.1 Radiology
Image Modalities Methods for generating internal images of the body. Modalities
include:
• Magnetic Resonance Imaging (MR) A radiology imaging modality that uses radiofre-
quency waves and a strong magnetic field to take an image of internal tissues. It is
particularly well-suited for soft tissue analysis.
• Radiography (X-ray) An internal image of the body that relies on the different absorp-
tion of X-rays by varying tissues in the body. The X-rays that are not absorbed pass
through the body into the film. These exposed areas of the film turn dark, leaving the
white, characteristic patterns of the bones (which tend to absorb most of the X-ray
energy and hence do not pass through to affect the film).
• Computed (Axial) Tomography (CT/CAT) A technique for obtaining multiple X-ray
images of different angles of the body. These images are then combined using a
computer to generate cross-sectional views of the body.
• Positron emission tomography (PET) A technique for acquiring images based on the
detection of emissions from radioactive substances (typically injected into the body).
Like an X-ray, the emissions are detected using a film that differentiates areas of high
versus low emissions. Where the radioactive substance collects in the body, this will
produce greater emissions that are detectable on the film.
Picture Archiving and Communication System (PACS) In radiology, this sys-
tem manages the acquisition, storage, transmission and display of digital (filmless) images
on a computer network.
Radiograph A picture produced by a radiology imaging technique, such as an X-ray.
This image may be generated traditionally using film, or digitally as a filmless image stored
on a computer.
Radiology A branch of medicine that applies radiant energy or radioactive material
in the diagnosis and treatment of disease.
Radiology Report The radiologist’s report based upon his analysis of a radiograph.
Typically this is dictated and recorded at the time of the image examination, and later tran-
scribed by a stenographer. In the case of automated speech recognition, the transcription
occurs simultaneously with dictation.
• Free-text Report A radiology report in which the information is in unstructured, nat-
ural language format. Contrast with structured reporting.
• Structured Reporting A reporting system in which the information is standardized and
structured. The radiologist is confined to a predefined format and often prompted for
information, in contrast with a free, unrestricted dictation of the report.
Reading Room A room containing large light boxes or computer screens in which a
radiologist examines radiological images, such as X-rays.
Transcriptionist A person responsible for transcribing an audio text dictation. Also
called a stenographer.
Turnaround Time (TAT) The time it takes from report requisition (i.e. a physician
requesting a scan) until the dictated report is completed and signed off by the radiologist.
A.2 Computational Linguistics/ Knowledge Representation
Accuracy In NLP, a measure of the performance of any NLP algorithm (see also
Evaluation). In general, accuracy is defined by the following three measures (defined with
respect to error detection):
• Precision A measure of the number of relevant errors (true positives, TP) found ver-
sus the number of relevant and irrelevant errors (true and false positives, TP and
FP). It essentially asks: of the errors found, how many corresponded to actual errors?
P = TP / (TP + FP)    (A.1)
• Recall A measure of the number of errors identified (true positives, TP) versus the
total number of errors actually present (true positives and false negatives, TP and
FN). It essentially asks: of the actual errors, how many did the algorithm find?
R = TP / (TP + FN)    (A.2)
• f-Measure Although precision and recall are inversely related, the f-measure provides
a combined measure of performance. Typically precision and recall are weighted evenly,
resulting in the following definition:
F = (2 · P · R) / (P + R)    (A.3)
Collocation In linguistics, this refers to one or more words that occur frequently
together and generally have connotations beyond the meaning of the component words. For
the purposes of this document, this definition has been relaxed to refer to any two-word
consecutive pairing in the text.
Constituent A functional unit of one or more words. A group of words is considered
a constituent of some type if all constituents of that type can occur in similar syntactic
environments [81]. For instance, a constituent of type noun phrase (that is, a grouping
of words around a noun) can occur before a verb. The following are examples of other
constituents:
The cat in the hat
Marcy hiccupped
The pickle that ended up in the sandwich tasted
Alternatively, a grouping of words may be considered a constituent if they must be
moved as a unit to a different position in a sentence, in contrast to moving the component
words individually. For example, the prepositional phrase “by October sixth” can be placed
in a number of different locations: at the beginning, the middle, or the end
of a sentence [81]. Consider the following:
By October sixth, I would like to finish my thesis.
I would like, by October sixth, to finish my thesis.
I would like to finish my thesis by October sixth.
*By, I would like to finish my thesis October sixth.
It is possible to move the prepositional phrase as a unit, but not the individual words
making up the phrase (as shown in the last sentence). This ability to identify constituent
structures is important in identifying patterns in the language, as well as understanding
how words work. This notion is used in Section 4.5 to constrain the subset of words created
from a sentence.
Constraint Handling Rules Grammar (CHRG) A logical grammar formalization
built upon constraint handling rules (CHR [51]) and developed by Henning Christiansen [29].
Christiansen observes that the relationship between CHRs and CHRGs is analogous to the
relationship between Prolog and definite clause grammars (DCGs).
Computational Linguistics The use of computers to augment linguistic study. Of-
ten equated with natural language processing.
Co-occurrence Relations A relationship defined between a target word and the
words with which it co-occurs in a text. These words are referred to as the “context” and
are defined by those words occurring within a certain radius of the target word. This radius
is defined by the “window size”. A window size of three, for example, refers to the three
words before and after the target word (thus the total window size is six).
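As a minimal sketch (the function and example sentence are mine, not the thesis's), the context of a target word under a given window size can be extracted as follows:

```python
def context_window(tokens, index, window_size):
    """Return the words within `window_size` positions before and
    after the target word at `index` (the target itself excluded)."""
    before = tokens[max(0, index - window_size):index]
    after = tokens[index + 1:index + 1 + window_size]
    return before + after

tokens = "the scan shows a small nodule in the left lung".split()
# Target "nodule" (index 5), window size 3 -> 3 words either side:
print(context_window(tokens, 5, 3))
# -> ['shows', 'a', 'small', 'in', 'the', 'left']
```

Near the edges of the text the window is simply truncated, so the total context can be smaller than twice the window size.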
Evaluation The following measures are used in the evaluation of many language-
processing techniques, including the hybrid error-detection methodology presented in this
document:
• True Positive A correct identification of an error.
• True Negative A correct identification of a non-error.
• False Positive An incorrect identification of an error.
• False Negative An incorrect identification of a non-error.
f-Measure See Accuracy.
Grammar The rules governing a language and appropriate utterances within that
language.
Lexeme Any meaningful linguistic unit.
Linguistics The study of human language, including structure, meaning and evolu-
tion.
Medical Language Processing (MLP) The application of NLP technology to the
medical domain.
Natural Language Processing (NLP) The subfield in artificial intelligence that
deals with the processing of natural human languages, such as English. This processing
includes translating human language into a formal representation that a computer can ma-
nipulate, as well as the reverse: translating a formal computer representation into natural
language. This discipline comprises many areas of inquiry, including, but not limited to,
the following:
• Speech recognition
• Natural language generation
• Information retrieval and extraction
• Machine translation
• Question answering
• Automatic summarization
While the goals of NLP and computational linguistics often overlap, the motivations are
slightly different. Within NLP the goal is the ability of computers to process natural
language, often irrespective of the underlying linguistic theory. This is in contrast with
computational linguists who seek to augment linguistic knowledge of human languages with
computers.
Natural Language Understanding The ability of a computer to process language
and respond to that language in a way that mimics human understanding of language. That
is, the responses are appropriate based on the expectations of the user and/or domain.
Parser A program that disassembles language into its component parts using a set of
rules known as a grammar.
Phone An independent speech sound event.
Pointwise Mutual Information (PMI) A measure of the information provided by
an event (such as a word in a text) about the occurrence of another event (or word):
PMI = log( P(x, y) / (P(x) · P(y)) )    (A.4)
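The PMI of a word pair can be estimated from raw corpus counts. A small Python sketch (the counts and word pair are invented for illustration):

```python
import math

def pmi(count_xy, count_x, count_y, n):
    """Pointwise mutual information estimated from raw counts over a
    corpus of n tokens: log( P(x, y) / (P(x) * P(y)) )."""
    p_xy = count_xy / n
    p_x = count_x / n
    p_y = count_y / n
    return math.log(p_xy / (p_x * p_y))

# If two words each occur 10 times in 10,000 tokens and co-occur
# 8 times, the pair is strongly associated (PMI well above zero):
print(round(pmi(8, 10, 10, 10_000), 2))  # -> 6.68
```

A PMI near zero indicates the words co-occur about as often as chance predicts; a large positive value indicates a strong association, as with a collocation.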
Precision See Accuracy.
Property Grammar A grammar that represents linguistic information via proper-
ties that describe the rules of the language. Parsing is the process of ensuring that these
properties are met within a text. This representation lends itself naturally to expression via
constraints (and hence CHRGs).
Prosody The quantitative and qualitative properties of speech (not text) involving
intonation, cadence, and stress.
Recall See Accuracy.
Statistical NLP A subfield of NLP that relies on a statistical account of the common
patterns of natural language usage. Also called corpus-based linguistics. This is in contrast
with rule-based approaches, which seek to characterize language on the basis of proposed
rules that govern linguistics.
Stop Word A word that is considered to have little overall information with respect to
a text, usually because of its high frequency within that text. This is generally irrespective
of the word’s semantic role within the language.
Stop List A list of the words treated as stop words.
Ontology A formalized taxonomic representation of semantic knowledge. Typically
comprised of relations such as “is-a”. Refer to Appendix B.
Utterance A spoken language segment.
A.3 Automated Speech Recognition
Automated Speech Recognition (ASR) The automated recognition of spoken text
by a computer. As discussed in Chapter 3, ASR can refer more broadly to any AI tasks
involving speech, however, for the purposes of this document, it refers only to the recognition
of human speech.
Recognition Error A word incorrectly transcribed by the speech recognizer.
Word Error Rate (WER) The rate of recognition errors produced by an ASR pro-
gram. This can be measured in a variety of ways. For the purposes of this document, the
WER is defined as follows:
Cor(d) = 1−WER
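One common way of measuring WER, used here only as an illustrative sketch, is word-level edit distance (substitutions, deletions, and insertions) normalized by the number of reference words; the example transcripts are invented:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between a reference transcript and an
    ASR hypothesis, divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for the edit distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate("no acute cardiopulmonary disease",
                      "no cute cardiopulmonary disease seen")
print(wer)  # 1 substitution + 1 insertion over 4 words -> 0.5
```

Under the definition above, this document's correctness measure would be Cor(d) = 1 − WER, i.e. 0.5 for this example.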
A.4 Miscellaneous
Black Box Anything for which the internal workings are unknown to the user.
Constraint Handling Rules (CHR) CHR was first proposed by Frühwirth in 1998
[51] as a declarative, high-level language for expressing and solving constraints. Comprised of
guarded rules (rules with restrictions on their application), CHRs permit constraint simpli-
fication and propagation through constraint rewriting. The constraint engine is responsible
for determining when CHRs apply, that is, when constraints are collapsed into simpler ones,
or when they give rise to new constraints. See also constraint handling rules grammars
(CHRGs).
Hybrid Method Any composite method comprising multiple techniques applied to
the same problem to achieve a single outcome.
Appendix B
Ontologies in Healthcare
B.1 Introduction
In recent years, the term “ontology” has become somewhat of a buzzword – overused and
under-defined. Its recent ubiquity within computing, and in particular medical informatics,
has had the unfortunate side effect of an erosion of meaning and increased ambiguity in day-
to-day usage. Thus, the obvious and immediate task is to define “ontology” and establish
a consistent level of discourse. Beyond that, this chapter considers the motivation behind
ontological research as it pertains to medical informatics, the principles of good ontology
design, and the major systems in use today.
B.1.1 Controlled Medical Vocabulary
Within medical informatics there exist information structures called controlled medical vo-
cabularies. These represent and classify medical terms in order to systematize the represen-
tation of medical concepts. In general, however, they tend to focus on nouns and lack an
expanded vocabulary required to handle non-medical terms such as qualifiers (e.g. size or
degree) [80].
B.1.2 Semantic Lexicon
Like a syntactic lexicon that provides information regarding part-of-speech, et cetera, a
semantic lexicon is a “resource that maps a lexical item (word or phrase) to one or more
semantic types” [80, page 206]. Such a lexicon is necessary to translate natural language
APPENDIX B. ONTOLOGIES IN HEALTHCARE 150
text into a format that a computer can understand and manipulate. These semantic types
stand for entities within the domain of discourse and can be systematically arranged into an
ontology. An ontology defines meaning formally within a domain by virtue of its structure
and the interrelations it defines between the semantic types [80].
B.1.3 Ontology
In the literature, the term “ontology” has become an over-generalization, referring instead
to a continuum of terminological knowledge. Ontologies, in the strict sense, exist at one end
of the continuum, while loosely controlled sets of concepts or terms exist at the other [14].
The unfortunate consequence of this has been needless ambiguity and a dilution of the actual
meaning of the word. As Bodenreider observes, “although more than sixty terminological
systems exist in the biomedical domain, few actually qualify as an ontology” [15].
The name “ontology” derives from philosophy where it embodies a “systematic account
of Existence” [58]. Within knowledge-based systems, this account of existence is simply
that which can be represented. On the most fundamental level when referring to a body of
knowledge one can talk about a conceptualization: “the objects, concepts, and other entities
that are presumed to exist in some area of interest and the relationships that hold among
them” [58]. In other words, the abstract and simplified representation of the domain of discourse
that lies beneath any knowledge base.
It follows that an ontology is “an explicit specification of a conceptualization” [58] – a
formalism for a shared base of knowledge or understanding within a community that permits
intelligent discourse. More specifically, it identifies through a representational vocabulary
the entities, or semantic types, that exist within that domain of discourse, and the logical
relationships between them. An ontology is an ontology by virtue of the logical framework
that it sits upon. This framework is a defining feature of a true ontology and provides the
strict laws governing the relationships within as well as the addition of new concepts, and
exists independently of any specific application of the terminology [14, 15]. It is a direct
result of strict adherence to this formal structure that ontologies achieve their ability to
support sound reasoning (via inheritance or subsumption) and hence their power.
B.1.4 The Continuum of Knowledge Representation
If (reference) terminologies represent a continuum of increasing formalization and structure,
then ontologies exist at the far, formal, end. At the opposite end, coding schemes exist
as comparatively flat and contrived structures to which medical codes are allocated. The
conditions determining classification are coded in one place only and do not capture any
relationships among the terms. Examples include the ICD and DICOM. Such schemes
seek to improve consistency in medical terminology, but do not support depth of structure,
and thus limit reasoning. They can be used as a terminological repository for populating
ontologies or other medical collections.
Taxonomies are the next step up from simple coding schemes, and can range from a min-
imally (or poorly) structured representation of the domain to a fully structured represen-
tation. Existing as hierarchies denoting taxonomic is-a relationships via subordination[21],
taxonomies were in part conceived to address the naming problem, the task of “determining
a controlled set of language labels” [14]. In this way, taxonomies account for variations in
terminology that arise when the same concepts are expressed by different sources (often
for different applications) [103]. Although in theory taxonomic relationships are conducive
to information storage and reuse, in reality many taxonomies are poorly or insufficiently
structured, lacking the necessary rigor for formal inheritance [14, 161]. Worse, it is often
the case that the inter-relationships within a taxonomy contain non-taxonomic relationships
such as meronomy (part-of ) or hyponymy (kind-of ) that have not been explicitly defined
or even acknowledged. Thus, attempts to reason on the assumption of taxonomy are open
to inconsistency and error [13].
Although within the literature taxonomies are often equated with ontologies, only a
well-constructed, rigorous taxonomy that supports formal reasoning is a true ontology. On-
tologies are challenging to construct and are computationally complex for anything beyond
a trivial domain. The high degree of formality, however, is not always necessary depend-
ing on the application. Semantic spaces exist as intermediary taxonomies between basic
taxonomies and ontologies. While increasing in formality, though still lacking the full
power of ontologies, semantic spaces introduce more in-depth and controlled relationships
amongst the terms in the hierarchy [14]. By adding these restrictions on the data, semantic
spaces increase the accuracy of information retrieval, though still fall short of full reasoning
capabilities.
[Figure B.1: The Knowledge Continuum. Terminologies arranged by increasing degree of formalization: coding schemes (e.g. ICD, DICOM), taxonomies, semantic spaces, and ontologies (e.g. SNOMED-CT, UMLS).]
Although most taxonomies are not themselves ontologies, Bodenreider does point out
that it is possible to add the “formality and consistency to the organization of a partially
structured set of concept[s]” [14] in order to create a true ontology.
It is important to consider that the rigid formality expected in the philosophical sense of
ontology is only a logical approximation in the knowledge representation sense [117]. Rector
observes that the concepts expressed in language elude the concrete expression of logic due
to the “flexible fluid dependence on context”. Therefore, the knowledge found within an
ontology is at best an approximation of that knowledge, which will forever remain open
to further specification in a tradeoff between expressivity and tractability [117]. This is
not necessarily a fault of the ontological specification, but a reflection of the dynamic and
productive nature of language. A definition of “ontology” demands a maximally formal
representation given this consideration.
In addition to “ontology”, a second ambiguous term, “vocabulary”, is often used to re-
fer to any of the above knowledge structures, or to refer to the terminology representing
the entities within an ontology [58]. Figure B.1 illustrates these terms and their coverage.
Many of the existing coding or classification systems lack sufficient granularity for capturing
the relevant information in medicine [5]. Attempts to expand these systems have resulted
in “combinatorial explosions” and terminologies that are simply too large to maintain [5].
Furthermore, without any explicit representation of the relationships within these termi-
nologies they are largely unmanageable and it is difficult to write software using them [5].
There have been major attempts at improving such terminologies through imposing formal
structure either on existing systems (e.g. SNOMED CT), or by designing from the bottom
up (e.g. GALEN). An introduction to existing terminologies is presented in Section B.3.1.
B.1.5 Principles of Good Ontologies
Alan Rector identifies four types of information typically encountered in medical informatics
[114]:
1. Information on individual patients (i.e. medical records);
2. Information on populations of patients;
3. Information on institutions and the health care system;
4. Information on the current state of knowledge of best medical practices (i.e. knowledge
management and decision support in its widest sense).
He also lays out four primary and four secondary tasks for processing this information,
namely:
Primary Tasks:
1. Entering patient data;
2. Presenting information about particular patients;
3. Patient population information retrieval (e.g. a query-answer system);
4. Sharing and integrating information.
Secondary Tasks:
1. Navigating and browsing information;
2. Authoring knowledge;
3. Indexing knowledge;
4. Analyzing and generating natural language.
In order to handle these tasks, the representation and processing of information relies
heavily on the terminology upon which the system is designed. Despite this, Rector cautions,
“there is as yet no proof that a general re-usable terminology serving all of the aspirations for
clinical information systems is possible.” [114]. Even so, the growing need for terminological
representation in medical informatics continues to drive research. While no single ontology
will serve all medical informatics applications, adherence to good ontological principles will
allow multiple ontologies to be linked together, achieving what a single one could not.
The principles of good ontological design can be split into three main areas: the principles
of classification, the principles of inheritance, and the principles of partial ordering.
Principles of Classification
Classification allows the organization of information on the basis of relationships as opposed
to knowledge in isolation (such as a traditional dictionary) [100]. This organization can vary
in formality from single-level coding schemes to the formal structure of an ontology. For
consideration as an ontology, however, a terminology must adhere to the following principles
of classification [114, 100, 15, 21].
1. Subordinate classes must be mutually exclusive – that is a concept must be uniquely
identified.
2. Subordinate classes must be jointly exhaustive – that is the implication that there are
no further concepts than what have been represented in the classification.
3. A hierarchy can have only a single root.
4. Each class must have at least one parent.
5. Non-leaf classes must have at least two children.
6. Each child must differ from its parent; siblings must differ from one another.
7. Adherence to the Economy Principle [21]:
Rule 1 Assign the most specific semantic type available.
Rule 2 Assign multiple semantic types if necessary.
Rule 3 Assign a less specific semantic type if no more specific semantic type is available.
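Several of these structural principles are mechanically checkable. The sketch below (a hypothetical helper; it assumes single inheritance and covers only the single-root, two-children, and child-differs-from-parent principles) validates a hierarchy given as a child-to-parent mapping, using an invented meningitis fragment as data:

```python
def check_classification(parents):
    """Check a subset of the classification principles on a hierarchy
    given as a child -> parent mapping (single inheritance assumed).
    Returns a list of violation messages; an empty list means the
    checks pass."""
    problems = []
    classes = set(parents) | set(parents.values())
    roots = [c for c in classes if c not in parents]
    if len(roots) != 1:                       # Principle 3: single root
        problems.append("hierarchy must have exactly one root")
    children = {}
    for child, parent in parents.items():
        children.setdefault(parent, []).append(child)
        if child == parent:                   # Principle 6: child differs
            problems.append(f"{child} is its own parent")
    for cls, kids in children.items():
        if len(kids) < 2:                     # Principle 5: >= 2 children
            problems.append(f"non-leaf class {cls} has fewer than two children")
    return problems

hierarchy = {"Infective Meningitis": "Meningitis",
             "Non-infective Meningitis": "Meningitis",
             "Viral Meningitis": "Infective Meningitis",
             "Bacterial Meningitis": "Infective Meningitis"}
print(check_classification(hierarchy))  # -> []
```

Mutual exclusivity and joint exhaustiveness (Principles 1 and 2) concern the meanings of the classes and cannot be verified from the graph structure alone.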
Principles of Inheritance
In addition to the principles of classification, it is possible to derive a series of principles of
inheritance existing between a parent and its child, or between a class and its immediate
subordinate [16]:
1. Unique differentia. That is, the differentia (distinguishing criteria) from child to parent
should uniquely result either from the refinement of the value of a common role, or the
introduction of a new role [21]. For example, the introduction of the role CAUSATIVE
AGENT with value Infectious Agent explains the subsumption relation of Meningitis
to Infective Meningitis. Similarly, the subsumption relation of Infective Meningitis
to Viral Meningitis is explained by the refinement of the role value for CAUSATIVE
AGENT since Infectious agent subsumes Virus [21].
2. If A is a child of B, then all properties of B are also properties of A (via inheritance).
3. Cycles are forbidden.
4. Adherence to the Sibling Opposition Principle [21]: A category must be opposed to its
siblings via some differentia that is fundamentally unresolvable within the ontology.
Note that Inheritance Principles 1 and 4 essentially extend Classification Principle 6.
It is also important to note that the Sibling Opposition Principle and the Economy
Principle are not accepted by all researchers as desired qualities. The Sibling Opposition
Principle says that in order to maintain unambiguous representation, children of a class must
stand in opposition to one another. That is, they must differ in a way that is fundamentally
unresolvable within the ontology [19, 21]. The validity of this assumption comes under
question, however, with respect to certain concepts that stand better in relations of scale
than of opposition, such as differentia that cannot be defined with precision (i.e. discretely)
[21].
Similarly, the Economy Principle was developed by the designers of the UMLS to prevent
unnecessary categories from being represented. One of its sub-principles explicitly requires
that the relations stand in a strict hierarchy (that is no hybrid types inherit from two super-
classes [21]). Critics of this principle, however, observe that hybrid subtypes are sometimes
necessary to capture the full essence of a complicated concept [21].
The Economy Principle and Classification Principle 2 have been jointly referred to as
the Principle of Orthogonal Taxonomies, stating that properties and differentiae must be
“represented explicitly and independently, even at the cost of apparent redundancy” [114].
Instead any information gained from hybrid or multi-parent classifications must “be inferred
from the descriptions and definitions” [114]. In theory, this makes it possible to re-arrange
the hierarchy along any axis (e.g. anatomy, pathology, et cetera).
Principles of Partial Orderings
Lastly, in order to ensure compatibility, an ontology must adhere to the intrinsic principles
of partial orderings, essentially a mathematical definition of hierarchy [13].
1. Reflexivity Every element of a set is related to itself.
2. Antisymmetry If x is related to y and y is related to x, then x and y are equal.
3. Transitivity If x is related to y and y is related to z, then x is related to z.
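These three properties can be verified directly on a relation represented as a set of ordered pairs. The sketch below (names and the toy is-a relation are mine) assumes the relation has been enumerated explicitly, including its reflexive and transitive closure:

```python
def is_partial_order(elements, relation):
    """Verify reflexivity, antisymmetry, and transitivity for a relation
    given as a set of (x, y) pairs meaning 'x is related to y'."""
    reflexive = all((x, x) in relation for x in elements)
    antisymmetric = all(x == y
                        for (x, y) in relation if (y, x) in relation)
    transitive = all((x, z) in relation
                     for (x, y) in relation
                     for (w, z) in relation if y == w)
    return reflexive and antisymmetric and transitive

# is-a over a tiny hierarchy, closed under reflexivity and transitivity:
isa = {("viral meningitis", "infective meningitis"),
       ("infective meningitis", "meningitis"),
       ("viral meningitis", "meningitis"),
       ("viral meningitis", "viral meningitis"),
       ("infective meningitis", "infective meningitis"),
       ("meningitis", "meningitis")}
elements = {"viral meningitis", "infective meningitis", "meningitis"}
print(is_partial_order(elements, isa))  # -> True
```

Dropping any of the reflexive pairs, or adding a symmetric pair between two distinct concepts, makes the check fail, mirroring the UMLS violations discussed below.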
The less formal semantic spaces, for instance, do not exhibit these properties (one of
the reasons for their lower classification on the scale of knowledge representation) [13]. The
UMLS is-a relationship, for example, does not exhibit reflexivity [21]. Consider ibupro-
fen, a “non-steroidal anti-inflammatory (NSAI) substance”. Within the Metathesaurus (a
subcomponent of the UMLS) “ibuprofen” and “NSAI” correctly exist as part of an is-a rela-
tionship. However, although both “ibuprofen” and “NSAI” are represented as the semantic
type, Pharmacologic Substance, there is no corresponding is-a relationship standing between
Pharmacologic Substance and itself to mirror the is-a relationship between “ibuprofen” and
“NSAI” [21]. Thus, the terms are artificially compressed into the single term Pharmacologic
Substance, which could affect later reasoning.
If a parent-child relationship holds between concepts C1 and C2, antisymmetry ensures
that the reverse relationship cannot hold (or if it does, that those concepts must be equiva-
lent). The UMLS encounters violations of this principle because of its development through
the inclusion of multiple vocabularies.
Violations of transitivity arise from the presence of implicit qualifiers in medical terms
[21]. Consider the UMLS vocabulary.1 The authors of [21] observe:
1See Section 4 for a more in-depth discussion of the UMLS.
The isa relation is found in the UMLS at three different levels: between se-
mantic types in the Semantic Network, between concepts in the Metathesaurus,
and between a concept and a semantic type through the categorization. As-
suming that this isa relation represents the same kind of abstraction at different
levels in the UMLS, transitivity is expected to apply not only between semantic
types [in the Semantic Network], or between Metathesaurus concepts, but also
between semantic types and Metathesaurus concepts. Thus, the semantic type
of any ancestor C1 of a concept C2 is expected to be a supertype of the semantic
type of C2.
Consider the violation of transitivity arising with the concept “hip dislocation”, which is
grouped together with the more specific concept “acquired hip dislocation”. This grouping
has arisen from the more frequent observation of acquired hip dislocations compared to
congenital ones. It requires that “congenital hip dislocation” be a child of “[acquired] hip
dislocation”, which is in turn a child of the semantic type Injury or Poisoning. “Congenital
hip dislocation”, however, is also a child of the semantic type Congenital Abnormality.
Based on transitivity an is-a relationship must also exist between Congenital Abnormality
and Injury or Poisoning. However, only non-taxonomic relationships are postulated, such
as has-result; complicates [21].
Non-specificity of the is-a relationship can give rise to a weaker version such as is-
generally-a, and can also undermine transitivity. Again, using the UMLS for example,
Burgun and Bodenreider cite “Addison’s disease”, which is found in the relationship: “Ad-
dison’s disease” is-a “autoimmune disease”. While true typically, the is-a in this instance
should be an is-generally-a relationship to avoid the exception, “Tuberculous Addison’s dis-
ease”, which results via transitivity in the erroneous relationship “Tuberculous Addison’s
disease” is-a “autoimmune disease” [21].
B.2 Methods of Knowledge Representation
Within a controlled vocabulary or ontology, there are several ways in which conceptual
information can be represented. These variations can affect the specificity of the information
stored, as well as the ability to link these knowledge sources to other systems. By meeting
certain conditions, a meaning representation ensures that it can be used for reasoning tasks.
These conditions include [81]:
Verifiability. The ability to determine the truth value of a meaning representation;
Clarity. Unambiguous data;
Consistency. Ensuring that types with the same meaning are in fact mapped to the same
concept structure;
Expressivity. Ensuring adequate expressivity sufficient for the task at hand.
This section gives a brief introduction to some of the commonly-used meaning represen-
tation methods.
B.2.1 First Order Predicate Calculus (FOPC)
FOPC is a well-understood formalism for representing meaning that meets the conditions
mentioned above. Computationally tractable, it places few restrictions on how concepts
are represented. The language is verifiable, highly expressive, and provides a means for
solid inference (e.g. forward or backward chaining systems) [81]. One of the major pitfalls,
though, is the assumption that the English conjunctives2 such as and, or and if are directly
related to the equivalently-named FOPC terms [81]. This can quickly lead to inconsistency
within the system.
In addition, inference methods such as forward or backward chaining are sound but not
complete. Therefore, it is possible that some valid inferences are not obtained by a system
employing these techniques. Unfortunately, the alternative method, resolution, can be very
computationally-expensive [81] (a tradeoff that may be acceptable in certain, small-scale
applications).
B.2.2 Semantic Networks
The 1970s saw the emergence of semantic networks as an attempt to standardize meaning
representation and step away from the rather ad hoc attempts during the 1960s. Semantic
networks represent the meaning of concepts as defined by the relations held with other
concepts: “in general, semantic networks attempt to impart common sense knowledge to
computers, allowing them to ‘reason’ and draw conclusions about entities by virtue of the
categories to which they have been assigned” [103, page 373]. Concepts are represented
2Similarly for conjunctions in other languages.
by nodes in a graph whose links define the binary relationships held between other nodes.
The standard relationships include ISA (is-a) and AKO (a-kind-of ), which link a class to
its superclass. For instance, ISA(dog, mammal) links dog to its superclass mammal. The
encodings of concepts within a semantic network can be informal or formal; however, they
typically lack any axioms for reasoning.
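A toy illustration of ISA links and property inheritance (the network and property names are invented for illustration):

```python
# A toy semantic network: ISA links map a class to its superclass.
isa = {"dog": "mammal", "mammal": "animal"}
# Properties attached directly to each node.
properties = {"animal": {"alive"}, "mammal": {"has_fur"}, "dog": {"barks"}}

def inherited_properties(node):
    """Collect a node's own properties plus everything inherited by
    following ISA links up to the root of the network."""
    props = set()
    while node is not None:
        props |= properties.get(node, set())
        node = isa.get(node)
    return props

print(sorted(inherited_properties("dog")))  # -> ['alive', 'barks', 'has_fur']
```

The reasoning here is purely structural: "dog" acquires "alive" only because of the chain of ISA links, which is exactly the kind of category-based inference semantic networks were designed to support.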
Although semantic networks allow for an explicit and concise statement of the associations
between concepts, the lack of a standard interpretation, or a standard for the links
joining concepts, limits the usefulness in systems relying on data from multiple sources
(such as might be found in a comprehensive medical system). In partial answer to this
criticism the KL-ONE family of knowledge representation systems was developed based on
the structured inheritance network initially proposed by Ron Brachman in his PhD thesis
[163], which ultimately led to the development of description logics (see Section B.2.4).
B.2.3 Frame-Based Representations
In the 1970s and 1980s, researchers further structured the semantic network representa-
tion into frame-based representations. Frames contain slots that maintain the relations to
other frames and have much in common with objects in the object-oriented programming
paradigm. Each slot is broken down into facets representing not only the value, but in-
formation about that value and/or slot such as default values, constraints or axioms [163].
These systems are capable of a new type of inference, called classification, which allows the
system to automatically determine the appropriate place in an existing hierarchy of objects
for a new object or description.
B.2.4 Description Logic
Inspired by the ambiguities present in “early semantic networks and frames”, description
logics (DLs) are a family of knowledge representation formalisms for the logical represen-
tation of terminology [5, 140]. As mentioned previously, KL-ONE, for example, was an
early DL-based formalism [163]. In general, the following statements, initially put forth by
Ronald Brachman, characterize a DL system ([5]):
1. The building blocks consist of atomic concepts (unary predicates), atomic roles (binary
predicates), and individuals (constants).
2. Expressivity is limited by a small set of constructors for building complex concepts
and roles.
3. Implicit knowledge can be inferred from explicit knowledge through subsumption and
instance relationships.
Essentially DLs identify the domain of discourse, namely the concepts represented by
the terminology, and then generate a world description using those concepts to describe
the properties of objects within the domain [5]. This description is provided using concepts
(classes), roles (properties and relations) and individuals (instances of classes).3 The
strength of the DL formalism comes from its formal, logic-based semantics that allows rea-
soning (the inference of implicit knowledge from explicitly represented knowledge). DLs
support classification of concepts and allow for an algorithmic specification of hierarchical
knowledge and synonymy [140]. Most importantly, though, they balance the tradeoffs be-
tween the rigor of first-order logic (FOL) and expressivity, resulting in a relatively expressive
yet decidable system [15, 31].
The advantages of this expressivity include more accurate and less ambiguous represen-
tation of concept semantics; more advanced inferencing (which can help in maintaining a
consistent system); and better possibilities for querying and aggregation [31]. In the max-
imally expressive case, FOL, tractability and decidability are compromised, while in the
more limited frame-based or DL systems, tractability is maintained at the cost of some
expressivity [105].
One of the strengths of DL formalisms over semantic networks and frame-based systems
is that the user need not explicitly introduce is-a relationships. Instead, the subsumption
and instance relationships are “inferred from the definition of the concepts and the prop-
erties of the individuals”[5, page 45]. DLs have the ability to define such concepts using
“explicitly agreed-upon semantics”, in contrast with “[f]rames, where semantics often de-
pend on interpretation” [31]. Knowledge is modeled using these concept definitions and their
inter-relations or roles. As the authors in [16] observe, “they promise to make available for
formal reasoning tools detailed descriptions for each class, representing through roles the
defining characteristics of these classes”.
In [117] the authors identify five elements of a DL-based ontology:
3“Description Logics”: http://www.openclinical.org/descriptionlogics.html Accessed: February 2006; Updated: October 2004.
1. A hierarchy of elementary categories (atomic concepts);
2. A hierarchy of semantic links (roles) that connect the elementary categories (note that
subsumption relationships are represented as a logical inference of the form “All Bs
are As”);
3. A set of definitions of composite concepts in terms of the elementary concepts, such
as “foot bone” = Bone which isStructuralComponentOf Foot;
4. An axiom base of the form “All X haveLinkTo some Y”, such as “All feet are a division
of some LowerExtremity”;
5. A constraint base determining what concepts can be linked via roles.
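Using standard DL notation (where ⊑ denotes subsumption, ⊓ conjunction, and ∃R.C an existential role restriction), elements 3 and 4 above might be sketched as follows; the role name isDivisionOf is an illustrative rendering of “are a division of”, not a name taken from [117]:

```latex
% Element 3: a composite concept defined from elementary concepts
\mathit{FootBone} \equiv \mathit{Bone} \sqcap
    \exists\, \mathit{isStructuralComponentOf}.\mathit{Foot}

% Element 4: an axiom of the form "All X haveLinkTo some Y"
\mathit{Foot} \sqsubseteq \exists\, \mathit{isDivisionOf}.\mathit{LowerExtremity}
```

Given such definitions, a reasoner can infer by subsumption that every FootBone is a Bone without that relationship ever being asserted explicitly.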
The most widely known medical terminology efforts to incorporate DLs have been GALEN
and SNOMED-CT.
While DLs do readily lend themselves to automatic reasoning and information retrieval,
they “do not systematically ensure compliance with the principles of classification required
if reasoning is to be performed accurately” [16]. Thus, such compliance falls on the ontology
designers.
As a last note of interest, some DL formalisms include an epistemic operator that “makes
it possible to define what is known about a concept” [31], adding another facet to ontology
design.
B.3 Medical Terminologies
In general, medical terminologies can be evaluated according to their domain coverage,
intended use, and the techniques underlying their construction [14]. The use of such termi-
nologies can range from billing to record keeping to a full-fledged reference terminology. A
reference terminology provides a common framework into which other knowledge sources can
be linked using the same mapping schema (e.g. SNOMED-CT). With respect to construc-
tion, terminologies also range from simple enumerated lists such as the ICD to compositional
approaches that rely on maintaining a set of atomic concepts from which all concepts are
generated. Compositional solutions, however, often do not sufficiently capture the essence
of the concepts they represent, such as recognizing that “hepatitis” and “inflammation of
the liver” refer to the same thing [14]. One way to enhance the capabilities of terminologies
has been to incorporate lexical techniques, which take into account the lexical aspects of
concepts as they are expressed. By breaking concepts down in this fashion, it is possible to
unify phrases across existing terminologies, as well as in free-text reports, research papers,
and the World Wide Web. This is the basis of the UMLS [14].
B.3.1 Existing Vocabularies and Ontologies
There exist a handful of wide-coverage vocabularies within medical informatics that have
seen fairly widespread use. These range in formality from semantic spaces to ontologies.
Unified Medical Language System (UMLS)
In 1986, the National Library of Medicine (NLM) began developing the Unified Medical
Language System (UMLS). Headed by Donald Lindberg, M.D., then director of the NLM,
the project aimed to create an integrated vocabulary based upon a unified semantic structure
in anticipation of the growth of electronically available medical information [103, 2]. The UMLS
is intended to transcend terminological variation by unifying the available electronic
medical vocabularies according to a standardized semantic structure and representation of
lexical items.
Three primary knowledge sources comprise the UMLS: the Metathesaurus, the Seman-
tic Network and the SPECIALIST Lexicon. The Metathesaurus is the primary vocabulary
database containing information on concepts obtained from pre-existing vocabularies such
as GALEN and SNOMED-CT [103, 48]. This integration is performed on the source vocabulary's
existing meaning representations and inter-relationships through unification. Representations
are expanded to include any missing essential primitive knowledge, as well as
to instantiate new relationships across the different source vocabularies [14, 103]. In essence,
the multiple trees from the source vocabularies are brought together into one unified graph.
Structurally, the Metathesaurus is arranged so that all words and phrases referring
to the same concept are grouped together and linked to other concepts present in the
Metathesaurus. These links define the “semantic neighbourhood” of a concept, and can be
navigated to obtain the names of needed concepts [103]. In the case of a polysemous word,
each meaning exists as its own concept within the Metathesaurus. The intervening links
then define the relationships between the concepts, such as hierarchy and context.
The Metathesaurus preserves any contextual assumptions present in a source vocabulary.
Such context may be reflected in the hierarchical arrangement of the vocabulary. Consequently,
a single concept may appear in more than one hierarchy within the Metathesaurus
[103].
According to the UMLS fact sheet4, the 2003 edition of the Metathesaurus “includes
900,551 concepts and 2.5 million concept names,” spread over 100 biomedical source
vocabularies in multiple languages.
The Semantic Network provides two high-level hierarchies of semantic types intended to
categorize the concepts within the Metathesaurus, namely “event” and “entity” [21]. All
other semantic types within the hierarchies are directly or indirectly linked back to these
two types [103]. Each concept within the Metathesaurus is assigned one (or more) semantic
types based on the most specific one available in the Semantic Network [103]. More general
than the concept level, semantic types permit a broad categorization of the concept, allowing
some reasoning about the definition of the concept. For instance, at the Semantic-Network
level high-level knowledge such as “drugs treat diseases” is represented, whereas at the
Metathesaurus level the more specific, low-level knowledge such as “aspirin treats fever” is
represented [21]. A relationship at the Semantic Network level, however, will not necessarily
hold for all pairs of low-level concepts at the Metathesaurus level assigned to those semantic
types (see discussion of transitivity in Section B.1.5); for instance, not every drug treats
every disease.
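The two-level arrangement can be illustrated with a minimal sketch; the concept and type entries below are hypothetical toy data, not actual UMLS content:

```python
# Toy two-level model: type-level relations in the Semantic Network versus
# concept-level relations in the Metathesaurus (all data here is illustrative).

semantic_type = {"aspirin": "Drug", "penicillin": "Drug",
                 "fever": "Disease", "tuberculosis": "Disease"}

# Semantic Network: high-level knowledge such as "drugs treat diseases".
type_relations = {("Drug", "Disease"): "treats"}

# Metathesaurus: specific, low-level knowledge such as "aspirin treats fever".
concept_relations = {("aspirin", "fever"): "treats"}

def type_level_treats(a, b):
    """A relation at the type level is merely possible for concept pairs."""
    return type_relations.get((semantic_type[a], semantic_type[b])) == "treats"

def concept_level_treats(a, b):
    """Only explicitly asserted pairs hold at the concept level."""
    return concept_relations.get((a, b)) == "treats"

# "Drugs treat diseases" holds at the type level for both pairs...
print(type_level_treats("aspirin", "fever"))        # True
print(type_level_treats("penicillin", "fever"))     # True
# ...but not every drug treats every disease at the concept level.
print(concept_level_treats("penicillin", "fever"))  # False
```

The sketch makes the asymmetry concrete: the type-level relation licenses a possibility, while only the Metathesaurus assertion makes a specific pair hold.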
Lastly, the SPECIALIST Lexicon is responsible for maintaining the syntactic, morpho-
logical and orthographic information for both the medical and non-medical words found in
English.
Evaluation. The UMLS is a unique, two-level arrangement of the Semantic Network
and the Metathesaurus [14]. On one level, the Semantic Network is a type hierarchy
compatible with the ontology definition laid out above. The Metathesaurus, however, by
its very construction, is bound by different principles. It is often the case that multiple
organizational principles are inherited via the source vocabularies, and that these principles
do not necessarily adhere to the ones cited above. Thus, as Bodenreider concludes, “the
Metathesaurus fails to meet basic ontological requirements” and is better categorized at
the level of semantic space at present [14]. Indeed, “the current level of organization is not
consistent and principled enough to fully support reasoning” [14, page 7].
4http://www.nlm.nih.gov/pubs/factsheets/umls.html Accessed: February 2006; Updated: May 2004.
Frequently cited issues include circularity within the Metathesaurus hierarchy, as well as
categorization inconsistencies and incongruities in the Semantic Network and Metathesaurus
[14]. Much of this can be blamed on the unrestricted adoption of source vocabularies;
without methods to enforce sound ontological principles, the UMLS is limited by the degree
of rigor present in its source vocabularies. These source vocabularies are not restricted
in the types of relationships expressed, nor, as mentioned above, is the nature of these
relationships often defined. As a result, there can be no assumptions about the nature of
the relationships present within the UMLS either [14].
There are other causes for circularity within the Metathesaurus, including underspecified
terms, represented as “unspecified” or “not otherwise specified”. If a term, T, is listed as
“T, unspecified”, it is generally considered a descendant of “T”. This is not the case
in the Metathesaurus, where such terms are clustered together as one, since they are not
considered to differ in meaning [14]. This compression of concepts can result in
a circular relationship.
Other difficulties arise with the presence of implicit knowledge in the source vocabulary
that is incorrectly characterized in the Metathesaurus, or absent altogether. When terms
are erroneously collapsed, this information is lost and further circularity is introduced.
As an evaluation of the UMLS, Bodenreider et al. identify the semantic neighbourhood
of the concept “heart” [14]. In doing so they discovered that, out of 6894 pairs of related
concepts, 65% could be “inferred unambiguously from the Semantic Network”,
22% exhibited multiple semantic links between the two, and 13% revealed an inconsistency
between the Semantic Network and the Metathesaurus. In many cases, they noted that discrepancies
between the two levels were actually an artefact of the representation of abstract
versus concrete semantic types: the relationship of an abstract concept to a concrete concept
would not be captured at the Semantic Network level, thus appearing as a discrepancy
between the Metathesaurus and the Semantic Network. Other reasons pertain to
the assumptions placed on the source vocabulary by the UMLS, such as partonomic relationships.
In the Semantic Network, part-of relationships are assumed to be associative,
which may contrast with the interpretation in a source vocabulary (e.g. one used hierarchically)
[14].
Issues with respect to the principles of partial orderings also arise within the UMLS due
to the asymmetric relationship between the relations in the Semantic Network and those in
the Metathesaurus. See Section B.1.5 on transitivity for a more detailed example.
Systematized Nomenclature of Human and Veterinary Medicine (SNOMED)
The Systematized Nomenclature of Human and Veterinary Medicine, SNOMED, can trace
its history back to the Systematized Nomenclature of Pathology, SNOP, developed in 1965 under
the College of American Pathologists (CAP)5. SNOP was initially developed as a “comprehensive
and flexible tool for pathologists interested in the storage and retrieval of medical data”;
SNOMED was born in the 1970s when Dr. Roger Cote extended SNOP beyond pathology
to a wide range of specialties within medicine. In 2000, SNOMED-RT was introduced as a
concept-based reference terminology. Comprised of multiple hierarchies, it contained over
121,000 concepts that were linked to over 190,000 synonymous terms. Finally, in early 2002,
CAP released SNOMED-CT, formed through the amalgamation of SNOMED-RT and the
Clinical Terms Version 3 (previously known as the Read Codes) [25]. Today, SNOMED-CT
exists as a compositional reference terminology. As of the 2005 edition, it contains over
980,000 English language descriptions (or synonyms), 1.45 million semantic relationships,
and over 360,000 uniquely identified concepts. These are distributed over 18 hierarchies [16].
The mathematical, hierarchical relationships within SNOMED-CT are expressed via a
DL-like formalism [5]. Each class has a unique description consisting of a unique identifier
number, (at least) one parent, as well as a list of synonymous names [16]. In addition,
classes are assigned unique and fully specified names consisting of a regular (English) name
immediately followed by a parenthetical reference to the “primary hierarchy” of the class.
This reference roughly corresponds to one of the top levels of the SNOMED-CT hierarchy
[16]. With the obvious exception of the root, each class is “linked hierarchically to exactly
one top-level class” [16].
Inheritance is represented within SNOMED-CT via is-a relationships between classes,
which are refined through their role fillers [16]. (See the discussion of the inheritance
principle above.)
Evaluation. SNOMED-CT is perhaps best thought of as an ontology with room for
formalization. Its re-design under DL has helped greatly in the degree of formalism and the
capacity for reasoning; however, the presence of errors, as well as non-strict adherence
to ontological principles, has tempered this improvement.
As an example, Ceusters et al. have identified errors of the following nature detected
within SNOMED-CT: human error, technology-induced errors, meaning shifts (from the
transfer of SNOMED-RT to -CT), redundancy, and mistakes attributable to the underlying
ontological theory [24].
5The information in this section was obtained from SNOMED International Historical Perspectives, unless otherwise indicated: http://www.snomed.org/about/perspectives.html Accessed: June 2005; Updated: 2005.
Although Bodenreider et al. acknowledge that SNOMED-CT’s overall coherence does
permit reasoning, they are still able to identify class descriptions that are “minimal or incomplete,
with possible detrimental consequences on inheritance” [16]. Some of these problems
are attributable to taxonomic relations, or to issues of multiple inheritance (over 27% of
classes within SNOMED-CT were found to have more than one parent), where a parent
and child share roles with values that cannot be linked via inheritance (e.g. identity).
Also, the presence of single-child classes does not comply with the ontological and
classification criteria above, indicating the possible presence of errors: if a class has only
one child, it is questionable whether a distinction should even exist between parent and child
[21, 16]. Single-child classes can arise from incompleteness in the hierarchy; from hybrid
classes, where two parent classes intersect and the child may be a single child of one of
them; or from redundant classes, where there is no evidence of refinement or difference
between parent and child, suggesting incomplete descriptions [16]. In approximately 56% of
the single-child cases there was no connection to hybrid classes; thus the child was simply a
refinement of its only parent.
The presence of an overly large number of children may also point to incomplete de-
scriptions leading to a lack of discrimination within the terminology [15].
The General Architecture for Languages Encyclopaedias and Nomenclatures
in Medicine Project (GALEN)
As part of the Advanced Informatics in Medicine (AIM) Program, the Generalized Architecture
for Languages, Encyclopaedias and Nomenclatures in Medicine, otherwise known as
GALEN, was developed in the early 1990s by the European Commission for the representation
of surgical procedures6 [26, 115]. Compositional in design, the project’s primary goal was
the development of an alternative to “static look-up terminologies” in the form of a “ter-
minology server”, a client-server system that mediates access across various ontologies and
information systems, while facilitating the development of new systems [115, 116]. This
6OpenGALEN Homepage: http://www.opengalen.org/technology/galen-faq.html; Accessed: February 2006; Updated: 1999.
makes it possible to reference concepts, pose queries, and translate concepts between repre-
sentations [115]. The terminology server functions also as an interface between applications,
and a facilitator for the development and integration of new concepts.
The basis of the server is the language-independent, ontological concept reference model
known as CORE (the Common Reference Model) [115, 116, 124, 147]. Represented using
GRAIL, a DL-like knowledge-representation formalism developed specifically for medical
terminology, CORE allows the development of medical applications that can successfully
intercommunicate based on a common (meta-) language [5, 115]. The goal of CORE is to
“represent the underlying conceptual model of medicine shared across national boundaries”7. Highly specific knowledge, for instance pertaining to protocols, is not part of the CORE
model itself, but rather employs the model as a foundation for representation. Roughly, if
it is considered non-controversial, widely-accepted knowledge it will be present in CORE,
giving rise to the developer’s slogan “managing diversity, without imposing uniformity”8.
This is rather nicely expressed as “the level of detail two specialists need to talk about
medicine outside their specialty”9.
GALEN works by translating natural, free text into intermediary, simplified conceptual
representations called dissections [147]. This representation is a tradeoff between the free
expression of natural language and highly complex knowledge structures. From here, the
dissection is then translated into its GRAIL representation and classified according to the
CORE model.
The GALEN terminology server is itself not a terminology, but a knowledge representa-
tion infrastructure to support existing terminologies (ranging from coding schemes such as
ICD-10 to more complex taxonomies such as SNOMED), and the development and reclassi-
fication of new terminologies [147]. By authoring the necessary conceptual representations
in GRAIL, it is possible to construct classifications in GALEN, as well as “ensure the main-
tenance, extensibility and coherence of existing ones” [147, page 73].
CORE consists of more than 13,000 elementary concepts, approximately 800 of which
consist of a set of roles, as well as a series of production rules for generating complex concepts
[128, 147]. Approximately 5700 concepts are composite, depending on the maintenance
of other definitions [147]. This generation of concepts is restricted to what is considered
7OpenGALEN Homepage.
8OpenGALEN Homepage.
9OpenGALEN Homepage.
medically sensible; complex concepts can only be generated once they have been sanctioned
by the knowledge constraints within the system. This has the effect of reducing over-
production of nonsensical concepts.
While GALEN and SNOMED both use DL-based formalisms, GRAIL is a more compre-
hensive DL, including additional role constructors, “namely role hierarchies, inverse roles,
role chaining and transitive roles” [31].
GALEN has been used in a variety of projects within Europe, such as the French coding
system, CCAM, particularly because of its inherent support of multiple languages [147, 123].
OpenGALEN. OpenGALEN is a nonprofit organization that provides free and open
access to the GALEN Common Reference Model. Thus, researchers incorporating GALEN
in their systems can modify it freely and distribute their software without worry of licensing
fees. The hope is to set the stage for the development of an Open Source community for
medical terminology.
Evaluation. Unfortunately, the literature lacks an evaluative investigation of the strengths
and weaknesses of the GALEN formalism at this time. Much of the information available is
now out of date, as the initiative is currently in hibernation [125, 126].
Despite the lack of content development, the GALEN formalism is far from dead. Dr.
Jeremy Rogers reports that researchers are continuing to work with GALEN and to expand
its purview beyond surgical procedures (with a particular focus on drug representations).
This includes a comparative analysis of the anatomy content of GALEN and the FMA, and
an analysis of the representation of ICD-10 diseases using GALEN [125, 126].
Work is also underway as part of the World Health Organization’s International Clas-
sification of Health Interventions (ICHI),10 on adapting OpenGALEN to handle ICHI ter-
minology. This involves testing an updated KERMANOG11 e-platform that allows users to
maintain a constant connection to a central service (as opposed to periodic connections for
resynchronisation of the knowledge base). This research will likely result in a new release
of OpenGALEN in 2006 [126].
GALEN and its various components continue to be made available via the OpenGALEN
website12, as well as OpenKnoME, the GRAIL knowledge engineering environment produced
10For more information refer to http://www.who.int/classifications/ichi/en/ Accessed: February 2006.
11KERMANOG is a Dutch company that has developed applications for use with GALEN, including the only commercially available terminology server for GALEN, and the Classification Workbench, a toolset for the development and maintenance of classification schemes, such as the work with CCAM.
12OpenGALEN Homepage.
by topThing13.
LinKBase
LinKBase is a large-scale, proprietary, medical ontology created by Language and Com-
puting (L&C). Unlike other ontologies, LinKBase was developed using LinKFactory, L&C’s
own ontology-authoring environment [25]. LinKBase presently contains over 1.5 million
language-independent medical and general-purpose concepts (such as “human body”) and
particular instances (such as “United States”), associated with more than 4 million terms
in several natural languages [25, 27]. The concepts and instances are linked via a semantic
network containing approximately 480 link types [25]. The connections within the seman-
tic network constitute a formal framework derived from logically axiomatized theories in
mereology and topology, augmented with causality and time, and adhering to good rules of
classification [25, 24, 27]. In fact, only about 15% of the total relationships in LinKBase
are subsumption-based, with the remaining 85% comprising richer structures than are possible
using DL formalisms [24, 27].
LinKBase has recently been re-engineered in adherence with the theory of granular
partitions – the notion of representing knowledge as grids of labeled cells [9] – and basic
formal ontology, which stipulates formal distinctions between the relationship of universals
and particulars [25, 137].
The authors of LinKBase draw a subtle distinction between concepts and the entities
they represent. Within the context of LinKBase, concepts abstract the necessary features
of natural language. They do not represent abstractions of how humans think but rather
the actual real-world entities. Concepts that do refer to people’s conceptions are called
meta-entities, included to facilitate mappings to third-party ontologies [24].
In addition to LinKBase, L&C also maintains LinKFactory, a knowledge-engineering
system, and TeSSI, an engine for semantic indexing, retrieval and extraction.
Evaluation. Unfortunately, there exist no critical analyses of the LinKBase system in
the literature, which is likely attributable to the proprietary nature of the system.
In [114], it is observed, however, that: “until proven otherwise, the fact that so many
commercial vendors have chosen not to use standard terminologies must be taken as strong
presumptive evidence that those terminologies are seriously flawed from the point of view of
13http://www.topthing.com/ Accessed: February 2006.
practical use in scalable systems” [114, page 8]. Still, this does little to alleviate the problem
of access to such ontologies for the purposes of academic research.
B.3.2 Issues in Medical Informatics/Ontologies in General
Semantic Challenges
Ontological design is a challenging task. At the semantic level, there are considerations
affecting the expressivity and the scale of the ontology. These include adequate coverage of
the necessary medical and non-medical concepts; facilitation of consistent ontology growth
[21]; and sufficient representational granularity. The granularity needs will vary depending
on the task at hand: too narrow and complexity grows, while too broad and insufficient
distinctions are drawn between concepts [5, 80]. Decisions regarding complexity can be in-
fluenced by a range of applications, from payment and administration details, to descriptions
of symptoms and procedures [5].
Other considerations involve concept definitions. For instance, in the case of polysemy,
orthographically equivalent words can represent multiple senses14. The ontology must pos-
sess sufficient discriminations to enable a computer to resolve cases of ambiguity [15, 112].
Unfortunately, if multiple senses are represented via multiple types, they can have a
detrimental effect on parsing efficiency: more semantic types mean more choices for the
parser and, consequently, a more complex task [80].
Combining semantic and syntactic information can assist in reducing ambiguity in both
the potential syntactic types of a lexeme and the potential semantic types. For instance,
Johnson observes that without syntactic knowledge to differentiate between adjective and
verb positions in a sentence, the word “left” in the sentences, “opacity seen in left lung”
and “patient left hospital”, would be considered ambiguous by a semantic parser [80].
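A minimal sketch of this idea follows; the tiny tag rule and sense inventory below are illustrative assumptions, not Johnson's actual parser:

```python
# Toy disambiguation of the polysemous word "left" using syntactic position:
# after a subject noun it is a verb (past tense of "leave"); before a noun
# such as "lung" it behaves as an adjective. All rules here are illustrative.

SENSES = {("left", "ADJ"): "side:left", ("left", "VERB"): "depart:past"}

def tag_left(prev_word: str) -> str:
    """Crude positional heuristic standing in for real syntactic knowledge."""
    return "VERB" if prev_word in {"patient", "he", "she"} else "ADJ"

def sense_of_left(sentence: str) -> str:
    words = sentence.split()
    i = words.index("left")
    pos = tag_left(words[i - 1]) if i > 0 else "ADJ"
    return SENSES[("left", pos)]

print(sense_of_left("opacity seen in left lung"))  # side:left
print(sense_of_left("patient left hospital"))      # depart:past
```

Even this crude rule shows how a single bit of syntactic context collapses the ambiguity that a purely semantic parser would face.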
In addition, there must be “a clear differentiation at all levels between ‘false’, and ‘not
done’ or ‘unknown’ ” within the ontology [5]. This can take the form of explicit negation,
or negation implied via taxonomic relationships that represent only what is true (and
say nothing of the truth or falsity of what is not mentioned). In some ontologies, such
as GALEN, negation is not expressed explicitly, but simulated through modifiers such as
14For example, “bank” is a polysemous word with multiple senses: The financial institution – “I depositedmoney in the bank”; A sloping surface – “I went down to the river bank”; A collection of items – “I walkedover to the bank of machines”.
“presence/absence” and “done/not-done” [117]. Such restrictive qualifiers, however, limit a
system’s capacity to express vague concepts.
The degree of formalization is important within an ontology. As the size grows, the
ontology becomes increasingly susceptible to inconsistency without some formal system in
place, such as an axiom base. If the axiom system present is too formal, however, it may
become extremely difficult to understand and maintain the knowledge base [80].
The extent to which implied information is represented is also significant [117]. Rector
provides the example of the procedure, ‘Insertion of pins in femur’, which he points out
should imply a ‘fixation procedure’. Is the classification system responsible for automatically
deriving this implication, and if so, to what extent should implication continue? To allow
one implication is to open the door for a potential series of cascading implications which
could have detrimental consequences [117].
Structural Challenges
From a more formal perspective, a number of issues arise with respect to the structure of
the ontology itself. These include redundancy, is-a overloading, incomplete descriptions,
ontology growth, and the limits of taxonomic relationships.
Redundancy. Redundancy arises whenever there exists more than one method to encode
the same concept. For example, the concept “ruptured ovarian cyst” can be derived by
the composition of “ruptured ovary” and “cyst”, or “ruptured cyst” and “ovary” [140].
Redundancy can exist at two levels [30]: At the term level it serves a useful purpose,
allowing multiple expressions to be mapped to the same concept while the coding or
identifier of the underlying concept remains unique. Redundancy at the identifier level,
however, is generally considered problematic as it can lead to problems in inference
when there is no underlying unique representation of a concept [30]. Such redundancy
tends to arise as a result of ontological expansion. In some instances, such as in the
integration of vocabularies in the UMLS, the presence of redundancy indicates that
a concept has been asserted by multiple sources, which can be interpreted as a high
probability that that concept is semantically valid [17]. Redundancy can also allow
direct connections between important concepts that might otherwise be very distant
in the ontology. Such dependence, nevertheless, is rarely explicitly noted or rigorously
maintained, and subsequent updates can ultimately lead to inconsistency [166].
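A small sketch (with hypothetical terms and concept identifiers) of the distinction between the two levels of redundancy:

```python
# Term-level redundancy is useful: many surface forms map to one unique
# concept identifier. (Identifiers here are invented for illustration.)
term_to_id = {
    "hepatitis": "CONCEPT-042",
    "inflammation of the liver": "CONCEPT-042",  # same concept, same id
}

# Identifier-level redundancy is problematic: two identifiers for what is
# semantically one concept, e.g. two derivations of "ruptured ovarian cyst".
concepts = {
    "RO-CYST-1": ("ruptured ovary", "cyst"),
    "RO-CYST-2": ("ruptured cyst", "ovary"),  # no shared unique representation
}

# An inference engine keyed on identifiers resolves the first case cleanly
# but treats the second as two distinct concepts:
ids = {term_to_id["hepatitis"], term_to_id["inflammation of the liver"]}
print(len(ids))        # 1 -- term-level redundancy collapses to one concept
print(len(concepts))   # 2 -- identifier-level redundancy does not
```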
Is-a Overloading. It is often the case that the relationships within a taxonomy are not
constrained to strictly is-a or taxonomic relationships. Such “uncontrolled use” is re-
ferred to as is-a overloading and is often associated with multiple inheritance (allowing
subclasses to have more than one parent). It can result in subsumption errors brought
about by the different semantics of the relationships within the ontology [161].
Incomplete Descriptions. If a concept within a terminology lacks a complete description,
it may be incorrectly placed in the ontology, or may not be accurately referenced.
Ontology Growth. The structure and consistency within an ontology are sensitive to growth
and must be carefully maintained.
Limits of Taxonomic Relationships. More complicated relationships may be needed to
accurately capture the essence of some medical terms. As a result, more complicated
reasoning engines are needed to handle the variety of relationships, as well as strict
formalization to ensure that no incorrect assumptions are made about the nature of
these relationships.
Other Challenges
Aside from the semantic and structural concerns, attention must be paid to the application
itself. Johnson describes application independence as a crucial factor in a successful
semantic lexicon [80, 49]. A well-constructed ontology should be accessible to a variety of
applications. The challenge is ensuring that it draws just enough distinctions to remain
efficient, while ensuring that the output is mappable to the standardized vocabularies and
databases. As an alternative, Johnson suggests creating an intermediate, or meta-, repre-
sentation that could be mapped into any target vocabulary or database.
Another interesting concern related to application independence is the need to ensure
that the MLP system is not too heavily dependent on a particular ontology. If it is, then
changes in the ontology may undesirably result in the need for significant changes to the
MLP system.
Choosing the appropriate level of formality and coverage for an application is also
important. According to Johnson, it is parser efficiency that differentiates a semantic lexicon
from a general-purpose, controlled vocabulary [80]. In a controlled vocabulary, as much
information as possible is provided for each entry; such information is not necessarily
required for parsing and can therefore slow down the system. It is, however, often necessary
for other applications. Thus, while researchers may choose not to parse the input using
such a detailed vocabulary, they may ultimately need to map their output into it.
B.4 Summary
This appendix has introduced the term “ontology” with respect to medical informatics. The
principles necessary for effective ontologies have been outlined, as well as various means
of knowledge representation. Lastly, the major medical vocabularies in common use today
have been discussed.
Overall, the limits of existing ontologies with respect to breadth of coverage, the
tradeoff between expressivity and decidability, and the desire for interoperability
between various applications suggest that a single, multipurpose ontology may be too much
to ask. A more attainable goal is the construction of multi-ontological systems that adhere
to strict ontological principles and standards, allowing existing ontologies to
communicate via standard protocols and achieve greater coverage without intractability.
Appendix C
All Results
All results shown here were collected by a manual analysis of the output. Corpus Size: Training
is the number of reports in the training set, while Corpus Size: Test is the number
of test cases on which the system was run. In all instances, unless indicated otherwise,
Corpus Size: Training = 2751 and Corpus Size: Test = 30.
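For reference, the Recall, Precision, and f-Measure columns in the tables that follow use the standard definitions, which can be sketched as below. The sample counts in the example are invented for illustration and are not drawn from the thesis experiments.

```python
def precision_recall_f(true_pos, false_pos, false_neg):
    """Standard definitions behind the Recall/Precision/f-Measure columns."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    # f-measure is the harmonic mean of precision and recall
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Illustrative counts only (not taken from the experiments):
p, r, f = precision_recall_f(true_pos=19, false_pos=31, false_neg=19)
print(f"precision={p:.0%} recall={r:.0%} f-measure={f:.0%}")
```

Note that a heuristic can score high recall with low precision (flagging nearly everything), which is why both are reported alongside their harmonic mean.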
Table C.1: Co-occurrence analysis with windowsize=3, threshold=0.

                            Accuracy                      Corpus Size
Report Type        Recall  Precision  f-Measure       Training  Test
All                  83%      26%        40%            2751     20
Findings only        88%      31%        46%            2751     20
Impressions only     96%      15%        26%            2751     20
Spine only           77%      35%        48%             891     10
Table C.2: Co-occurrence analysis on entire error set, windowsize=collocation

          Accuracy
Recall  Precision  f-Measure    Threshold
 57%       19%        28%        0
 57%       19%        29%        5E-06
 59%        9%        16%        5E-04
Table C.3: Co-occurrence analysis on non-stop-words only, windowsize=collocation

          Accuracy
Recall  Precision  f-Measure    Threshold
 96%       15%        26%        0
 96%       16%        27%        5E-06
 96%        7%        14%        5E-04
Table C.4: Co-occurrence analysis on entire error set, windowsize=1

          Accuracy
Recall  Precision  f-Measure    Threshold
 50%       36%        42%        0
 50%       38%        43%        5E-06
 57%       11%        19%        5E-04
Table C.5: Co-occurrence analysis on non-stop-words only, windowsize=1

          Accuracy
Recall  Precision  f-Measure    Threshold
 86%       29%        44%        0
 86%       31%        46%        5E-06
 86%        8%        15%        5E-04
Table C.6: Co-occurrence analysis on entire error set, windowsize=10

          Accuracy
Recall  Precision  f-Measure    Threshold
 30%       45%        36%        0
 42%       21%        28%        5E-06
 94%        4%         7%        5E-04
Table C.7: Co-occurrence analysis on non-stop-words only, windowsize=10

          Accuracy
Recall  Precision  f-Measure    Threshold
 66%       47%        82%        0
 82%       20%         2%        5E-06
 82%       32%         3%        5E-04
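The co-occurrence heuristic evaluated in Tables C.1 through C.7 can be sketched as follows: pair statistics are gathered from the training corpus, and a word in a transcribed report is flagged as a likely error when its co-occurrence support from its neighbours, within the given window size, falls at or below the threshold. The counting scheme and scoring below are a hypothetical simplification for illustration, not the thesis's actual implementation; the toy corpus is invented.

```python
from collections import Counter

def train_cooccurrence(corpus, window):
    """Count word pairs appearing within `window` words of each other."""
    pairs = Counter()
    for report in corpus:
        words = report.split()
        for i, w in enumerate(words):
            for v in words[i + 1 : i + 1 + window]:
                pairs[frozenset((w, v))] += 1
    return pairs

def flag_errors(report, pairs, window, threshold):
    """Flag words whose co-occurrence support from neighbours is <= threshold."""
    words = report.split()
    flagged = []
    for i, w in enumerate(words):
        neighbours = words[max(0, i - window) : i] + words[i + 1 : i + 1 + window]
        support = sum(pairs[frozenset((w, v))] for v in neighbours)
        if support <= threshold:
            flagged.append(w)
    return flagged

# Invented toy corpus standing in for the 2751 training reports:
corpus = ["no acute disease", "no acute fracture", "acute disease noted"]
pairs = train_cooccurrence(corpus, window=3)
print(flag_errors("no acute decease", pairs, window=3, threshold=0))  # → ['decease']
```

Raising the threshold flags more words, which is consistent with the recall/precision trade-off visible across the rows of the tables above.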
Table C.8: PMI analysis on entire error set, windowsize=collocation

          Accuracy
Recall  Precision  f-Measure    Threshold
 65%       16%        26%        10^0
 65%       15%        25%        10^1
 68%       14%        24%        10^2
 70%       13%        21%        10^3
Table C.9: PMI analysis on non-stop-words only, windowsize=collocation

          Accuracy
Recall  Precision  f-Measure    Threshold
 96%       12%        21%        10^0
 96%       11%        20%        10^1
 96%       10%        17%        10^2
 96%        8%        15%        10^3
Table C.10: PMI analysis on entire error set, windowsize=1

          Accuracy
Recall  Precision  f-Measure    Threshold
 47%       19%        27%        10^0
 48%       16%        24%        10^1
 52%       12%        20%        10^2
 51%        9%        16%        10^3
Table C.11: PMI analysis on non-stop-words only, windowsize=1

          Accuracy
Recall  Precision  f-Measure    Threshold
 87%       17%        28%        10^0
 87%       14%        24%        10^1
 91%       10%        19%        10^2
 91%        8%        15%        10^3
Table C.12: PMI analysis on entire error set, windowsize=10

          Accuracy
Recall  Precision  f-Measure    Threshold
 34%       34%        34%        10^0
 37%       31%        34%        10^1
 39%       24%        30%        10^2
 43%       19%        26%        10^3
Table C.13: PMI analysis on non-stop-words only, windowsize=10

          Accuracy
Recall  Precision  f-Measure    Threshold
 73%       35%        47%        10^0
 77%       31%        44%        10^1
 82%       24%        37%        10^2
 88%       19%        41%        10^3
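Pointwise mutual information, the second statistical heuristic, scores a word pair by how much more often the two words co-occur than chance would predict. A minimal sketch of the standard log-based definition is given below; the probability estimates and counts are invented for illustration, and the thesis's own estimator and thresholding scheme may differ.

```python
import math

def pmi(pair_count, count_x, count_y, total):
    """PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) )."""
    p_xy = pair_count / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

# Invented counts: the pair occurs 10 times in 1000 windows, ten times
# more often than chance (0.02 * 0.05 = 0.001) predicts, so PMI = log2(10).
print(round(pmi(10, 20, 50, 1000), 2))  # → 3.32
```

A word is then flagged as a likely recognition error when its score against its context falls below a chosen threshold, so raising the threshold flags more words and trades precision for recall, as in the tables above.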
Table C.14: Combined heuristics on all errors based upon top f-measure.

                          Accuracy                     Corpus Size
Heuristic        Recall  Precision  f-Measure      Training  Test
Best Co-Occur      50%      38%        43%           2751     30
Best PMI           35%      34%        34%           2751     30
Parser             29%      34%        32%            n/a     30
Hybrid             74%      46%        57%           2751     30
Table C.15: Combined heuristics on all errors based upon top recall score.

                          Accuracy                     Corpus Size
Heuristic        Recall  Precision  f-Measure      Training  Test
Best Co-Occur      59%       9%        16%           2751     30
Best PMI           70%      13%        21%           2751     30
Parser             29%      34%        32%            n/a     30
Hybrid             83%      13%        22%           2751     30
Bibliography
[1] J. F. Allen, B. W. Miller, E. K. Ringger, and T. Sikorski. A robust system for natural spoken dialogue. In Proceedings of the 34th Annual Meeting of the ACL, pages 62–70, 1996.
[2] R. Altman. AI in medicine: The spectrum of challenges from managed care to molecular medicine. AI Magazine, 20(3):67–77, 1999.
[3] A. R. Aronson. Meta-map: mapping text to the UMLS Metathesaurus, 1996. Electronic document. Date of publication: March 6, 1996. Date retrieved: January 14, 2006.
[4] A. R. Aronson. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In Proceedings of the AMIA Symposium, pages 17–21, 2001.
[5] Franz Baader, Diego Calvanese, Deborah McGuinness, Daniele Nardi, and Peter F. Patel-Schneider. The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, 2003.
[6] R. Barrows, M. Busuioc, and C. Friedman. Limited parsing of notational text visit notes: Ad-hoc vs. NLP approaches. In Proceedings of AMIA Annual Symposium, pages 50–55, 2000.
[7] M. Carmen Benitez, Antonio Rubio, Pedro Garcia, and Jesus Diaz-Verdejo. Word verification using confidence measures in speech recognition. In Proceedings of ICSLP, pages 1082–1085, November 1998.
[8] D. S. Bhachu. Introduction to PACS. In Consumers Association. Medical Devices Agency (MDA), March 2002.
[9] T. Bittner and B. Smith. A theory of granular partitions. In Matthew Duckham, Michael F. Goodchild, and Michael F. Worboys, editors, Foundations of Geographic Information Science, pages 117–151. Taylor and Francis Books, London, 2003.
[10] Philippe Blache. Property grammars: A fully constraint-based theory. In H. Christiansen, P. R. Skadhauge, and J. Villadsen, editors, Constraint Solving and Language Processing, volume 3438 of Lecture Notes in Artificial Intelligence. Springer, 2005.
[11] Alan W. Black, Ralf D. Brown, Robert Frederking, Rita Singh, John Moody, and Eric Steinbrecher. TONGUES: Rapid development of a speech-to-speech translation system. In Proceedings of the Second International Conference on Human Language Technology Research (HLT), pages 183–186, March 2002.
[12] Wayne D. Blizard. Multiset theory. Notre Dame Journal of Formal Logic, 30(1):36–66, 1989.
[13] Olivier Bodenreider. Circular hierarchical relationships in the UMLS: Etiology, diagnosis, treatment, complications and prevention. In Proceedings of AMIA Annual Symposium, pages 57–61, 2001.
[14] Olivier Bodenreider. Medical ontology research. Technical report, Lister Hill National Center for Biomedical Communications, 2001.
[15] Olivier Bodenreider, Joyce A. Mitchell, and Alexa T. McCray. Biomedical ontologies. Proceedings of the 2003 Pacific Symposium on Biocomputing, 8:562–564, 2003. Session introduction.
[16] Olivier Bodenreider, Barry Smith, Anand Kumar, and Anita Burgun. Investigating subsumption in DL-based terminologies: A case study in SNOMED CT. In Proceedings of the First International Workshop on Formal Biomedical Knowledge Representation (KR-MED 2004), pages 12–20, 2004.
[17] Olivier Bodenreider and Songmao Zhang. Semantic integration in biomedicine. In Proceedings of the Semantic Integration Workshop at the Second International Semantic Web Conference (ISWC 2003), pages 156–157, 2003.
[18] S. M. Borowitz. Computer-based speech recognition as an alternative to medical transcription. Journal of the American Medical Informatics Association, 8:101–102, 2001.
[19] J. Bouaud, B. Bachimont, J. Charlet, and P. Zweigenbaum. Acquisition and structuring of an ontology within conceptual graphs. In Proceedings of ICCS'94 Workshop on Knowledge Acquisition using Conceptual Graph Theory, pages 1–25, 1994.
[20] C. Bousquet, M. C. Jaulent, G. Chatellier, and P. Degoulet. Using semantic distance for the efficient coding of medical concepts. In Proceedings of AMIA Annual Symposium, pages 96–100, 2000.
[21] Anita Burgun and Olivier Bodenreider. Aspects of the taxonomic relation in the biomedical domain. In International Conference on Formal Ontology in Information Systems, pages 222–233. ACM, October 17–19, 2001.
[22] Anita Burgun and Olivier Bodenreider. Comparing terms, concepts and semantic classes in WordNet and the Unified Medical Language System. In Proceedings of the NAACL'2001 Workshop, “WordNet and Other Lexical Resources: Applications, Extensions and Customizations”, pages 77–82. ACM, 2001.
[23] J. E. Caviedes and J. J. Cimino. Towards the development of a conceptual distance metric for the UMLS. Journal of Biomedical Informatics, 37:77–85, 2004.
[24] W. Ceusters, B. Smith, A. Kumar, and C. Dhaen. Mistakes in medical ontologies: Where do they come from and how can they be detected? In D. M. Pisanelli, editor, Ontologies in Medicine: Proceedings of the Workshop on Medical Ontologies, pages 16–18, Amsterdam, October 2004. IOS Press.
[25] W. Ceusters, B. Smith, A. Kumar, and C. Dhaen. Ontology-based error detection in SNOMED-CT. In Proceedings of MEDINFO, pages 482–486, 2004.
[26] Werner Ceusters, Jeremy Rogers, Fabrizio Consorti, and Angelo Rossi-Mori. Syntactic-semantic tagging as a mediator between linguistic representations and formal models: an exercise in linking SNOMED to GALEN. Artificial Intelligence in Medicine, 15:5–23, 1999.
[27] Werner Ceusters, Barry Smith, and Jim Flanagan. Ontology and medical terminology: Why description logics are not enough. In Towards an Electronic Patient Record (TEPR 2003), Boston, MA, May 10–14, 2003. Medical Records Institute (CD-ROM publication).
[28] L. Christensen, P. Haug, and M. Fiszman. MPLUS: A probabilistic medical understanding system. In Proceedings of the Workshop on Natural Language Processing in the Biomedical Domain, pages 29–36, 2002.
[29] Henning Christiansen. CHR grammars. Theory and Practice of Logic Programming, 5(4):467–501, 2005.
[30] J. J. Cimino. Desiderata for controlled medical vocabularies in the twenty-first century. Methods of Information in Medicine, 37:394–403, 1998.
[31] R. Cornet and A. Abu-Hanna. Usability of expressive description logics – a case study in UMLS. In Proceedings of the AMIA Annual Symposium, pages 180–184, 2002.
[32] Stephen Cox and Srinandan Dasmahapatra. A semantically-based confidence measure for speech recognition. In Proceedings of the Int. Conf. on Spoken Language Processing, volume 4, pages 206–209, Beijing, China, 2000.
[33] Stephen Cox and Srinandan Dasmahapatra. High-level approaches to confidence measure estimation in speech recognition. IEEE Transactions on Speech and Audio Processing, 10(7):460–471, Oct 2002.
[34] Christopher Culy and S. Z. Riehemann. The limits of N-gram translation evaluation metrics. In MT Summit IX, pages 71–78, New Orleans, USA, September 2003.
[35] Veronica Dahl and Philippe Blache. Directly executable constraint based grammars. In Proc. Journees Francophones de Programmation en Logique avec Contraintes, Angers, France, June 2004.
[36] Veronica Dahl and Kimberly Voll. Concept formation rules: An executable cognitive model of knowledge construction. In Proceedings of the First International Workshop on Natural Language Understanding and Cognitive Sciences, pages 28–36, Porto, Portugal, April 2004.
[37] Srinandan Dasmahapatra and Stephen Cox. Meta-models for confidence estimation in speech recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, volume 3, pages 1815–1818, June 2000.
[38] E. Devine, S. Gaehde, and A. Curtis. Comparative evaluation of three continuous speech recognition software packages in the generation of medical reports. Journal of the American Medical Informatics Association, 7:462–468, 2000.
[39] A. Fall. An abstract framework for taxonomic encoding. In Proceedings of the First International Symposium on Knowledge Retrieval, Use and Storage for Efficiency, pages 162–167, 1995.
[40] E. Filisko and S. Seneff. Error detection and recovery in spoken dialogue systems. In Proceedings of the HLT-NAACL 2004 Workshop on Spoken Language Understanding for Conversational Systems, pages 31–38, Boston, MA, May 2004.
[41] M. Fiszman, W. Chapman, S. Evans, and P. Haug. Automatic identification of pneumonia related concepts on chest x-ray reports. In Proceedings of AMIA Annual Symposium, pages 67–71, 1999.
[42] M. Fiszman and P. Haug. Using medical language processing to support real-time evaluation of pneumonia guidelines. In Proceedings of AMIA Annual Symposium, pages 235–239, 2000.
[43] Bruce Forster. Private interview. Canada Diagnostic Centre, Vancouver, BC, May 23, 2003.
[44] Bruce Forster. Private interview. Canada Diagnostic Centre, Vancouver, BC, May 24, 2005.
[45] C. Friedman. A broad-coverage natural language processing system. In Proceedings of AMIA Annual Symposium, pages 270–274, 2000.
[46] C. Friedman, P. O. Alderson, J. H. Austin, J. J. Cimino, and S. Johnson. A general natural-language text processor for clinical radiology. Journal of the American Medical Informatics Association, 1(2):161–174, 1994.
[47] C. Friedman and G. Hripcsak. Natural language processing and its future in medicine: Can computers make sense out of natural language text? Academic Medicine: Journal of the Association of American Medical Colleges, 74(8):890–895, 1999.
[48] C. Friedman, S. Johnson, B. Forman, and J. Starren. Architectural requirements for a multipurpose natural language processor in the clinical environment. In Proceedings of AMIA Annual Symposium, pages 347–351, 1995.
[49] C. Friedman, L. Shagina, Y. Lussier, and G. Hripcsak. Automated encoding of clinical documents based on natural language processing. Journal of the American Medical Informatics Association, 11(5):392–402, 2004.
[50] Thom W. Fruhwirth. Constraint handling rules. In Constraint Programming, pages 90–107, 1994.
[51] Thom W. Fruhwirth. Theory and practice of constraint handling rules. Journal of Logic Programming, Special Issue on Constraint Logic Programming, 37(1–3):95–138, October 1998.
[52] B. Gale, Y. Safriel, and A. Lukban. Radiology report production times: Voice recognition vs. transcription. Radiology Management, 23:18–22, 2001.
[53] W. Gale and K. Church. What's wrong with adding one? In N. Oostdijk and P. de Haan, editors, Corpus-Based Research into Language: In honour of Jan Aarts, pages 189–200. Rodopi, Amsterdam, 1994.
[54] L. Gillick, Y. Ito, and J. Young. A probabilistic approach to confidence measure estimation and evaluation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 879–882, April 1997.
[55] L. Gleitman. The structural sources of verb meaning. Language Acquisition, 1:3–55, 1990.
[56] J. Greenspan. Introduction to XML, 1998. http://hotwired.lycos.com/webmonkey/98/41/index1a.html?tw=authoring Electronic publication. Date retrieved: January 14, 2006.
[57] J. Grimshaw. Form, function, and the language acquisition device. In C. Baker and J. McCarthy, editors, The Logical Problem of Language Acquisition. MIT Press, Cambridge, MA, 1981.
[58] T. R. Gruber. A translation approach to portable ontology specifications. Knowledge Acquisition, 5:199–220, 1993.
[59] A. Gunawardana, A. Hon, and H. W. Jiang. Word-based acoustic confidence measures for large-vocabulary speech recognition. In Proceedings of the International Conference on Logic Programming, volume 3, pages 791–794, 1998.
[60] U. Hahn, M. Romacker, and S. Schulz. Why discourse structures in medical reports matter for the validity of automatically generated text knowledge bases. In Proceedings of MEDINFO, pages 633–638, 1998.
[61] U. Hahn, M. Romacker, and S. Schulz. Discourse structures in medical reports – Watch out! The generation of referentially coherent and valid text knowledge bases in the MEDsyndikate system. International Journal of Medical Informatics, 53(1):1–28, 1999.
[62] U. Hahn, M. Romacker, and S. Schulz. MEDsyndikate – design considerations for an ontology-based medical text understanding system. In Proceedings of AMIA Annual Symposium, pages 330–334, 2000.
[63] A. Happe, B. Pouliquen, A. Burgun, M. Cuggia, and P. Le Beux. Automatic concept extraction from spoken medical reports. International Journal of Medical Informatics, 70(2–3):255–263, July 2003.
[64] Robert Harnish. Minds, Brains, Computers: An Historical Introduction to the Foundations of Cognitive Science. Blackwell Publishers, 2002.
[65] Z. Harris. Mathematical Structures of Language. Wiley Interscience, 1968.
[66] Kaichiro Hatazaki, Jun Noguchi, Akitoshi Okumura, Kazunaga Yoshida, and Takao Watanabe. INTERTALKER: an experimental automatic interpretation system using conceptual representation. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), Oct 1992.
[67] P. Haug, S. Koehler, L. Lau, P. Wang, R. Rocha, and S. Huff. Experience with a mixed semantic/syntactic parser. In Proceedings of the Annual AMIA Symposium, pages 284–288, 1995.
[68] P. Haug, D. Ranum, and P. Frederick. Computerized extraction of coded findings from free-text radiology reports. Radiology, 174:543–548, 1990.
[69] D. Hayt and S. Alexander. The pros and cons of implementing PACS and speech recognition systems. Journal of Digital Imaging, 14(3):149–157, 2001.
[70] T. J. Hazen and I. Bazzi. A comparison and combination of methods for OOV word detection and word confidence scoring. In Proceedings of the International Conference on Acoustics IC, volume 1, pages 397–400, May 2001.
[71] T. J. Hazen, J. Polifroni, and S. Seneff. Recognition confidence scoring for use in speech understanding systems. Computer Speech and Language, 16(1):49–67, 2002.
[72] S. Horii, R. Redfern, H. Kundel, and C. Nodine. PACS technologies and reliability: Are we making things better or worse? In Proceedings of SPIE, volume 4685, pages 16–24, 2002.
[73] S. C. Horii. Primer on computers and information technology. Part four: A nontechnical introduction to DICOM. Radiographics, 17:1297–1309, 1997.
[74] Health Level Seven Inc. Health Level Seven home page. Accessed: February 2006. Last known update: 2006. http://www.hl7.org/.
[75] Diana Inkpen and Alain Desilets. Semantic similarity for detecting recognition errors in automatic speech transcripts. In Proceedings of EMNLP, pages 49–56, Vancouver, British Columbia, Canada, October 2005. Association for Computational Linguistics.
[76] N. Jain and C. Friedman. Identification of findings suspicious for breast cancer based on natural language processing of mammogram reports. In Proceedings of AMIA Annual Fall Symposium, pages 29–33, 1997.
[77] M. Jeong, B. Kim, and G. Lee. Using higher-level linguistic knowledge for speech recognition error correction in a spoken Q/A dialog. In Proceedings of the HLT-NAACL Special Workshop on Higher-Level Linguistic Information for Speech Processing, pages 48–55, 2004.
[78] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference on Research in Computational Linguistics, pages 19–33, 1997.
[79] D. Johnson, R. Taira, A. Cardenas, and D. Aberle. Extracting information from free text radiology reports. Journal of Digital Libraries, 1:297–308, 1997.
[80] S. Johnson. A semantic lexicon for medical language processing. Journal of the American Medical Informatics Association, 6(3):205–218, 1999.
[81] D. Jurafsky and J. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall Inc, 2000.
[82] Satoshi Kaki, Eiichiro Sumita, and Hitoshi Iida. A method for correcting errors in speech recognition using the statistical features of character co-occurrence. In ACL-COLING, pages 653–657, 1998.
[83] K. Kanal, N. Hangiandreou, A. Sykes, H. Eklund, P. Araoz, J. Leon, and B. Erickson. Initial evaluation of a continuous speech recognition program for radiology. Journal of Digital Imaging, 14(1):30–37, March 2001.
[84] Y. W. Kim and J. H. Kim. A model of knowledge based information retrieval with hierarchical concept graph. Journal of Documentation, 2:113–137, 1990.
[85] Myoung-Wan Koo, Il-Hyun Sohn, Woo-Sung Kim, and Du-Seong Chang. KT-STS: A speech translation system for hotel reservation and a continuous speech recognition system for speech translation. In Proceedings of Eurospeech, pages 1227–1231, 1995.
[86] H. Kuhn. Speech recognition and the frequency of recently used words: A modified Markov model for natural language. In Proceedings of the 12th Conference on Computational Linguistics, volume 1, pages 348–350, Budapest, Hungary, 1988.
[87] K. Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys, 24(4):377–439, 1992.
[88] B. Landau and L. Gleitman. Language and Experience: Evidence from Blind Children. Harvard University Press, Cambridge, MA, 1985.
[89] A. Lavie, A. Waibel, L. Levin, M. Finke, D. Gates, M. Gavalda, T. Zeppenfeld, and P. Zhan. JANUS-III: Speech-to-speech translation in multiple languages. In Proceedings of the 22nd IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 1997–2004, April 1997.
[90] Alon Lavie, Lori S. Levin, Robert E. Frederking, and Fabio Pianesi. The NESPOLE! speech-to-speech translation system. In Proceedings of the 5th Conference of the Association for Machine Translation in the Americas (AMTA), volume 2499 of Lecture Notes in Computer Science, pages 240–243. Springer, October 2002.
[91] P. Lendvai, A. Van den Bosch, E. Krahmer, and M. Swerts. Multi-feature error detection in spoken dialogue systems. In Proceedings of the 12th Computational Linguistics in the Netherlands Meeting, pages 163–178, Nov 2001.
[92] Ping Li, Curt Burgess, and Kevin Lund. The acquisition of word meaning through global lexical co-occurrences. In Proceedings of the Thirtieth Annual Child Language Research Forum, pages 166–178, 2000.
[93] Henry Lieberman, Alexander Faaborg, Waseem Daher, and Jose Espinosa. How to wreck a nice beach you sing calm incense. In International Conference on Intelligent User Interfaces, pages 278–280, San Diego, January 2005.
[94] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Marie-Francine Moens and Stan Szpakowicz, editors, Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81, Barcelona, Spain, July 2004.
[95] D. J. Litman, J. Hirschberg, and M. Swerts. Predicting automatic speech recognition performance using prosodic cues. In Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics, pages 218–225, 2000.
[96] Christopher D. Manning and Hinrich Schutze. Foundations of Statistical Natural Language Processing. MIT Press, 2002.
[97] J. Marion. Radiologists' attitudes can make or break speech recognition. Diagnostic Imaging Online, 2002. http://www.diagnosticimaging.com/db area/archives/2002/0202.marion.di.pacs.shtm Electronic document. Date of publication: February 1, 2002. Date retrieved: January 14, 2006.
[98] D. G. Maynard and S. Ananiadou. Incorporating linguistic information for multi-word term extraction. In 2nd Computational Linguistics UK Research Colloquium (CLUK2), 1999.
[99] A. Mehta, K. Dreyer, A. Schweitzer, J. Couris, and D. Rosenthal. Voice recognition – an emerging necessity within radiology: Experiences of the Massachusetts General Hospital. Journal of Digital Imaging, 11(4):20–23, 1998.
[100] J. Michael, J. L. Mejino, and C. Rosse. The role of definitions in biomedical concept representation. In Proceedings of the AMIA Symposium, pages 463–467, 2001.
[101] K. J. Mitchell, M. J. Becich, J. J. Berman, W. W. Chapman, J. Gilbertson, D. Gupta, J. Harrison, E. Legowski, and R. S. Crowley. Implementation and evaluation of a negation tagger in a pipeline-based system for information extraction from pathology reports. Medinfo, pages 663–667, 2004.
[102] Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997.
[103] S. Nelson, T. Powell, and B. Humphreys. The Unified Medical Language System (UMLS) Project, volume 71 of Encyclopedia of Library and Information Science, pages 369–378. Marcel Dekker Inc, 2002.
[104] Tamar Nordenberg. Make no mistake: Medical errors can be deadly serious. FDA Consumer, 34(5), September 2000.
[105] N. F. Noy, M. A. Musen, J. L. V. Mejino, and C. Rosse. Pushing the envelope: Challenges in a frame-based representation of human anatomy. Data and Knowledge Engineering, 48:335–359, 2004.
[106] OpenClinical. Description logics. In OpenClinical web site. Accessed: February 2006. Last known update: October 18, 2004. http://www.openclinical.org/descriptionlogics.html.
[107] OpenGALEN. OpenGALEN FAQ. In OpenGALEN web site. Accessed: February 2006. Last known update: 1999. http://www.opengalen.org/technology/galen-faq.html.
[108] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, July 2002.
[109] S. Pinker. Language Learnability and Language Development. Harvard University Press, Cambridge, MA, 1984.
[110] S. Pinker. The bootstrapping problem in language acquisition. In B. MacWhinney, editor, Mechanisms of Language Acquisition. Lawrence Erlbaum, Hillsdale, NJ, 1987.
[111] S. Pinker. How could a child use verb syntax to learn verb semantics? Lingua, 92:377–410, 1994.
[112] D. M. Pisanelli, A. Gangemi, M. Battaglia, and C. Catenacci. Coping with medical polysemy in the semantic web: the role of ontologies. In Proceedings of MedInfo 2004, pages 416–419, Amsterdam, September 7–11, 2004. IOS Press.
[113] R. Rada, H. Mili, E. Bicknell, and M. Blettner. Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man and Cybernetics, 19(1):17–30, Feb 1989.
[114] A. Rector. Clinical terminology: Why is it so hard? Methods of Information in Medicine, 38:239–252, 1999.
[115] A. Rector, W. Solomon, W. Nowlan, and T. Rush. A terminology server for medical language and medical information systems. In Proceedings of IMIA WG6, volume 34, pages 147–157, 1994.
[116] A. L. Rector, J. E. Rogers, and P. Pole. The GALEN high level ontology. In Proceedings of Medical Informatics Europe '96 (MIE'96), pages 174–178, Amsterdam, 1996. IOS Press.
[117] Alan Rector and Jeremy Rogers. Ontological issues in using a description logic to represent medical concepts: Experience from GALEN. In IMIA WG6 Workshop: Terminology and Natural Language in Medicine, Phoenix, Arizona, November 1999.
[118] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In IJCAI, pages 448–453, 1995.
[119] R. Richardson, A. Smeaton, and J. Murphy. Using WordNet as a knowledge base for measuring semantic similarity between words. Technical Report Working Paper CA-1294, School of Computer Applications, Dublin City University, 1994.
[120] T. Rindflesch, J. Rajah, and L. Hunter. Extracting molecular binding relationships from biomedical text. In Proceedings of the 6th Applied Natural Language Processing Conference (ANLP-NAACL 2000), pages 188–195, 2000.
[121] E. K. Ringger and J. F. Allen. A fertility model for post correction of continuous speech recognition. In ICSLP96, pages 897–900, 1996.
[122] J. F. Roddick, K. Hornsby, and D. deVries. A unifying semantic distance model for determining the similarity of attribute values. In M. J. Oudshoorn, editor, Proc. Twenty-Sixth Australasian Computer Science Conference (ACSC2003), volume 16, pages 111–118, 2003.
[123] J. M. Rodrigues, B. Trombert-Paviot, R. Baud, J. Wagner, and F. Meusnier-Carriot. GALEN-In-Use: Using artificial intelligence terminology tools to improve the linguistic coherence of a national coding system for surgical procedures. Medinfo, 9(1):623–627, 1998.
[124] J. Rogers, A. Roberts, D. Solomon, E. van der Haring, C. Wroe, P. Zanstra, and A. Rector. GALEN ten years on: Tasks and supporting tools. In Medinfo, volume 10, pages 256–260, 2001.
[125] Jeremy Rogers. Electronic mail correspondence. University of Manchester, Manchester, UK, March 4, 2005.
[126] Jeremy Rogers. Electronic mail correspondence. University of Manchester, Manchester, UK, June 21, 2005.
[127] Walter Rolandi. Alpha bail. Speech Technology Magazine, 11(1), January 2006.
[128] Patrick Ruch. Applying Natural Language Processing to Information Retrieval in Clinical Records and Biomedical Texts. PhD thesis, University of Geneva, March 2003.
[129] S. Shiffman, W. M. S. Detmer, C. D. Lane, and L. M. Fagan. A continuous-speech interface to a decision support system: I. Techniques to accommodate misrecognized input. AMIA, 2:36–45, 1995.
[130] N. Sager, M. Lyman, C. Bucknall, N. Nhan, and L. J. Tick. Natural language processing and the representation of clinical data. Journal of the American Medical Informatics Association, 1(2):142–160, 1994.
[131] Arup Sarma and David Palmer. Context-based speech recognition error detection and correction. In Proceedings of HLT-NAACL 2004, pages 85–88, 2004.
[132] SCAR. SCAR Expert Hotline: Speech recognition. In Eliot Siegal, editor, SCAR Spring Newsletter. Society for Computer Applications in Radiology, April 2002.
[133] U. Sinha, B. Dai, D. B. Johnson, R. Taira, J. Dionisio, G. Tashima, M. Golamco, and H. Kangarloo. Interactive software for generation and visualization of structured findings in radiology reports. Am. J. Roentgenology, 175(3):609–612, September 2000.
[134] U. Sinha, A. Yaghmai, B. Dai, L. Thompson, R. Taira, J. Dionisio, and H. Kangarloo. Evaluation of SNOMED 3.5 in representing concepts in chest radiology reports: Integration of a SNOMED mapper with a radiology reporting workstation. In Proceedings of AMIA Annual Symposium, pages 799–803, 2000.
[135] G. Skantze and J. Edlund. Early error detection on word level. In ISCA Tutorial and Research Workshop (ITRW) on Robustness Issues in Conversational Interaction, Norwich, UK, 2004.
[136] Gabriel Skantze. Error detection in spoken dialogue systems, 2002. Term paper, Graduate School for Language Technology, Faculty of Arts, Goteborg University. Course project in dialogue systems. Available: http://www.speech.kth.se/∼gabriel/publications.html. Accessed: February 2006. Last known update: September 15, 2005.
[137] Barry Smith, Anand Kumar, and Thomas Bittner. Basic formal ontology for bioinformatics. Journal of Information Systems, 2005.
[138] Neil Smith. Chomsky: Ideas and Ideals. Cambridge University Press, second edition, 1999.
[139] SNOMED. SNOMED International: Historical perspectives. In SNOMED International web site. Accessed: February 2006. Last known update: June 3, 2005. http://www.snomed.org/about/perspectives.html.
[140] K. Spackman, K. Campbell, and R. Cote. SNOMED RT: A reference terminology for health care. In Proceedings of AMIA, pages 640–644, 1997.
[141] G. Spanoudakis and P. Constantopoulos. Similarity for analogical software reuse: A computational model. In 11th European Conference on Artificial Intelligence (ECAI94), pages 18–22, Amsterdam, The Netherlands, 1994.
[142] G. Spanoudakis and P. Constantopoulos. Elaborating analogies from conceptual models. International Journal of Intelligent Systems, 11(11):917–974, 1996.
[143] P. Spyns. Natural language processing in medicine: An overview. Methods of Information in Medicine, 3:285–301, 1996.
[144] R. Taira and S. Soderland. A statistical natural language processor for medical reports. In Proc. AMIA Fall Symposium, pages 970–974, 1999.
[145] R. Taira, S. G. Soderland, and R. M. Jakobovits. Automatic structuring of radiology free-text reports. RadioGraphics, 21(1):237–245, Jan 2001.
[146] Paul Thagard. Mind: Introduction to Cognitive Science. The MIT Press, Cambridge, 2005.
[147] B. Trombert-Paviot, J. M. Rodrigues, J. E. Rogers, R. Baud, E. van der Haring, A. M. Rassinoux, V. Abrial, L. Clavel, and H. Idir. GALEN: a third generation terminology tool to support a multipurpose national coding system for surgical procedures. International Journal of Medical Informatics, 58–59(1):71–85, 2000.
[148] D. Tudhope and C. Taylor. Navigation via similarity: automatic linking based on semantic closeness. Information Processing and Management, 33(2):233–242, 1997.
[149] P. D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of the Twelfth European Conference on Machine Learning, pages 491–502, Freiburg, Germany, 2001.
[150] M. Turunen and J. Hakulinen. Agent-based error handling in spoken dialogue systems. In Proceedings of Eurospeech, pages 2189–2192, 2001.
[151] Geoffrey Underwood, editor. Oxford Guide to the Mind. Oxford University Press, Oxford, New York, 2001.
[152] Kimberly Voll. Medical language processing. ACM Computing Surveys, 2005. Submitted.
[153] Kimberly Voll. A Methodology of Error Detection: Improving Speech Recognition in Radiology. PhD thesis, Simon Fraser University, School of Computing Science, 8888 University Drive, Burnaby, BC, Canada, June 2006.
[154] Kimberly Voll, Stella Atkins, and Bruce Forster. Improving the utility of speech recognition through error detection. In SCAR Annual Meeting, 2006. In press.
[155] Kimberly Voll, Tom Yeh, and Veronica Dahl. An assumptive logic programming methodology for parsing. International Journal on Artificial Intelligence Tools, 10(4):573–588, 2001.
[156] W. Wahlster, editor. Verbmobil: Foundations of Speech-to-Speech Translation. Springer, 2000.
[157] Alex Waibel, Ajay N. Jain, Arthur E. McNair, Joe Tebelskis, Louise Osterholtz, Hiroaki Saito, Otto Schmidbauer, Tilo Sloboda, and Monika Woszczyna. JANUS: Speech-to-speech translation using connectionist and non-connectionist techniques. In Proceedings of Advanced Neural Information Processing Systems, pages 183–190, 1991.
[158] C. Wang and C. E. Kahn. Potential use of extensible markup language for radiology reporting: A tutorial. RadioGraphics, 20:287–293, 2000.
[159] Julie Weeds and David Weir. Co-occurrence retrieval: A flexible framework for lexical distributional similarity. Computational Linguistics, 31(4), 2006.
[160] D. L. Weiss. Speech recognition need not slow reporting time. SCAR Conference Reporter, August 2003.
[161] C. Welty and N. Guarino. Supporting ontological analysis of taxonomic relationships. Data and Knowledge Engineering, 39(1), 2001.
[162] F. Wessel, R. Schlüter, K. Macherey, and H. Ney. Confidence measures for large vocabulary continuous speech recognition. IEEE Transactions on Speech and Audio Processing, 9(3):288–298, March 2001.
[163] W. Woods and J. Schmolze. The KL-One family. Computers and Mathematics with Applications, 23(2-5):133–177, 1992.
[164] Manuel Zahariev. A (Acronyms). PhD thesis, Simon Fraser University, Vancouver, BC, June 2004.
[165] O. R. Zaïane, A. Fall, S. Rochefort, and V. Dahl. Concept-based retrieval using controlled natural language. In Proceedings of Computer-Assisted Searching on the Internet, pages 335–355, 1997.
[166] Songmao Zhang and Olivier Bodenreider. Investigating implicit knowledge in ontologies with application to the anatomical domain. In Proceedings of the 2004 Pacific Symposium on Biocomputing, pages 164–175. World Scientific Publishing Co., 2003.