Understanding Clinical Forms: Structure Discovery and SNOMED CT Annotation
Ritu Khare
Drexel University College of Medicine
Philadelphia, PA
NCBI, NLM, NIH, May 11, 2012
Presentation Order
1. Motivation: A flexible EHR
2. Form Understanding: Structure Discovery, Form Annotation
3. Contributions and Plans
Clinicians & Electronic Health Records (EHRs)
Clinician
Electronic Health Records
IT professionals and vendors
Inconsistent (Gurses et al., 2009)
Inflexible (Gurses et al., 2009; An et al., 2009)
Unintended consequences (Ash et al., 2004; Lee, 2007; Harrison et al., 2007)
Data Collection Needs
Integration of New Needs
Overall Workflow
The flexible Electronic Health Record (fEHR)
Form-based approach (using “forms” as design artifacts)
1. Clinicians are highly familiar with forms
2. Rich information embedded in forms guides DB design
Clinician: “I want to collect patient’s information, personal and vital signs, etc.”
[Figure: Clinician; The fEHR System (Form Design (or Import) Interface, Form Understanding and Mapping); EHR Database]
The flexible EHR: Key Challenges
[Figure: Clinician; The fEHR System (Form Design Interface, Form Understanding, Form Mapping); EHR Database]
1. Form Design Interface: Usability
2. Form Understanding: Information Extraction (Structure Discovery, Form Annotation)
3. Form Mapping: Schema & Data Integration
Presentation Order
1. Motivation: A flexible EHR
2. Form Understanding
• Form Structure Discovery: Hidden Markov Models
• Form Annotation
3. Contributions and Plans
Structure Discovery
The form tree accurately captures the contextual associations among the form elements. (Dragut et al. 2009, Wu et al. 2009)
[Figure: A clinical form and the corresponding form tree. Node legend: text label, format, value]
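As a rough illustration of the form tree idea, the sketch below represents a tree whose nodes carry one of the three kinds named in the figure legend (text label, format, value) and counts its edges. The `FormNode` class, the `count_edges` helper, and the sample form fragment are all hypothetical; the slides do not show the actual data structure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FormNode:
    """One node of a form tree; kind is 'text label', 'format', or 'value'."""
    label: str
    kind: str
    children: List["FormNode"] = field(default_factory=list)

    def add(self, child: "FormNode") -> "FormNode":
        """Attach a child and return it so calls can be chained."""
        self.children.append(child)
        return child


def count_edges(node: FormNode) -> int:
    """Number of parent-child edges in the subtree rooted at node."""
    return sum(1 + count_edges(child) for child in node.children)


# Hypothetical fragment of a clinical form (not taken from the datasets).
root = FormNode("root", "text label")
patient = root.add(FormNode("Patient", "text label"))
gender = patient.add(FormNode("Gender", "text label"))
radio = gender.add(FormNode("radio button", "format"))
radio.add(FormNode("M", "value"))
radio.add(FormNode("F", "value"))
name = patient.add(FormNode("Name", "text label"))
name.add(FormNode("textbox", "format"))

print(count_edges(root))
```

Each edge records a contextual association, for example that the values M and F belong to the Gender field of the Patient category.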
Challenges of Automatic Structure Discovery
Designed for human understanding
• Visual arrangements
• Past experiences
For a machine, a form is an unstructured document
• Source code contains only presentation/formatting structure
Existing Approaches (Zhang et al., 2004; He et al., 2004)
• Short search forms
• Rules and heuristics
Analysis of the Form Design Process
Elements and their sequence: Visible
• Category label, subcategory label, field, subfield, format, subformat, misc. text
Segment boundaries and roles: Hidden and arbitrarily laid out
• Demographics segment, medical decision segment, assessment segment, orders segment
The form design process can be modeled using Hidden Markov Models.
Using Hidden Markov Models (HMMs)
HMM: A finite state automaton with stochastic state transitions and symbol emissions (Rabiner, 1989)
Used to model and decode real-world processes that are implicit and unobservable
2-layered HMM
• T-HMM: assigns tags to elements, e.g., category, field, format, etc.
• S-HMM: creates groups of contextually related elements.
HMM-based artificial designer
[Figure: T-HMM tags (category, field, format, category) and S-HMM]
Inner Functionality of the 2-layered HMM
[Figure: Parser, T-HMM, S-HMM. Parser symbols: text, text area, checkbox. T-HMM states: category, sub-category, field, format, misc-text. S-HMM states: begin segment, inside segment, begin sub-segment, inside sub-segment, end segment/end sub-segment]
Algorithms
Supervised Training: Expectation Maximization
Testing: Viterbi
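To make the Viterbi decoding step concrete, here is a minimal log-space sketch for a tagging HMM in the spirit of the T-HMM. The state set, observation symbols, and all probabilities below are invented for illustration; the slides do not give the trained parameters, and the real T-HMM has more states (sub-category, misc-text, and so on).

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for obs (log-space Viterbi)."""
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s] = (log-prob of the best path ending in s at time t, previous state)
    best = [{s: (lg(start_p[s]) + lg(emit_p[s][obs[0]]), None) for s in states}]
    for t in range(1, len(obs)):
        column = {}
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p][0] + lg(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            column[s] = (score + lg(emit_p[s][obs[t]]), prev)
        best.append(column)

    # Backtrack from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for t in range(len(obs) - 1, 0, -1):
        state = best[t][state][1]
        path.append(state)
    return path[::-1]


# Hypothetical parameters, chosen only to make the example readable.
states = ["category", "field", "format"]
start_p = {"category": 0.8, "field": 0.2, "format": 0.0}
trans_p = {
    "category": {"category": 0.1, "field": 0.9, "format": 0.0},
    "field": {"category": 0.0, "field": 0.1, "format": 0.9},
    "format": {"category": 0.3, "field": 0.6, "format": 0.1},
}
emit_p = {
    "category": {"text": 1.0, "textbox": 0.0, "checkbox": 0.0},
    "field": {"text": 1.0, "textbox": 0.0, "checkbox": 0.0},
    "format": {"text": 0.0, "textbox": 0.6, "checkbox": 0.4},
}

tags = viterbi(["text", "text", "textbox", "text", "checkbox"],
               states, start_p, trans_p, emit_p)
print(tags)
```

Note that the two leading `text` symbols are indistinguishable by emission alone; the transition probabilities are what let the decoder tag the first as a category and the second as a field.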
Tree Generation
Overall Approach
Datasets (52 forms from 6 medical institutions)
Dataset                                        Avg. #Text   Avg. #Inputs
1  Walk in clinic encounter forms (3 forms)    32.33        49.33
2  Nursing patient admission forms (6 forms)   17.17        33
3  OB/GYN forms (7 forms)                      16.14        37.29
4  Adult visit encounter forms (18 forms)      47.83        65.22
5  Family practice forms (13 forms)            82.61        100.46
6  Child visit encounter forms (5 forms)       53           67.4
Home-grown interface
Home-grown DIY interface that captures designer’s on-the-fly intentions
HMM Training Data
T-HMM and S-HMM state sequences for each form
Gold Benchmark
For result evaluation: 52 Gold Std Trees
T-HMM: category, field, format, category, field, format, …
S-HMM: begin, inside, end, begin, inside, end, …
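One way to picture how the two state sequences work together: the T-HMM tags say what each element is, and the S-HMM states say where each group starts and stops. The sketch below groups a tag sequence into segments, assuming the S-HMM labels every element as begin, inside, or end. The function name `group_segments` is hypothetical, and sub-segments are left out for brevity.

```python
def group_segments(tags, seg_states):
    """Group element tags into segments using begin/inside/end labels."""
    segments, current = [], []
    for tag, state in zip(tags, seg_states):
        if state == "begin" and current:
            segments.append(current)  # close a segment that never saw "end"
            current = []
        current.append(tag)
        if state == "end":
            segments.append(current)
            current = []
    if current:
        segments.append(current)  # trailing elements without a closing "end"
    return segments


tags = ["category", "field", "format", "category", "field", "format"]
seg_states = ["begin", "inside", "end", "begin", "inside", "end"]
segments = group_segments(tags, seg_states)
print(segments)
```

Each resulting group can then become one subtree of the form tree, with the category as the parent of its fields and formats.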
Results: Tree Extraction (Structure Discovery) Accuracy
An average tree with 135 edges gets generated in 0.08 seconds.
                   Dataset1  Dataset2  Dataset3  Dataset4  Dataset5  Dataset6
Total Tree Edges   272       362       461       2606      2674      644
Accuracy           95.22%    97.51%    100%      97.58%    98.46%    96.11%
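The per-dataset figures can be pooled with a few lines of arithmetic. The sketch below assumes that each accuracy is the fraction of correctly extracted tree edges in that dataset (the slide does not define the measure), so the pooled value is an edge-weighted mean. Under that assumption the pooled accuracy comes out at roughly 97.8%, and the edge total divided by the 52 forms gives about 135 edges per tree.

```python
# Values copied from the results table.
edges = [272, 362, 461, 2606, 2674, 644]
accuracy_pct = [95.22, 97.51, 100.0, 97.58, 98.46, 96.11]
n_forms = 52

total_edges = sum(edges)
avg_edges_per_form = total_edges / n_forms

# Assumption: accuracy = correct edges / total edges, so weight by edge count.
pooled_accuracy = sum(e * a for e, a in zip(edges, accuracy_pct)) / total_edges

print(total_edges, round(avg_edges_per_form), round(pooled_accuracy, 2))
```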
HMM Testing: leave-one-out cross-validation
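A leave-one-out split can be sketched as follows: each form is held out once while the models are trained on the rest. The form identifiers are placeholders, and the slides do not say whether the hold-out was done within each dataset or across all 52 forms.

```python
def leave_one_out(items):
    """Yield (training_items, held_out_item) pairs, one per item."""
    for i, held_out in enumerate(items):
        yield items[:i] + items[i + 1:], held_out


forms = [f"form_{k}" for k in range(1, 6)]  # hypothetical form identifiers
folds = list(leave_one_out(forms))
for train, test in folds:
    # A real run would train the T-HMM and S-HMM on `train` and decode `test`.
    print(test, "held out; trained on", len(train), "forms")
```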
Conclusions
HMMs are very effective for structure discovery
Subsume existing approaches
Presentation Order
1. Motivation: A flexible EHR
2. Form Understanding
• Form Structure Discovery: Hidden Markov Models
• Form Annotation: Bayesian Classifier
3. Contributions and Plans
Form Annotation
Semantic heterogeneity across clinical data sources (Halevy, 2005; Henry et al., 1993; Hernandez et al., 2005; Wright et al., 1999)
• BP, Blood Pressure, Diastolic/Systolic
• MRN, Med Rec #, Medical Record Number
• Vital Signs, Constitutional, Physical Status
Controlled Medical Vocabularies should be involved in the design artifacts of the healthcare systems. (Jean et al., 2007, Sugumaran and Storey, 2002)
[Figure: Form Template (Design Artifact); fEHR with Form Annotation; EHR Database]
SNOMED CT
The Systematized Nomenclature of Medicine - Clinical Terms (Intl. Health Terminology Stds. Dev. Org.)
Most comprehensive clinical vocabulary (SNOMED CT User Guide, 2009).
>360,000 logically-defined clinical concepts (Hina et al., 2010; Stenzhorn et al., 2009).
Clinical Encounter Form
Form Term   SNOMED CT Concept
Patient     11615400: Patient (person)
MRN         398225001: Medical record number (observable entity)
SNOMED CT Concepts
concept id: 0231832
Fully-specified-name: Respiratory Rate (Observable Entity)
Preferred Term: Respiratory Rate
Synonym: Respiration Frequency
concept id: 362508001
Fully-specified-name: Both eyes, entire (Body Structure)
Preferred Term: Both eyes, entire
Synonym: OU- Both eyes
SNOMED CT Semantic Categories
• Attribute
• Body Structure
• Disorder
• Finding
• Observable Entity
• Occupation
• Person
• Physical Object
• Procedure
• Racial Group
• Situation
• …
Existing Annotation Services
SNOMED CT Browsers (Rogers and Bodenreider, 2008)
• General Search
• Category Specific Search
Form Annotation Challenges
Diversity Challenge: different clinicians, different terms
• MRN, Med. Rec.#
• Vital signs, Constitutional, Physical status
Context Challenge: same form term, different concepts.
Solution Premises
The key is to identify the SNOMED CT semantic category appropriate for a given term.
The first, i.e., the most string-similar, result retrieved by the category-specific search is usually the desired concept.
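The two premises can be illustrated with a toy category-specific search: restrict the candidate concepts to one semantic category, rank them by string similarity to the form term, and take the first result. This is only a sketch over an in-memory table; the real system queries a SNOMED CT search API whose interface is not shown in the slides. Only the ID 398225001 comes from the slides; the `demo-` IDs, the helper names, and the use of `difflib` as the similarity measure are assumptions.

```python
from difflib import SequenceMatcher

# Toy concept table: (concept id, term, semantic category).
# Only the first ID appears in the slides; "demo-" entries are placeholders.
CONCEPTS = [
    ("398225001", "Medical record number", "observable entity"),
    ("demo-1", "Medical record", "record artifact"),
    ("demo-2", "Respiratory rate", "observable entity"),
]


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def category_specific_search(term, category, concepts=CONCEPTS):
    """Return concepts in `category`, most string-similar to `term` first."""
    hits = [c for c in concepts if c[2] == category]
    return sorted(hits, key=lambda c: similarity(term, c[1]), reverse=True)


def annotate(term, category):
    """Pick the first (most similar) result, or None if the category is empty."""
    results = category_specific_search(term, category)
    return results[0] if results else None


print(annotate("medical record number", "observable entity"))
```

With the category fixed, the remaining problem is choosing that category automatically, which is the question the next slides address.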
How can the SNOMED CT semantic category appropriate for a given form term be determined automatically?
Naïve Bayes Classifier
Based on Bayes' theorem (Han and Kamber, 2006).
Class labels (SNOMED CT semantic categories): attribute, body structure, disorder, …
Classification features (local structure)
• Node type
• Parent node type
• Child node type
• Parent semantic category
• Grandparent semantic category
The implicit relationship between the term context (i.e., the form tree) and the desired semantic category can be formally captured into a STATISTICAL MODEL.
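A minimal sketch of such a statistical model is a Naive Bayes classifier over categorical features with add-one smoothing. The class below is written from scratch for illustration and is not the authors' implementation. The training rows are invented, and only two of the five features listed on the slide (node type and parent semantic category) are used to keep the example short.

```python
import math
from collections import Counter, defaultdict


class CategoricalNaiveBayes:
    """Naive Bayes over categorical features with add-one (Laplace) smoothing."""

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)  # (class, feature) -> value counts
        self.values = defaultdict(set)              # feature -> values seen in training
        for row, label in zip(rows, labels):
            for feat, val in row.items():
                self.feature_counts[(label, feat)][val] += 1
                self.values[feat].add(val)
        return self

    def log_scores(self, row):
        """Unnormalized log posterior for every class."""
        total = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            score = math.log(n / total)
            for feat, val in row.items():
                count = self.feature_counts[(label, feat)][val]
                score += math.log((count + 1) / (n + len(self.values[feat])))
            scores[label] = score
        return scores

    def predict(self, row):
        scores = self.log_scores(row)
        return max(scores, key=scores.get)


# Hypothetical training data derived from annotated form trees.
rows = [
    {"node_type": "field", "parent_category": "person"},
    {"node_type": "field", "parent_category": "procedure"},
    {"node_type": "value", "parent_category": "observable entity"},
    {"node_type": "value", "parent_category": "observable entity"},
    {"node_type": "category", "parent_category": "root"},
    {"node_type": "category", "parent_category": "root"},
]
labels = ["observable entity", "observable entity", "qualifier value",
          "qualifier value", "person", "procedure"]

model = CategoricalNaiveBayes().fit(rows, labels)
print(model.predict({"node_type": "value", "parent_category": "observable entity"}))
```

The log scores play the role of the category membership probabilities: the highest-scoring category is the one handed to the category-specific search.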
[Figure: example form tree (root; Patient, Examination; Name, Gender, Respiratory; M, F, nl, perc.) with nodes labeled by semantic category (Person, Procedure, Observable Entity, Qualifier Value, Finding). Components: Structure Analyzer, Features, Classification Model, Category Membership Probabilities]
Form Annotation Algorithm and Implementation
[Figure: Form Term; Form Tree; Training Data (Manual Annotations); Category Picker; Semantic Category; SNOMED CT Category Specific Search (API); SNOMED CT Concept. Example form tree: root; Patient, Examination; Name, Gender, Respiratory; labeled Person, Procedure, Observable Entity]
Hybrid = Contextual Structure + Linguistics
Data
Manual (Gold) Annotations
A total of 4235 form terms were manually studied, and 2506 (59%) had a corresponding SNOMED CT concept.
Term      Concept ID
Patient   11615400: Patient (person)
MRN       398225001: Medical record number (observable entity)
…         …
Some Unmapped Terms
• no scleral icterus
• chronic back pain
• Follow up with PCP
• Sent to ER
Dataset                                                 Avg. # Terms   SNOMED CT Mappability
1  Walk in clinic encounter forms (3 forms)             32.33          75.77%
2  Nursing patient admission forms (6 forms)            17.17          63.98%
3  Labor & delivery DB data-entry forms (7 forms)       16.14          58.8%
4  Adult visit encounter forms (18 forms)               47.83          56.2%
5  Family practice forms (13 forms)                     82.61          59.38%
6  Child visit encounter forms (5 forms)                53             62.21%
Experiment Design
Goal: To study whether structure can improve annotation performance.
Approaches
• Baseline (linguistics only): Form Term -> SNOMED CT General Search -> SNOMED CT Concept
• Hybrid (linguistics + structure): Structure Analyzer -> Features -> Classification Model -> Category Membership Probabilities -> Category Picker -> Semantic Category; Form Term + Semantic Category -> SNOMED CT Category Specific Search -> SNOMED CT Concept
• Hybrid++: Hybrid + candidate set expansion
Measures
• Precision: # correct annotations / # annotations
• Recall: # correct annotations / # gold annotations
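The precision and recall measures from this slide, together with the F-score reported in the results, can be computed as below. The F-score is assumed to be the balanced F1 (harmonic mean of precision and recall); the slides do not state the formula, but this assumption reproduces the reported values of 0.71 and 0.82.

```python
def precision(n_correct, n_annotations):
    """# correct annotations / # annotations produced."""
    return n_correct / n_annotations if n_annotations else 0.0


def recall(n_correct, n_gold):
    """# correct annotations / # gold annotations."""
    return n_correct / n_gold if n_gold else 0.0


def f_score(p, r):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


print(round(f_score(0.86, 0.60), 2))  # Hybrid++: 0.71
print(round(f_score(0.89, 0.76), 2))  # with term processing: 0.82
```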
Results
Baseline: p=0.60, r=0.46
Baseline to Hybrid
• Precision improved 26%
Hybrid to Hybrid++
• Precision improved 13%
• Recall improved 17%
Hybrid++: p=0.86, r=0.60 (F-score = 0.71)
Term processing component
• Remove special characters (-, #, /)
• Acronym expansion: BTL (Bilateral Tubal Ligation), VTE (Venous Thromboembolism)
• Precision improved only slightly (3-5%); recall improved substantially (25%)
• Final p=0.89, r=0.76 (F-score = 0.82)
Annotation duration per form = 1-11 s
Implications
• Contextual structure improves the overall annotation performance
• Linguistics only influence the recall
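The term processing component from the Results slide (special-character removal and acronym expansion) might look like the sketch below. The acronym table holds only the two examples from the slide, and replacing special characters with spaces rather than deleting them outright is a design assumption, made so that terms like "h/o" do not fuse into one token.

```python
import re

# Acronym table limited to the two examples on the slide (an assumption:
# the real component presumably uses a much larger table).
ACRONYMS = {
    "BTL": "Bilateral Tubal Ligation",
    "VTE": "Venous Thromboembolism",
}


def clean_term(term):
    """Replace the special characters -, # and / with spaces; collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[-#/]", " ", term)).strip()


def expand_acronyms(term, table=ACRONYMS):
    """Expand whole-word acronyms found in the table."""
    return " ".join(table.get(word.upper(), word) for word in term.split())


def process_term(term):
    """Clean a form term, then expand any known acronyms."""
    return expand_acronyms(clean_term(term))


print(process_term("Med Rec #"))
print(process_term("h/o VTE"))
```

Cleaning and expansion give the string search more to match against, which is consistent with the slide's observation that this step mainly improves recall.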
Presentation Order
1. Motivation: A flexible EHR
2. Form Understanding
• Form Structure Discovery: Hidden Markov Models
• Form Annotation: Bayesian Classifier
3. Contributions and Plans
Summary: Clinical Form Understanding
1. Structure Discovery
• Hidden Markov Models
• High accuracy (97.85%)
• Limitations: supervised learning; weak entities and other constraints; advanced form features
2. SNOMED CT Annotation
• Naïve Bayes Classifier
• 0.89 (precision) and 0.76 (recall)
• Structure helps improve annotation: 43% precision, 29% recall
• Limitations: supervised learning; leverages limited semantics from SNOMED CT
Related Publications: CIKM 2009, SIGMOD Record 2010, IHI 2010, ER 2011, IHI 2012
Application: the flexible EHR
[Figure: Clinician; The fEHR System: (1) Design or Import Form, (2) Form Understanding, (3) Mapping Algorithms; EHR Database]
• Discover Semantic Correspondences
• Evolve Existing Database
Experiments
• 52 forms (from 6 clinics) generate 6 databases (35-450)
• Annotation helps improve the integration process (database quality by 13%, merging scenario identification by 19%)
Other Applications
Structure Discovery
• Web Search Form Understanding: Deep Web Visibility, Meta-search Engines
• Used on any domain: Movies, health, automobile, …
• Biological Forms
SNOMED CT Annotation
• Clinical form-driven database design process.
• Database elements are named after form terms
• To prepare databases for future integration.
Current and Future Projects
Improving Form Annotation
• Involve expert annotator to prepare gold standards
• Specialty-specific forms: OB/GYN
• Use other UMLS terminologies
• Post-coordinated mapping
Unstructured EHR/Web data
• Extract structure from narrative data: visit notes, discharge summaries
• Error control algorithms
[Figure: A Typical Patient Visit Note (created by physician)]
Acknowledgements
Computer and Information Scientists: Dr Yuan An, Dr Tony Hu, Dr Jason Li, Dr Min Song, Dr Il-Yeol Song, Dr Christopher Yang
Physicians and Clinical Researchers: Dr Prudence Dalrymple, Dr Kalatu Davies, Dr Michele Follen, Dr Sandra Hartmann, Dr Paul Nyirjesy, Dr Sandra Wolf
Thank you