A General Supervised Approach to Segmentation of Clinical Texts
DESCRIPTION
This was presented at the 2014 IEEE Conference on Big Data. For details: http://kavita-ganesan.com/content/general-supervised-approach-segmentation-clinical-texts
Citation: Ganesan, Kavita, and Michael Subotin. "A General Supervised Approach to Segmentation of Clinical Texts."
TRANSCRIPT
Kavita Ganesan & Michael Subotin
Presented at: 2014 IEEE Conference on Big Data
All sorts of note types!
Admit notes
◦ documenting why patient is being admitted
◦ baseline status, etc.
Progress notes
◦ progress during course of hospitalization
Discharge notes
◦ conclusion of a hospital stay or series of treatments
Others
◦ Operative notes
◦ Procedure notes
◦ Delivery notes
◦ Emergency Department notes, etc.
PRIMARY CARE PHYSICIAN:
Dr. XXXXX XXXXXXXX.
CHIEF COMPLAINT:
Injured right little toe.
HISTORY OF PRESENT ILLNESS:
This is a 63-year-old male with a past medical history of multiple
myeloma who presents today after hitting his fifth toe of the right foot
on a wood panel yesterday……
Review of Systems:
CONSTITUTIONAL: No fever, chills, or weight loss.
RESPIRATORY: No cough, shortness of breath, or wheezing.
CARDIOVASCULAR: No chest pain, chest pressure, or palpitations.
...............
PAST MEDICAL HISTORY
Multiple myeloma, peripheral neuropathy, hypertension..
PAST SURGICAL HISTORY:-
Stem cell transplant.
SOCIAL HISTORY
The patient formerly smoked tobacco; however, quit within the last 10
years.
FAMILY HISTORY:
Hypertension.
ALLERGIES:
ASPIRIN.
………
Purpose of visit
Patient’s current condition in narrative form
Ongoing issues, issues in the past
Information on allergies
This is how most notes look:
• some longer, some shorter
• different set of headers, etc.
PRIMARY CARE PHYSICIAN:
Dr. XXXXX XXXXXXXX.
CHIEF COMPLAINT:
Injured right little toe.
HISTORY OF PRESENT ILLNESS:
This is a 63-year-old male with
a past medical history of…
Review of Systems:
CONSTITUTIONAL: No fever,
chills, or weight loss.
CARDIOVASCULAR: No chest pain,
chest pressure, or palpitations.
...............
………
Very unstructured
◦ formatting cues inconsistent
◦ varies across physicians, notes, hospitals
Hard to analyze specific sections
◦ E.g. analyze allergies across a patient population
◦ Need to segment notes to extract all allergy info
◦ Information collected varies from note type to note type
  Ex. info on progress notes vs. admit notes
◦ Contents & formatting can vary from hospital to hospital
  Even within the same organization, e.g. Kaiser
◦ Contents & formatting vary between physicians
  Different styles, speed of typing, etc.
If you are looking at a single note type from a single hospital, then maybe.
Not suitable as a general segmentation approach:
Can easily break:
◦ on unseen note types and minor format variations
◦ Example: regex based on all caps, regex based on seen headers only
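To make the brittleness concrete, here is a minimal sketch of an all-caps header regex run against header lines taken from the sample note shown earlier. The pattern itself is an illustrative assumption, not a rule taken from the talk.

```python
import re

# Hypothetical rule: a header is an all-caps line that ends with a colon.
ALL_CAPS_HEADER = re.compile(r"^[A-Z][A-Z ]+:\s*$")

# Header lines copied from the sample note in this deck.
header_lines = [
    "CHIEF COMPLAINT:",
    "HISTORY OF PRESENT ILLNESS:",
    "Review of Systems:",       # mixed case
    "PAST MEDICAL HISTORY",     # no colon
    "PAST SURGICAL HISTORY:-",  # stray dash after the colon
    "ALLERGIES:",
]

detected = [ln for ln in header_lines if ALL_CAPS_HEADER.match(ln)]
missed = [ln for ln in header_lines if not ALL_CAPS_HEADER.match(ln)]

print("detected:", detected)
print("missed:", missed)
```

Three of the six headers are missed within a single note, purely because of minor formatting variation.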
Several works have explored supervised methods for segmenting clinical notes [Cho et al. 2003, Tepper et al. 2012, Apostolova et al. 2009]
Problem: methods not general!
◦ Cho et al. 2003: one model for each type of note
  20 note types → 20 models!
  Not practical to maintain each model
◦ Tepper et al. 2012: model had low adaptability to unseen documents
  (features used, training data used, etc.)
General segmentation approach for clinical texts
Requirements:
◦ Single model/approach for most note types
◦ Discount extreme non-standard formatting, e.g. tabular format
Segment:
◦ Header
◦ Top-level sections
◦ Footer
PRIMARY CARE PHYSICIAN:
Dr. XXXXX XXXXXXXX.
CHIEF COMPLAINT:
Injured right little toe.
HISTORY OF PRESENT ILLNESS:
This is a 63-year-old male with a past medical history of multiple
myeloma who presents today after hitting his fifth toe of the right foot
on a wood panel yesterday……
Review of Systems:
CONSTITUTIONAL: No fever, chills, or weight loss.
RESPIRATORY: No cough, shortness of breath, or wheezing.
CARDIOVASCULAR: No chest pain, chest pressure, or palpitations.
...............
PAST MEDICAL HISTORY
Multiple myeloma, peripheral neuropathy, hypertension..
PAST SURGICAL HISTORY:-
Stem cell transplant.
SOCIAL HISTORY
The patient formerly smoked tobacco; however, quit within the last 10
years.
FAMILY HISTORY:
Hypertension.
ALLERGIES:
ASPIRIN.
………
Slide annotations: Header, then Top-level section (×7)
Supervised approach using L1-regularized logistic regression with a constraint combination approach
Idea: scan each line in a clinical document and label it as:
◦ BeginHeader
◦ ContHeader
◦ BeginSection
◦ ContSection
◦ Footer
Labels are predicted with a certain confidence
But there is a problem with using line-wise predictions as is:
◦ Label sequences may not make sense
◦ E.g. there may be a BeginHeader after a BeginSection → incorrect
Post-processing: enforce sequence combination rules:
◦ First line of document: BeginHeader or BeginSection
◦ BeginHeader cannot come right after BeginHeader or ContHeader
◦ ContHeader must come after BeginHeader or ContHeader
◦ ContSection must come after BeginSection or ContSection
◦ Footer cannot come right after BeginHeader or ContHeader
Rules applied after all lines in the document are labeled
◦ Applied to consecutive label pairs
◦ Computed efficiently: Viterbi algorithm
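A minimal sketch of how such constrained decoding could look, assuming each line comes with a per-label probability from the classifier and that a sequence is scored by the product of its line probabilities. The scoring choice and all function names are assumptions; only the five pair rules listed above are encoded.

```python
import math

LABELS = ["BeginHeader", "ContHeader", "BeginSection", "ContSection", "Footer"]
HEADER = ("BeginHeader", "ContHeader")

def allowed_first(label):
    # Rule: first line of a document is BeginHeader or BeginSection.
    return label in ("BeginHeader", "BeginSection")

def allowed_pair(prev, cur):
    # Pair rules as listed on the slide; BeginSection has no listed rule.
    if cur == "BeginHeader":
        return prev not in HEADER
    if cur == "ContHeader":
        return prev in HEADER
    if cur == "ContSection":
        return prev in ("BeginSection", "ContSection")
    if cur == "Footer":
        return prev not in HEADER
    return True

def log_prob(p):
    return math.log(p) if p > 0 else float("-inf")

def constrained_decode(line_probs):
    """Viterbi search for the best label sequence that obeys the rules.

    line_probs: one dict per line, mapping each label to its probability.
    """
    best = [{lab: log_prob(line_probs[0][lab]) if allowed_first(lab)
             else float("-inf") for lab in LABELS}]
    back = [{}]
    for probs in line_probs[1:]:
        scores, pointers = {}, {}
        for cur in LABELS:
            prev_score, prev_lab = max(
                (best[-1][prev], prev) for prev in LABELS if allowed_pair(prev, cur)
            )
            scores[cur] = prev_score + log_prob(probs[cur])
            pointers[cur] = prev_lab
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final label.
    path = [max(LABELS, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy example: line-wise argmax would give BeginSection, ContHeader (invalid).
example_probs = [
    {"BeginHeader": 0.05, "ContHeader": 0.02, "BeginSection": 0.90,
     "ContSection": 0.02, "Footer": 0.01},
    {"BeginHeader": 0.03, "ContHeader": 0.60, "BeginSection": 0.05,
     "ContSection": 0.30, "Footer": 0.02},
]
decoded = constrained_decode(example_probs)
```

In the toy example the second line's most confident label (ContHeader) is rejected because no header was open, and the decoder falls back to ContSection.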
• Notes from 12 different enterprises
• Some large enterprises
• All sorts of note types
• Some noisy sectioning, some clean
Inpatient
• One hospital
• All sorts of note types
• Fairly well sectioned
• 35,000 notes in total
• 2000 randomly sampled notes (inpatient)
Outpatient
• 100 radiology notes
• Fairly clean sections
Emphasis on training data
Variation in training data
◦ Use different note types for training
◦ Intuition: helps the model generalize well
Sample training data:
◦ Instead of using all training data from 2100 notes
◦ Generated subsets of training data of varying size and cross-validated on test sets
◦ Intuition: allows us to pick the best model
  Best model used only < 700 notes (out of 2100)
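The sample-and-select idea can be sketched roughly as follows. The slides do not say how subsets were drawn or how many were tried per size, so the random sampling, the `train_fn` and `score_fn` callbacks, and the toy data below are all hypothetical stand-ins.

```python
import random

def pick_best_subset(notes, sizes, train_fn, score_fn, samples_per_size=3, seed=0):
    """Train on random subsets of several sizes and keep the best-scoring model.

    train_fn(subset) -> model and score_fn(model) -> float are supplied by the
    caller (hypothetical hooks for the real trainer and validation step).
    Returns a (score, subset_size, model) tuple.
    """
    rng = random.Random(seed)
    best = None
    for size in sizes:
        for _ in range(samples_per_size):
            subset = rng.sample(notes, size)
            model = train_fn(subset)
            score = score_fn(model)
            if best is None or score > best[0]:
                best = (score, size, model)
    return best

# Toy stand-ins: a "model" is just its training subset, and the score is the
# number of distinct note types it has seen.
toy_notes = [(note_type, i) for i in range(10) for note_type in ("hp", "ds", "pn")]
toy_sizes = [2, 5, 10]
best_score, best_size, best_model = pick_best_subset(
    toy_notes,
    toy_sizes,
    train_fn=lambda subset: subset,
    score_fn=lambda model: float(len({note_type for note_type, _ in model})),
)
```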
5 test sets
◦ 4 of the 5 test sets are from hospitals not in the train set → true estimate of accuracy
◦ Covers both inpatient and outpatient notes
◦ Covers different note types
◦ ~12,500 test notes
Primary evaluation metric: line-wise accuracy
◦ percentage of correctly predicted line labels
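As a small illustration of the metric, line-wise accuracy over one document can be computed as below. The function name and the example labels are made up for this sketch.

```python
def line_accuracy(gold_labels, predicted_labels):
    """Fraction of lines whose predicted label equals the gold label."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if not gold_labels:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)

gold = ["BeginHeader", "ContHeader", "BeginSection", "ContSection"]
pred = ["BeginHeader", "BeginSection", "BeginSection", "ContSection"]
accuracy = line_accuracy(gold, pred)  # 3 of 4 lines match
```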
Train set                    3-fold cross-validation accuracy    Unseen test accuracy
Inp1HospB (300 - limited)    96.70%                              67.00%
Inp3HospD (300 - varied)     96.58%                              88.23%
Important to have variety in training notes when building a general segmentation model
1st model: limited variety (hp + discharge)
2nd model: variety (11 types - hp, ds, pn…)
3-fold cross-validation accuracy: high in both
Model with variety: higher accuracy on unseen test set
Client/Data          In/Outpatient    # Test Docs    Accuracy
1. Inp1HospB         In               300            92.58%
2. Inp2HospC         In               1000           93.29%
3. Inp3HospD         In               300            95.81%
4. Rad1MixedHosps    Out              9000           92.45%
5. Rad2HospA         Out              1902           93.67%
Average                                              93.56%
Accuracy consistently > 90% across enterprises
• Average accuracy: 93.56%
• Covers inpatient/outpatient
Single model, but performs well across enterprises
Document Type               Accuracy
1. History and Physical     95.70%
2. Physician Clinicals      93.10%
3. Discharge Summary        94.00%
4. Consult Note             94.60%
5. Short Stay Summary       94.60%
6. Operative Note           92.20%
7. Progress Note            87.80%
8. Cardiac Cath Report      85.40%
9. Procedure Note           83.60%
• Model performs well across note types
• Lowest performance: procedure notes (low recall on segmenting “technique” sections)
Performs very well: > 90%
Reasonable: > 80%
Accuracy Breakdown for Inp2HospC
[Plot: # Notes vs. Accuracy. x-axis: # Training Notes (0 to 2000); y-axis: Accuracy (86.00% to 94.00%)]
Avg. accuracy peaks @500 notes on all test sets
No benefit with more notes
No need for big data for a general model. We need good data from all that big data!
Unigrams of each line (LineUnigram)
Relative position of line in document (PosInDoc)
◦ Top, Middle, Bottom
Known Header features (KnownHeader)
◦ Find potential headers using a repository of seen headers
◦ Seen headers can have a canonical type
  E.g. Past Medical History, Previous Med History → “PAST_MEDICAL_HISTORY”
◦ If potential headers are found, we include features:
  Canonical type
  Unigram & char n-gram of potential header
  Caps/colon info: mixed case, all caps, lowercase
  Length of potential header
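A rough sketch of how these three feature groups might be computed for one line. The header repository contents, the thirds-based Top/Middle/Bottom cut-offs, the candidate-header heuristic (text before the first colon), and the feature names are all assumptions made for illustration.

```python
import re

# Hypothetical repository of seen headers mapped to canonical types.
KNOWN_HEADERS = {
    "past medical history": "PAST_MEDICAL_HISTORY",
    "previous med history": "PAST_MEDICAL_HISTORY",
    "chief complaint": "CHIEF_COMPLAINT",
    "allergies": "ALLERGIES",
}

def pos_in_doc(index, total):
    # Assumed cut-offs: split the document into thirds.
    frac = index / max(total - 1, 1)
    if frac < 1 / 3:
        return "Top"
    if frac < 2 / 3:
        return "Middle"
    return "Bottom"

def case_style(text):
    letters = [c for c in text if c.isalpha()]
    if letters and all(c.isupper() for c in letters):
        return "all_caps"
    if letters and all(c.islower() for c in letters):
        return "lowercase"
    return "mixed_case"

def line_features(line, index, total):
    feats = {}
    # LineUnigram: one feature per lowercased token in the line.
    for tok in re.findall(r"[a-z0-9]+", line.lower()):
        feats["uni=" + tok] = 1.0
    # PosInDoc: coarse relative position of the line.
    feats["pos=" + pos_in_doc(index, total)] = 1.0
    # KnownHeader: look the candidate header up in the repository.
    candidate = line.split(":", 1)[0].strip()
    key = re.sub(r"[^a-z ]+", "", candidate.lower()).strip()
    canonical = KNOWN_HEADERS.get(key)
    if canonical:
        feats["hdr_type=" + canonical] = 1.0
        feats["hdr_case=" + case_style(candidate)] = 1.0
        feats["hdr_colon"] = 1.0 if ":" in line else 0.0
        feats["hdr_len"] = float(len(candidate))
        for i in range(len(key) - 2):  # character trigrams of the header
            feats["hdr_tri=" + key[i:i + 3]] = 1.0
    return feats

header_feats = line_features("PAST MEDICAL HISTORY", 9, 20)
body_feats = line_features("This is a 63-year-old male", 0, 20)
```

Feature dictionaries like these can then be vectorized and passed to the line classifier.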
Feature Set                         Avg. Accuracy    Improvement
LineUnigram                         85.55%
LineUnigram+PosInDoc                88.62%           +3.46%
LineUnigram+PosInDoc+KnownHeader    93.10%           +4.81%
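The Improvement column is not the plain difference in accuracy points. The reported figures are consistent with the gain being divided by the new accuracy, which is an inference from the numbers rather than something stated on the slide. A quick check:

```python
def gain_relative_to_new(old_acc, new_acc):
    """Gain in accuracy points, expressed as a percentage of the new accuracy."""
    return (new_acc - old_acc) / new_acc * 100

pos_gain = round(gain_relative_to_new(85.55, 88.62), 2)     # adding PosInDoc
header_gain = round(gain_relative_to_new(88.62, 93.10), 2)  # adding KnownHeader
print(pos_gain, header_gain)
```

In absolute terms the gains are 3.07 and 4.48 accuracy points respectively.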
Explored:
◦ Supervised approach to building a very general segmentation model for clinical texts
Evaluation showed:
◦ Model works well on notes across enterprises
◦ Model works across note types
Key to effectiveness:
◦ Variation in training data: all sorts of note types
◦ Training data selection strategy: sample and cross-validate
◦ Feature set: not explored in existing works
Contact: Kavita Ganesan
ganesan.kavita@gmail.com
www.kavita-ganesan.com
www.text-analytics101.com