Information Extraction from Medical Records by Alexander Barsky
TRANSCRIPT
Current Methodology:
A broad assessment of the patient appears at the beginning of the chart, with references to more specific areas. Specific divisions follow the broad assessment. Records are listed in chronological order of activity.
Problem:
A patient's medical chart is highly detailed and complex. Any attempt to quickly locate specific information is likely to end in frustration.
Solution:
Create a system that extracts the wanted information based on a predefined set of parameters. Example: "Hormonal imbalance during puberty". Retrieve all references to hormonal imbalances, but only those that fall within a specific time period of the medical chart.
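The example query can be sketched in a few lines of Python. This is only an illustration of the idea, not the system described here: the chart entries, the `find_references` helper, and the date range are all hypothetical, and a plain substring match stands in for real semantic extraction.

```python
from datetime import date

# Hypothetical chart entries: (date of activity, note text).
CHART = [
    (date(2003, 11, 5), "Annual physical, no findings."),
    (date(2004, 3, 12), "Labs suggest hormonal imbalance; referred to endocrinology."),
    (date(2004, 9, 30), "Sprained ankle, ice and rest."),
    (date(2005, 1, 20), "Hormonal imbalance persists, treatment adjusted."),
    (date(2009, 6, 1), "Hormonal imbalance ruled out as cause of fatigue."),
]

def find_references(chart, phrase, start, end):
    """Return entries that mention `phrase` and are dated within [start, end]."""
    phrase = phrase.lower()
    return [
        (day, note)
        for day, note in chart
        if start <= day <= end and phrase in note.lower()
    ]

# Assumed "puberty" window for this made-up patient.
hits = find_references(CHART, "hormonal imbalance", date(2004, 1, 1), date(2006, 12, 31))
for day, note in hits:
    print(day.isoformat(), note)
```

The 2009 entry mentions the phrase but is dropped because it falls outside the requested period.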
Tools at our disposal:
JAPE: Java Annotation Patterns Engine. Use: pattern matching and semantic extraction.
GATE: General Architecture for Text Engineering. Use: information extraction, document annotation, and XML output.
C#: Visual C# WinForms. Use: medium for conversion between XML and .csv file formats.
Solution Methodology:
1. Create corpus of documents in GATE.
2. Introduce rules for information extraction.
3. Annotate documents in corpus.
4. Output annotated documents in XML.
5. Strip file of unnecessary elements and convert to .csv.
ANNIE
A Nearly-New Information Extraction System
- Tokeniser - splits a sentence into simple tokens.
- Gazetteer - identifies entity names contained in lists.
- Sentence Splitter - splits text into sentences based on lists.
- Part-of-Speech Tagger - tags each token with its part of speech (POS).
- Coreference Matcher - identifies relationships between previously defined entities.
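To make the first three components concrete, here is a toy Python sketch of what a sentence splitter, a tokeniser, and a gazetteer lookup each do. It is not ANNIE's implementation and does not use GATE; the regular expressions, the `GAZETTEER` lists, and the function names are assumptions chosen for illustration.

```python
import re

# Hypothetical gazetteer lists: phrase -> entity type.
GAZETTEER = {"hormonal imbalance": "Condition", "estrogen": "Hormone"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenise(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def gazetteer_lookup(sentence, gazetteer):
    """Return (start, end, type) for every listed phrase found in the sentence."""
    found = []
    lowered = sentence.lower()
    for phrase, label in gazetteer.items():
        for match in re.finditer(r"\b" + re.escape(phrase) + r"\b", lowered):
            found.append((match.start(), match.end(), label))
    return sorted(found)

text = "Patient reports fatigue. Labs show a hormonal imbalance; estrogen is low."
sentences = split_sentences(text)
for sentence in sentences:
    print(tokenise(sentence))
    for start, end, label in gazetteer_lookup(sentence, GAZETTEER):
        print("  ", label, "->", sentence[start:end])
```

A real splitter also has to handle abbreviations such as "Dr." that end in a full stop, which this naive version would split on.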
Problem: Too much unorganized information.
Solution:
XSLT to the rescue!
XSLT - Extensible Stylesheet Language Transformations - Add specific rules to separate needed from unnecessary information.
CSV File Type: Comma-Separated Values - Used to present information in tabular form. Useful for analyzing large amounts of data in an easy-to-understand format. The most common program used to open it is Excel.
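The strip-and-convert step can be sketched as follows. Python's standard library has no XSLT processor, so this sketch uses `xml.etree.ElementTree` as a stand-in for the stylesheet rules and the `csv` module for the output. The XML layout and the `entry`, `Date`, and `Condition` tag names are hypothetical and much simpler than real GATE output.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, simplified annotated document; real GATE XML is richer.
ANNOTATED_XML = """<document>
  <entry><Date>2004-03-12</Date> Patient shows signs of <Condition>hormonal imbalance</Condition>, follow-up advised.</entry>
  <entry><Date>2004-06-02</Date> Routine visit, no complaints.</entry>
  <entry><Date>2005-01-20</Date> <Condition>Hormonal imbalance</Condition> persists; <Condition>acne</Condition> noted.</entry>
</document>"""

def xml_to_rows(xml_text, wanted_tag="Condition"):
    """Keep only the wanted annotations, paired with the date of their entry."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.iter("entry"):
        entry_date = entry.findtext("Date", default="")
        for annotation in entry.iter(wanted_tag):
            rows.append((entry_date, wanted_tag, (annotation.text or "").strip()))
    return rows

def rows_to_csv(rows):
    """Serialise rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["date", "annotation", "text"])
    writer.writerows(rows)
    return buffer.getvalue()

rows = xml_to_rows(ANNOTATED_XML)
print(rows_to_csv(rows))
```

Everything that is not a wanted annotation (free text, entries with no match) is dropped, leaving one row per annotation that opens directly in a spreadsheet.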
Potential Problem:
Regardless of how well the ANNIE tools are utilized and how well the JAPE rules are defined, the recall percentage will never reach 100%.
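For reference, recall is the share of the truly relevant items that the system found, and precision is the share of the extracted items that are truly relevant. A minimal sketch of both measures, using made-up annotation IDs:

```python
def precision_recall(extracted, relevant):
    """Return (precision, recall) for a set of extracted items against a gold set."""
    extracted, relevant = set(extracted), set(relevant)
    true_positives = len(extracted & relevant)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical IDs: four references a human marked as relevant,
# three references the rules actually extracted.
relevant = {"e1", "e2", "e3", "e4"}
extracted = {"e1", "e2", "e5"}

p, r = precision_recall(extracted, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the three extracted items are correct, and two of the four relevant items were found.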
Solution: Machine Learning
Machine learning is our best chance to increase the precision of the output. Training a computer to recognize commonly used reporting phraseology will organize extraction better, with more precise, concise outputs. Luckily for us, GATE includes plugins for machine learning.
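As a rough idea of what "training a computer to recognize phraseology" means, the sketch below trains a tiny naive Bayes word classifier in plain Python. It is a toy stand-in, not one of GATE's learning plugins, and the training sentences and labels are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical hand-labelled sentences (1 = relevant to the query, 0 = not).
TRAINING = [
    ("labs show hormonal imbalance", 1),
    ("estrogen level low hormonal therapy started", 1),
    ("hormone panel abnormal", 1),
    ("sprained ankle ice and rest", 0),
    ("routine visit no complaints", 0),
    ("flu shot given", 0),
]

def train(examples):
    """Count words per label and the number of sentences per label."""
    counts = {0: Counter(), 1: Counter()}
    docs = Counter()
    for text, label in examples:
        counts[label].update(text.lower().split())
        docs[label] += 1
    vocab = set(counts[0]) | set(counts[1])
    return counts, docs, vocab

def predict(model, text):
    """Return the label with the higher naive Bayes score (add-one smoothing)."""
    counts, docs, vocab = model
    scores = {}
    for label in (0, 1):
        total = sum(counts[label].values())
        score = math.log(docs[label] / sum(docs.values()))
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train(TRAINING)
print(predict(model, "hormonal imbalance noted"))
print(predict(model, "ankle sprained again"))
```

With more labelled chart sentences, the same approach picks up the wording that clinicians actually use, which hand-written rules tend to miss.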