![Page 1: Im symposium presentation - OCR and Text analytics for Medical Chart Review Process](https://reader035.vdocuments.us/reader035/viewer/2022070519/58f37b851a28ab763a8b456f/html5/thumbnails/1.jpg)
OCR and Text Analytics for Medical Chart Review Process
Alex Zeltov, Darwin Leung, Ravi Chawla, Somesh Nigam
![Page 2: Im symposium presentation - OCR and Text analytics for Medical Chart Review Process](https://reader035.vdocuments.us/reader035/viewer/2022070519/58f37b851a28ab763a8b456f/html5/thumbnails/2.jpg)
BIOGRAPHY
Alex Zeltov
Research Scientist, Advanced Analytics, Independence Blue Cross
Leads the development and research of Big Data initiatives and predictive analytics across the Informatics Division for Independence Blue Cross.
Contact Info: Phone: 215.241.9885 Email: [email protected]
![Page 3: Im symposium presentation - OCR and Text analytics for Medical Chart Review Process](https://reader035.vdocuments.us/reader035/viewer/2022070519/58f37b851a28ab763a8b456f/html5/thumbnails/3.jpg)
BIOGRAPHY
Darwin Leung
Director, Informatics Application Development and Operations, Independence Blue Cross
Responsible for the development of analytical applications across the Informatics Division for Independence Blue Cross.
Contact Info: Phone: 215.241.2255 Email: [email protected]
![Page 4: Im symposium presentation - OCR and Text analytics for Medical Chart Review Process](https://reader035.vdocuments.us/reader035/viewer/2022070519/58f37b851a28ab763a8b456f/html5/thumbnails/4.jpg)
Background on Text Analytics and Medical Documents
Providers have different levels of technology readiness – varying from Electronic Medical Records (EMR) to paper charts.
We want to apply text analytics to all information available for different business cases.
Need to bring all information collected to a level where our technologies can be applied.
![Page 5: Im symposium presentation - OCR and Text analytics for Medical Chart Review Process](https://reader035.vdocuments.us/reader035/viewer/2022070519/58f37b851a28ab763a8b456f/html5/thumbnails/5.jpg)
OCR for medical documents
OCR (Optical Character Recognition) is useful for medical documents because the software saves cost and increases productivity.
High Speed Provided by OCR
OCR software can provide accuracy rates comparable to manual data entry, but in a fraction of the time.
![Page 6: Im symposium presentation - OCR and Text analytics for Medical Chart Review Process](https://reader035.vdocuments.us/reader035/viewer/2022070519/58f37b851a28ab763a8b456f/html5/thumbnails/6.jpg)
OCR + Text Analytics Process
[Diagram: processing flow, with the following boxes]
• IMG/PDF/TIF DropBox (Share)
• ImageMagick + OCR
• HADOOP Cluster: store text + PDF version of EMR in HADOOP
• Text Analytics / NLP processing
• Clinical Ontology
• Predictive Models
• Results
• DB
![Page 7: Im symposium presentation - OCR and Text analytics for Medical Chart Review Process](https://reader035.vdocuments.us/reader035/viewer/2022070519/58f37b851a28ab763a8b456f/html5/thumbnails/7.jpg)
Custom Distributed OCR Application:
High Performance distributed OCR process runs in the background, sharing resources with the Informatics Big Data HADOOP cluster.
Customized open source tools used in the OCR process:
• Custom distribution and parallelization framework for OCR
• PDFtk: normalizes PDF headers and splits up the PDF pages
• ImageMagick: resizes, rotates, increases DPI, and applies various special effects to enhance the quality of images. Creates an image version of the PDF (single page).
• Tesseract OCR:
  • extracts the text from the image file and generates a text file
  • generates searchable PDFs by creating metadata in the original PDF image files
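The tool chain above can be sketched as a sequence of command lines per chart. This is a minimal illustration, not the presenters' framework: the function names and the `work` directory are hypothetical, the 300 DPI setting is an assumed value, and the exact flags may vary between PDFtk, ImageMagick, and Tesseract versions. The commands are only built here, not executed.

```python
from pathlib import Path


def burst_command(pdf_path, work_dir):
    """PDFtk: split a multi-page PDF into one PDF per page."""
    pattern = str(Path(work_dir) / "page_%04d.pdf")
    return ["pdftk", str(pdf_path), "burst", "output", pattern]


def rasterize_command(page_pdf, dpi=300):
    """ImageMagick: render a single-page PDF as a TIFF at the given DPI (assumed value)."""
    tif_path = Path(page_pdf).with_suffix(".tif")
    return ["convert", "-density", str(dpi), str(page_pdf), "-depth", "8", str(tif_path)]


def ocr_commands(page_tif):
    """Tesseract: one run for plain text, one run for a searchable PDF."""
    base = str(Path(page_tif).with_suffix(""))
    return [
        ["tesseract", str(page_tif), base],          # writes <base>.txt
        ["tesseract", str(page_tif), base, "pdf"],   # writes <base>.pdf
    ]
```

Each command list could be passed to `subprocess.run(cmd, check=True)` on a worker node; distributing pages across nodes is left out of this sketch.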
![Page 8: Im symposium presentation - OCR and Text analytics for Medical Chart Review Process](https://reader035.vdocuments.us/reader035/viewer/2022070519/58f37b851a28ab763a8b456f/html5/thumbnails/8.jpg)
OCR Performance Statistics
Per Each Server Node:
• Image Enhancement and Document Slicing + OCR: ≈ 2 sec/pg
• 1,800 pages/hr on 1 node
18 HADOOP cluster nodes run the OCR process in parallel:
• 32,400 pages/hr on cluster
• Assuming a typical chart of 100 pages: ≈ 324 charts/hr
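The throughput figures follow from simple arithmetic, which the sketch below reproduces. It assumes perfectly linear scaling across nodes, as the slide does; the function name is hypothetical.

```python
def ocr_throughput(seconds_per_page, nodes, pages_per_chart):
    """Return (pages/hr per node, pages/hr on cluster, charts/hr on cluster)."""
    pages_per_node_hour = 3600 / seconds_per_page
    pages_per_cluster_hour = pages_per_node_hour * nodes  # assumes linear scaling
    charts_per_hour = pages_per_cluster_hour / pages_per_chart
    return pages_per_node_hour, pages_per_cluster_hour, charts_per_hour


per_node, per_cluster, charts = ocr_throughput(seconds_per_page=2, nodes=18, pages_per_chart=100)
print(per_node, per_cluster, charts)
```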
![Page 9: Im symposium presentation - OCR and Text analytics for Medical Chart Review Process](https://reader035.vdocuments.us/reader035/viewer/2022070519/58f37b851a28ab763a8b456f/html5/thumbnails/9.jpg)
Text Analytics Components:
Custom text analysis code using Java and Python
• Lucene – tokenization, shingles, n-gramming
• Weka – collection of machine learning algorithms for data mining.
• Annotation Query Language (AQL) – powerful text analytics engine developed by IBM and used by IBM Watson. Executes extractors in a highly efficient manner by using the parallelism provided by the Informatics HADOOP platform.
• OpenNLP – hosts a variety of Java-based NLP tools which perform sentence detection, tokenization, part-of-speech tagging, chunking and parsing, and named-entity detection.
![Page 10: Im symposium presentation - OCR and Text analytics for Medical Chart Review Process](https://reader035.vdocuments.us/reader035/viewer/2022070519/58f37b851a28ab763a8b456f/html5/thumbnails/10.jpg)
Ontology and Preprocessing
[Diagram: preprocessing flow, with the following boxes]
• Clinical Ontology DB Repo
• Load Ontology Terms Per Medical Condition
• Tokenize
• Stop Word Filters
• Ngram / Shingles, Stemming
• Generate Token Permutations
• Intermediate Ontology Tokens Per Job Type
• Hadoop GPFS
• Hadoop Text Analytics MR Jobs
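The preprocessing steps named in the diagram can be illustrated on a single ontology term. This is a plain-Python sketch under stated assumptions: the stop word list is a tiny illustrative one, stemming is omitted, and the function names are hypothetical. The deck's own pipeline uses Lucene for tokenization and shingles.

```python
import re
from itertools import permutations

# Illustrative stop word list; a real pipeline would use a fuller one.
STOP_WORDS = {"of", "the", "and", "with", "in"}


def tokenize(term):
    """Lowercase the term and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", term.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def shingles(tokens, size):
    """Word-level n-grams: every run of `size` consecutive tokens."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def token_permutations(tokens):
    """All orderings of the tokens, so 'hip fracture' also matches 'fracture hip'."""
    return sorted({" ".join(p) for p in permutations(tokens)})


tokens = remove_stop_words(tokenize("Fracture of the Hip"))
print(token_permutations(tokens))
```

Permutations grow factorially with the number of tokens, so a sketch like this is only practical for short terms.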
![Page 11: Im symposium presentation - OCR and Text analytics for Medical Chart Review Process](https://reader035.vdocuments.us/reader035/viewer/2022070519/58f37b851a28ab763a8b456f/html5/thumbnails/11.jpg)
HADOOP
• The HADOOP framework is a mechanism for analyzing huge datasets, which do not have to be housed in a datastore.
• HADOOP scales out to many nodes and can handle all of the activity and coordination related to data processing.
• HADOOP Map Reduce is a way to process large data sets by distributing the work across a large number of nodes.
![Page 12: Im symposium presentation - OCR and Text analytics for Medical Chart Review Process](https://reader035.vdocuments.us/reader035/viewer/2022070519/58f37b851a28ab763a8b456f/html5/thumbnails/12.jpg)
HADOOP Components:
• Common – contains libraries and utilities needed by other Hadoop modules.
• Hadoop Distributed File System (HDFS) – distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
  – HDFS creates multiple replicas of each data block and distributes them on computers throughout a cluster to enable reliable and rapid access.
• MapReduce – a programming model for large-scale data processing.
![Page 13: Im symposium presentation - OCR and Text analytics for Medical Chart Review Process](https://reader035.vdocuments.us/reader035/viewer/2022070519/58f37b851a28ab763a8b456f/html5/thumbnails/13.jpg)
HADOOP Components:
• HBase – a distributed, column-oriented NoSQL database.
• Hive – a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets.
• Sqoop – a tool designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases.
• Pig – scripting platform.
• Oozie – workflow scheduler.
• Zookeeper – cluster coordination.
• Mahout – machine learning library.
![Page 14: Im symposium presentation - OCR and Text analytics for Medical Chart Review Process](https://reader035.vdocuments.us/reader035/viewer/2022070519/58f37b851a28ab763a8b456f/html5/thumbnails/14.jpg)
Map Reduce
Map Reduce is a way to process large data sets by distributing the work across a large number of nodes.
• Map:
  o Master node partitions the input into smaller sub-problems
  o Distributes the sub-problems to the worker nodes
  o Worker nodes may in turn repeat the same process
• Reduce:
  o Master node then takes the answers to all the sub-problems
  o Combines them in some way to get the output
![Page 15: Im symposium presentation - OCR and Text analytics for Medical Chart Review Process](https://reader035.vdocuments.us/reader035/viewer/2022070519/58f37b851a28ab763a8b456f/html5/thumbnails/15.jpg)
Map Reduce - Word Count Example
http://www.cs.uml.edu/~jlu1/doc/source/report/MapReduce.html
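The word count pattern can be sketched in a single Python process. This is an illustration of the map, shuffle, and reduce steps only, with hypothetical function names and made-up input lines; it is not Hadoop code and does not distribute any work.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["deer bear river", "car car river", "deer car bear"]))
```

In a real cluster each input split would be mapped on a different worker, and the shuffle would move pairs with the same key to the same reducer.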
![Page 16: Im symposium presentation - OCR and Text analytics for Medical Chart Review Process](https://reader035.vdocuments.us/reader035/viewer/2022070519/58f37b851a28ab763a8b456f/html5/thumbnails/16.jpg)
Business Cases
Product Recall
Entity Extraction from Medical Charts
Nurse Chart Review Process
![Page 17: Im symposium presentation - OCR and Text analytics for Medical Chart Review Process](https://reader035.vdocuments.us/reader035/viewer/2022070519/58f37b851a28ab763a8b456f/html5/thumbnails/17.jpg)
Business Case 1: Product Recall
![Page 18: Im symposium presentation - OCR and Text analytics for Medical Chart Review Process](https://reader035.vdocuments.us/reader035/viewer/2022070519/58f37b851a28ab763a8b456f/html5/thumbnails/18.jpg)
Business Case 1: Product Recall
• The text mining process helps identify the manufacturers that are on the recall list.
• Scheduled reports alert on potentially affected members who match the recalled manufacturers.
• Create a database of extracted patient and manufacturer information.
• The OCR + text mining process analyzes charts that are 300+ pages long on average.
![Page 19: Im symposium presentation - OCR and Text analytics for Medical Chart Review Process](https://reader035.vdocuments.us/reader035/viewer/2022070519/58f37b851a28ab763a8b456f/html5/thumbnails/19.jpg)
Business Case 1: Product Recall
• Generated reports on the OCR results
• BigSheets – web-based spreadsheet look and feel
![Page 20: Im symposium presentation - OCR and Text analytics for Medical Chart Review Process](https://reader035.vdocuments.us/reader035/viewer/2022070519/58f37b851a28ab763a8b456f/html5/thumbnails/20.jpg)
Business Case 1: Entity Extraction
• Generated reports on the Entity Extraction results
• Create a database of extracted entity information accessible via JDBC/ODBC.
![Page 21: Im symposium presentation - OCR and Text analytics for Medical Chart Review Process](https://reader035.vdocuments.us/reader035/viewer/2022070519/58f37b851a28ab763a8b456f/html5/thumbnails/21.jpg)
Business Case 2: Nurse Chart Review Process
• The text mining process helps identify conditions and diagnoses based on the medical ontology matches for the nurse review.
• The text analytics prioritizes the charts for nurse review: the highest-scored EMR charts are presented first in the nurse review process.
• The nurse can open the text version of the chart that was created as part of the OCR process at the exact location of the matched terms in the scanned version of the chart.
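The prioritization described above can be sketched as dictionary matching with per-term weights, followed by a sort. The weights (1 for minor, 2 for major conditions) mirror the sample AQL in the appendix; the term lists, chart texts, and function names are hypothetical, and the production system uses AQL extractors on Hadoop rather than this code.

```python
import re

# Hypothetical ontology terms; real lists would come from the clinical ontology.
MINOR_TERMS = {"hypertension"}
MAJOR_TERMS = {"diabetes", "heart failure"}


def find_matches(text, terms, weight):
    """Return (offset, term, weight) for every whole-word match of a term."""
    lowered = text.lower()
    matches = []
    for term in terms:
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
            matches.append((m.start(), term, weight))
    return matches


def score_chart(text):
    """Score a chart and keep match offsets so a reviewer can jump to them."""
    matches = sorted(find_matches(text, MINOR_TERMS, 1) + find_matches(text, MAJOR_TERMS, 2))
    return sum(weight for _, _, weight in matches), matches


def prioritize(charts):
    """Order chart ids by descending score, breaking ties by id."""
    scored = [(score_chart(text)[0], chart_id) for chart_id, text in charts.items()]
    return [chart_id for _, chart_id in sorted(scored, key=lambda s: (-s[0], s[1]))]


charts = {
    "A": "History of hypertension.",
    "B": "Diabetes with heart failure; hypertension noted.",
    "C": "No findings.",
}
print(prioritize(charts))
```

Keeping the character offset of each match is what would let a review tool open the chart text at the matched term.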
![Page 22: Im symposium presentation - OCR and Text analytics for Medical Chart Review Process](https://reader035.vdocuments.us/reader035/viewer/2022070519/58f37b851a28ab763a8b456f/html5/thumbnails/22.jpg)
Summary
OCR software
It can operate at high speeds and often can process batches of medical documents in various formats (jpg, tiff, gif, pdf, etc.)
The text data can be stored in a database and then be used for analytics, predictive modeling and data mining
This technology provides invaluable benefits in terms of cost savings and productivity.
![Page 23: Im symposium presentation - OCR and Text analytics for Medical Chart Review Process](https://reader035.vdocuments.us/reader035/viewer/2022070519/58f37b851a28ab763a8b456f/html5/thumbnails/23.jpg)
Q & A
![Page 24: Im symposium presentation - OCR and Text analytics for Medical Chart Review Process](https://reader035.vdocuments.us/reader035/viewer/2022070519/58f37b851a28ab763a8b456f/html5/thumbnails/24.jpg)
Appendix
HADOOP Ecosystem
AQL
![Page 25: Im symposium presentation - OCR and Text analytics for Medical Chart Review Process](https://reader035.vdocuments.us/reader035/viewer/2022070519/58f37b851a28ab763a8b456f/html5/thumbnails/25.jpg)
HADOOP Ecosystem
![Page 26: Im symposium presentation - OCR and Text analytics for Medical Chart Review Process](https://reader035.vdocuments.us/reader035/viewer/2022070519/58f37b851a28ab763a8b456f/html5/thumbnails/26.jpg)
AQL: Advanced Text Analytics
• Powerful text analytics engine developed by IBM and used by IBM Watson on the Jeopardy quiz show.
• A declarative Annotation Query Language (AQL) with familiar SQL-like syntax for specifying text analytics extraction programs (or extractors) with rich, clean rule semantics.
• A runtime engine for executing extractors in a highly efficient manner by using the parallelism provided by the IBM InfoSphere BigInsights engine on the HADOOP platform.
• Built-in multilingual support for tokenization and part-of-speech analysis.
• The text analytics system extracts information from unstructured and semi-structured data.
![Page 27: Im symposium presentation - OCR and Text analytics for Medical Chart Review Process](https://reader035.vdocuments.us/reader035/viewer/2022070519/58f37b851a28ab763a8b456f/html5/thumbnails/27.jpg)
AQL
![Page 28: Im symposium presentation - OCR and Text analytics for Medical Chart Review Process](https://reader035.vdocuments.us/reader035/viewer/2022070519/58f37b851a28ab763a8b456f/html5/thumbnails/28.jpg)
Sample AQL
/* Dictionary of minor conditions */
create dictionary minorConditions
  from file 'minorConditions.dict'
  with language as 'en';

/* Dictionary of major conditions */
create dictionary majorConditions
  from file 'majorConditions.dict'
  with language as 'en';

/* Extract instances of minor conditions and 'score' 1 for each instance */
create view minor as
  extract 1 as disposition,
    dictionary 'minorConditions' on R.text as match
  from Document R;

/* Extract instances of major conditions and 'score' 2 for each instance */
create view major as
  extract 2 as disposition,
    dictionary 'majorConditions' on R.text as match
  from Document R;

/* Union together all instances */
create view RawDisposition as
  (select * from minor)
  union all
  (select * from major);

/* Aggregate per document score */
create view ConsolidatedDisposition as
  select Sum(R.disposition) as disposition
  from RawDisposition R;

export view ConsolidatedDisposition;
![Page 29: Im symposium presentation - OCR and Text analytics for Medical Chart Review Process](https://reader035.vdocuments.us/reader035/viewer/2022070519/58f37b851a28ab763a8b456f/html5/thumbnails/29.jpg)
Developing/Testing AQL query
![Page 30: Im symposium presentation - OCR and Text analytics for Medical Chart Review Process](https://reader035.vdocuments.us/reader035/viewer/2022070519/58f37b851a28ab763a8b456f/html5/thumbnails/30.jpg)
Entity Integration
![Page 31: Im symposium presentation - OCR and Text analytics for Medical Chart Review Process](https://reader035.vdocuments.us/reader035/viewer/2022070519/58f37b851a28ab763a8b456f/html5/thumbnails/31.jpg)
END