How can Natural Language Processing help MedDRA coding?
April 16 2018
Andrew Winter Ph.D., Senior Life Science Specialist, Linguamatics
Summary
• About NLP and NLP in life sciences
• Uses of NLP with MedDRA
• Examples in MedDRA coding of adverse events in FDA drug labels
• How NLP could feed into MedDRA development
© Linguamatics 2018
Use of NLP in Life Sciences
Advanced text analytics delivers value along the pipeline
Gene-disease mapping
Target ID/selection
Mutation/expression analysis
Toxicity analysis and prediction
Biomarker discovery
Drug repurposing
Patent analysis
KOL identification
Opportunity scouting
Trial site selection and study design
Safety
Competitive intelligence
Pharmacovigilance
Voice of the Customer analysis
Comparative Effectiveness
Regulatory Submission QC
HEOR
SAR
Social media analysis
IDMP
Real World Evidence
NLP Turns Text into Actionable Insights
Turn text into structured data using sophisticated queries
To drive analytics
Analytics
Enterprise Warehouse
Natural Language Processing – Ontologies – Statistical Methods – Machine Learning – Chemistry – Regular Expressions – etc.
Transform unstructured or semi-structured data into insights to advance human health
NLP finds information however it is expressed
Different word, same meaning
• cyclosporine
• ciclosporin
• Neoral
• Sandimmune
Different expression, same meaning
• Non-smoker
• Does not smoke
• Does not drink or smoke
• Denies tobacco use
Different grammar, same meaning
• 5mg/kg of cyclosporine per day
• 5mg/kg per day of cyclosporine
• cyclosporine 5mg/kg per day
Same word, different context
• Diagnosed with diabetes
• Family history of diabetes
• No family history of diabetes
NLP
Blend of powerful rule- and machine learning-based methods to transform unstructured data into structured data
Linguistic Processing
• Precise linguistic relationships, sentence co-occurrence
• Precise negation e.g. “pressure” not “blood pressure”
• Multiple languages
Terminologies / Ontologies
• Search for concepts and their synonyms with spelling and optical character recognition (OCR) correction
• Out of the box or custom ontologies
Quantitative Data
• Quantitative & pattern-based data extraction at scale e.g. numerical data, dates, gene mutations
• Range search
Chemistry
• Identify and extract chemicals in context based on substructure and chemical similarity
Results Normalization
• Ontology and rule-based normalization of results
• Essential for organizing structured output
• Enables indirect relations, filtering/faceting results, etc.
Table & Region Processing
• Unique capability to capture knowledge from tables embedded in documents
• Fielded search within regions of a document
Data normalization: always treat the same concept in the same way – the key to structured results
Concept        Text                                          Normalized Value
Diseases       breast cancer                                 Breast Neoplasm
               carcinoma of the breast
Genes          Raf-1                                         RAF1
               Raf I
Dates          27th Feb 2014                                 20140227
               2014/02/27
Measurements   0.2g                                          200 mg
               Two hundred milligrams
Mutations      Val 158 Met                                   V158M
               Val by Met at codon 158
Behaviours     denies alcohol and tobacco use
               Non-smoker
               is not a cigarette smoker
Relationships  ...nimesulide, a selective COX2 inhibitor, …  Entrez ID: 5743
                                                             inhibits
Data normalization
Overview
• Convert text into a standard format
• Is a fundamental component in transforming text into structured data and driving actionable insights
Key benefits
• Find concepts however they are expressed
• Join results to discover new indirect relationships
• Cluster or facet results by concept or quantity
• Compare measurements with different units e.g. kg vs. lbs
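The normalization idea above can be illustrated with a minimal Python sketch. It is not the Linguamatics implementation: the synonym table, the function names and the supported date and unit formats are all assumptions chosen to reproduce a few rows of the table, and a real system would draw on curated ontologies and far richer rules.

```python
import re
from datetime import datetime

# Hypothetical synonym table; a real system would use a curated ontology.
SYNONYMS = {
    "breast cancer": "Breast Neoplasm",
    "carcinoma of the breast": "Breast Neoplasm",
    "raf-1": "RAF1",
    "raf i": "RAF1",
}

DATE_FORMATS = ("%d %b %Y", "%Y/%m/%d")  # assumed input formats
UNIT_TO_MG = {"g": 1000.0, "mg": 1.0}


def normalize_concept(text):
    """Map a surface form to its normalized concept name, or None."""
    return SYNONYMS.get(text.strip().lower())


def normalize_date(text):
    """Return the date as YYYYMMDD, or None if no known format matches."""
    # Strip ordinal suffixes such as "27th" before parsing.
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", text.strip())
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).strftime("%Y%m%d")
        except ValueError:
            continue
    return None


def normalize_mass(text):
    """Return a mass such as '0.2g' expressed in mg, or None."""
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(mg|g)\s*", text)
    if m is None:
        return None
    value_mg = float(m.group(1)) * UNIT_TO_MG[m.group(2)]
    return f"{value_mg:g} mg"
```

Spelled-out quantities such as "Two hundred milligrams" are deliberately left unhandled here (the function returns `None`), which shows why simple patterns alone do not cover everything in the table.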
Use of NLP with MedDRA
• Errors in Regulatory Submissions
• Social Media
• Adverse Events in Drug Labels
Commonly reported conditions included Seasonal allergies, Back pain, and Hypercholesterolaemia. The majority of AEs were considered treatment related in all cohorts and the relationship between treatment groups and between cohorts was similar to that observed for all-causality AEs. Permanent discontinuations were reported at higher rates in the Rx groups than in the placebo groups in the 3 pooled cohorts. The majority of AEs leading to permanent discontinuation were considered treatment related in both treatment groups in all cohorts. The single most frequently reported event was headache, which was reported in approximately 40% of Rx subjects and 20% of placebo subjects in the 2000 Pooled cohort. Other AEs reported across all cohorts at rates greater in Rx subjects than placebo subjects included Seasonal allergies and Insomnia (2000 8.4% vs 5.4%, 2003 0.9% vs 0.8%, 2006 14.0% vs 10.1%; Rx vs placebo respectively).
Sample table and text highlighting, showing inconsistencies within the data. The highlight colour lets the reviewer rapidly assess where the errors are and what type they are, and then correct them appropriately.
Table: Most Frequently Reported Medical Conditions (≥5% in Any Treatment Group)
Study                               2000 Pooled Studies           2003 Pooled Study
Total Number Subjects               Rx N=997       Pbo N=927      Rx N=1021      Pbo N=956
Number (%) of Subjects
Cardiac disorders                   70 (7.0)       32 (3..5)      108 (10.6)     101 (10.6)
Angina pectoris                     4 (0.4)        5 (0.5)        74 (7.2)       71 (7.4)
Dyspepsia                           174 (17.5)     120 (12.9)     3 (0.3)        2 (0.2)
GERD                                83 (8.3)       52 (5.6)       30 (2.9)       27 (2.8%)
Metabolic / nutritional disorders   253 (25.4)     165 (17.8)     194 (19.0)     212 (22.2)
Dyslipedaemia                       1 (0.1)        0 (0)          15 (1.5)       19 (2.0)
Hypercholesterolaemia               65 (6.5)       50 (5.4)       88 (8.6)       103 (10.8)
Hyperlipidaemia                     147 (14.7)     79 (8.5)       56 (5.5)       66 (6.9)
Osteoarthritis                      102 (10.2)     57 (6.6)       12 (1.2)       11 (1.2)
Nervous system disorders            628 (63.0)     409 (44.1)     28 (2.7)       19 (2.0)
Headache                            413 (41.4)     280 (30.2)     9 (0.9)        7 (0.7)
Psychiatric disorders               137 (13.7)     81 (8.7)       14 (1.4)       15 (1.6)
Insomnia                            84 (8.4)       47 (5.1)       9 (0.9)        8 (0.8)
Key
• Incorrect formatting: doubled period, incorrect number of decimal places, addition of percent sign
• Incorrect calculation: number of patients divided by total number does not agree with percent term
• Incorrect threshold: presence of row does not agree with table title
• Text-Table inconsistency: numbers in the table do not agree with numbers in the accompanying text
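Three of the error types in the key lend themselves to simple automated checks. The sketch below is only an illustration under stated assumptions: the cell format "count (percent with one decimal)", the function names and the 0.05 tolerance are choices made here, not details given in the slides.

```python
import re

# Expected cell shape, e.g. "70 (7.0)": a count, then a percentage
# with exactly one decimal place and no percent sign.
CELL_PATTERN = re.compile(r"^(\d+)\s*\((\d+\.\d)\)$")


def check_cell(cell, group_size, tolerance=0.05):
    """Classify one table cell as ok, badly formatted, or miscalculated."""
    m = CELL_PATTERN.match(cell.strip())
    if m is None:
        return "incorrect formatting"
    count, reported = int(m.group(1)), float(m.group(2))
    expected = 100.0 * count / group_size
    # Allow for rounding of the reported value to one decimal place.
    if abs(expected - reported) > tolerance + 1e-9:
        return "incorrect calculation"
    return "ok"


def meets_threshold(percentages, threshold=5.0):
    """True if the row reaches the threshold in any treatment group."""
    return any(p >= threshold for p in percentages)
```

Applied to the sample table, `check_cell("32 (3..5)", 927)` is flagged for formatting, `check_cell("57 (6.6)", 927)` for calculation, and a row whose percentages are 0.1, 0, 1.5 and 2.0 fails `meets_threshold`. Text-table inconsistencies need the numbers extracted from the prose as well, which is outside this sketch.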
Use Case: Automated Blinded Data Review for Regulatory Submissions
• Before unblinding a clinical trial, data are checked for errors and inconsistencies
• Among the many checks performed, MedDRA terms for Adverse Events Reports are verified, including:
− Is the Preferred Term valid in any version of MedDRA?
  Reporter may have inserted the Investigator Entry in the wrong field, or used an LLT
− Are multiple MedDRA versions in use in the same trial?
  Reporter error, or error when generating the blinded data
− Does the specified version of MedDRA agree with the Preferred Terms being reported?
  Reporter may have used a more precise MedDRA term from a more recent version of MedDRA
− Does the Preferred Term agree with the declared System Organ Class?
Automation of this process is in use at large pharma
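As a rough illustration of the four checks listed above, the following Python sketch validates a batch of records against a toy dictionary. Everything here is an assumption for illustration: the record fields, the function name and the version contents are invented, and a real check would load licensed MedDRA release files instead.

```python
# Toy stand-in for MedDRA releases: version -> {Preferred Term: SOC}.
# The version contents are invented and do not reflect real MedDRA history.
TOY_MEDDRA = {
    "20.0": {"Headache": "Nervous system disorders"},
    "21.0": {
        "Headache": "Nervous system disorders",
        "Insomnia": "Psychiatric disorders",
    },
}


def check_records(records, dictionary=TOY_MEDDRA):
    """Return a list of human-readable issues found in the records."""
    issues = []
    versions = {r["version"] for r in records}
    if len(versions) > 1:
        issues.append(f"multiple MedDRA versions in use: {sorted(versions)}")
    for i, r in enumerate(records):
        pt, version, soc = r["pt"], r["version"], r["soc"]
        known_in = sorted(v for v, terms in dictionary.items() if pt in terms)
        if not known_in:
            issues.append(f"record {i}: '{pt}' is not a PT in any version")
        elif version not in known_in:
            issues.append(f"record {i}: '{pt}' is not in version {version}")
        elif dictionary[version][pt] != soc:
            issues.append(f"record {i}: '{pt}' does not belong to SOC '{soc}'")
    return issues
```

Each record is reported against the first check it fails, so a term that is unknown in every version is not also reported as a version or SOC mismatch.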
Use Case: Social Media Analysis
• Social media: plenty of AEs mentioned
• Language informal
• Linguistic patterns can find mentions of AEs without using a dictionary
• Using MedDRA LLTs finds only one of the following 4 examples
Use Case: Extraction of Adverse Events using MedDRA
• Extraction of adverse events, MedDRA terms and frequency of occurrence, clustered by medicinal product
• Structured results can be used to populate a database, e.g. IDMP
− Different customers have different MedDRA requirements, e.g. PT vs LLT, which is easy to accommodate
Results table (background) and highlighted source document (foreground) are shown
Extraction of AEs from FDA Drug Labels
• FDA drug labels are not structured
• Want to compare AEs found in Real World Evidence with known AEs
• Find AEs within text and within tables
• Pull out values in order to filter results to only those AEs reported at a higher rate than placebo
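Once the values have been extracted into structured rows, the placebo filter itself is a simple comparison. A minimal sketch, assuming hypothetical field names (`ae`, `drug_pct`, `placebo_pct`) and made-up rows rather than values from any real label:

```python
def aes_above_placebo(rows):
    """Return AE names whose treatment-arm rate exceeds the placebo rate."""
    return [row["ae"] for row in rows if row["drug_pct"] > row["placebo_pct"]]


# Made-up rows standing in for values extracted from a label table.
rows = [
    {"ae": "Headache", "drug_pct": 12.0, "placebo_pct": 8.0},
    {"ae": "Nausea", "drug_pct": 3.0, "placebo_pct": 3.0},
    {"ae": "Insomnia", "drug_pct": 1.5, "placebo_pct": 2.0},
]
print(aes_above_placebo(rows))
```

The hard part in practice is the extraction step that produces such rows from unstructured text and tables, which this sketch takes as given.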
Use of NLP terminology features in extracting AEs
Increase recall with:
− Morphological variants
− Spelling correction
− Matching across conjunctions
− Mapping multiple concepts to MedDRA PT
Increase precision with:
− Excluding inappropriate contexts
− Use of document sections to exclude inappropriate terms
Increase recall: morphological variants
MedDRA PT “Congenital anomaly”
* Additional hits when using morphological variants
Increase recall: MedDRA matching across conjunctions
MedDRA PT “Hepatic neoplasm” OR “Thyroid neoplasm”
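Matching across conjunctions means that a phrase such as "hepatic and thyroid neoplasm" should yield both PTs even though "Hepatic neoplasm" never appears as a contiguous string. The Python sketch below shows one naive way to do this with a regular expression. The pattern, function names and example sentence are assumptions for illustration, and the approach only handles two single-word modifiers sharing one head word.

```python
import re

# "A and B head" or "A or B head", with single-word A, B and head.
CONJ = re.compile(r"\b(\w+)\s+(?:and|or)\s+(\w+)\s+(\w+)\b", re.IGNORECASE)


def expand_conjunctions(text):
    """Return candidate two-word terms from 'A and B head' constructions."""
    candidates = []
    for m in CONJ.finditer(text):
        first, second, head = m.groups()
        candidates.append(f"{first} {head}".lower())
        candidates.append(f"{second} {head}".lower())
    return candidates


def find_pts(text, preferred_terms):
    """Return the preferred terms recovered from conjunction expansion."""
    lookup = {pt.lower(): pt for pt in preferred_terms}
    return [lookup[c] for c in expand_conjunctions(text) if c in lookup]


pts = ["Hepatic neoplasm", "Thyroid neoplasm"]
print(find_pts("Cases of hepatic and thyroid neoplasm were reported.", pts))
```

Plural forms such as "neoplasms" would not match without the morphological-variant handling described on the previous slide.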
Increase recall: mapping multiple concepts to a MedDRA PT
MedDRA PT “Blood creatinine increased”
• Blood creatinine increased
• Creatinine blood increased
• Creatinine high
• Creatinine increased
• Creatinine serum increased
• Increased serum creatinine
• Plasma creatinine increased
• Raised serum creatinine
• Serum creatinine increased
has low recall.
Combining MedDRA PT “Blood creatinine”
• Blood creatinine
• Creatinine
• Plasma creatinine
• Serum creatinine
with Relation “Increase”
• Increase
• Elevate
• Raise
• ...
in a linguistic pattern allowing flexibility in expression gives significant additional recall (hits marked * in the figure).
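The combination of a concept with a relation can be approximated, very crudely, by a regular expression that accepts either order and a few intervening words. This is only a sketch of the idea, not the linguistic pattern engine used in the slides: the word stems, the three-word gap and the function name are all assumptions.

```python
import re

# Concept: creatinine, optionally qualified by blood/plasma/serum.
CONCEPT = r"(?:(?:blood|plasma|serum)\s+)?creatinine"
# Relation: crude stems covering increase/elevate/raise and their variants.
RELATION = r"(?:increas|elevat|rais)\w*"
# Up to three intervening words between relation and concept.
GAP = r"(?:\W+\w+){0,3}?\W+"

PATTERN = re.compile(
    rf"\b{RELATION}{GAP}{CONCEPT}\b|\b{CONCEPT}{GAP}{RELATION}",
    re.IGNORECASE,
)


def mentions_creatinine_increase(sentence):
    """True if the sentence links creatinine with an 'increase' relation."""
    return PATTERN.search(sentence) is not None
```

With this sketch, "Elevation of blood creatinine was observed" and "Serum creatinine levels were raised" both match, although neither is in the synonym list. Stem matching of this kind will also over-match (for example "raisin"), and it ignores negation, which is where proper linguistic processing is needed.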
Increase precision: exclusion of hits in inappropriate contexts when searching for adverse events
Thousands of examples of MedDRA concepts that are not AEs. Linguistic patterns can filter out inappropriate contexts.
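A very simple form of context filtering looks for cue phrases shortly before the matched term. The sketch below assumes a small, hypothetical cue list and a fixed character window. Real linguistic patterns would use sentence structure instead of a character window, so treat this only as an illustration of the principle.

```python
import re

# Hypothetical cues signalling that a term is not being reported as an AE.
EXCLUDING_CUES = [
    r"\bno\b",
    r"\bdenies\b",
    r"\bhistory of\b",
    r"\bindicated for\b",
    r"\btreatment of\b",
]
CUE_PATTERN = re.compile("|".join(EXCLUDING_CUES), re.IGNORECASE)


def is_candidate_ae(sentence, term, window=40):
    """True if `term` occurs without an excluding cue shortly before it."""
    idx = sentence.lower().find(term.lower())
    if idx == -1:
        return False
    preceding = sentence[max(0, idx - window):idx]
    return CUE_PATTERN.search(preceding) is None
```

For example, "Patients reported headache and nausea." keeps "headache" as a candidate, while "Indicated for the treatment of migraine headache." does not.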
Increase precision: using document regions - exclusion of PTs that occur in Indications when searching for AEs
AE hits can be removed when the same PT also occurs in the Indications section
How NLP could feed into MedDRA development: improved coverage of terminology
• Terms appearing with MedDRA terms in the same list
• Explicit constructions such as “AEs such as”, or from tables
• Look for terms in appropriate contexts e.g. “made me ?”
Noun phrases occurring in a list after “adverse events such as”, and which are not already in MedDRA
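The list-based discovery described above can be sketched in Python as follows. This is an assumption-laden illustration: list items are split on commas and "and"/"or" instead of being parsed as noun phrases, and `known_terms` is a toy set standing in for the MedDRA terminology.

```python
import re

# Capture the list that follows "adverse events such as" or "AEs such as".
LIST_PATTERN = re.compile(
    r"(?:adverse events|AEs) such as ([^.;]+)", re.IGNORECASE
)
# Split list items on commas and on "and"/"or".
ITEM_SPLIT = re.compile(r",\s*(?:and\s+|or\s+)?|\s+and\s+|\s+or\s+")


def candidate_new_terms(text, known_terms):
    """Return list items after 'AEs such as' that are not already known."""
    known = {t.lower() for t in known_terms}
    candidates = []
    for m in LIST_PATTERN.finditer(text):
        for item in ITEM_SPLIT.split(m.group(1)):
            item = item.strip()
            if item and item.lower() not in known and item not in candidates:
                candidates.append(item)
    return candidates


text = "Patients reported adverse events such as headache, brain fog, and insomnia."
print(candidate_new_terms(text, ["Headache", "Insomnia"]))
```

Candidates found this way would still need review by a terminologist before being proposed as new terms.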
Summary
• NLP is required to rule out inappropriate contexts, improving precision
• NLP techniques, e.g. morphological variants and OCR correction, improve recall
• String-based synonym matching cannot cope with all the variation found in real text, e.g. “Elevation of blood creatinine”; here linguistic patterns are required
• Region and table processing are often required to get the right context