extracting medical attributes and finding relations

Extracting Medical Attributes and finding relations Sanghamitra Deb Accenture Technology Laboratory


Post on 17-Feb-2017


TRANSCRIPT

Page 1: Extracting medical attributes and finding relations

Extracting Medical Attributes and finding relations

Sanghamitra Deb Accenture Technology Laboratory

Page 2: Extracting medical attributes and finding relations

drugs

side effects

Personalized Medicine

ethnicity

dosages

diseases

age group

compounds

gender

interactions

?

?

?

Page 3: Extracting medical attributes and finding relations

FDA Drug Labels

Page 4: Extracting medical attributes and finding relations

It is indicated for treating respiratory disorder caused due to allergy.

For the relief of symptoms of depression.

Evidence supporting efficacy of carbamazepine as an anticonvulsant was derived from active drug-controlled studies that enrolled patients with the following seizure types:

LOTEMAX is a corticosteroid indicated for the treatment of post-operative inflammation and pain following ocular surgery.

FDA Drug Labels: Examples

Page 5: Extracting medical attributes and finding relations

We present a case of a 10-year-old boy who had severe relapsing pancreatitis three times in two months within 3 weeks after starting treatment with methylphenidate ( ritalin ) due to attention deficit hyperactivity disorder (adhd). The boy was generally healthy except for that he was newly diagnosed with adhd and started the use of methylphenidate ( ritalin ) for the past three weeks at a dose, of 30 mg daily. We believe that the number of persons suffering from pancreatitis due to the use of ritalin is more than this published case. Physicians must pay attention regarding this possible complication and it should be taken into consideration in every patient with abdominal pain who started consuming ritalin.

Meta Data

Dosage single dose: 240 ml

Drug methylphenidate

# of vol 30mg

Clinical Trials: Meta Data

Page 6: Extracting medical attributes and finding relations

We present a case of a 10-year-old boy who had severe relapsing pancreatitis three times in two months within 3 weeks after starting treatment with methylphenidate ( ritalin ) due to attention deficit hyperactivity disorder (adhd). The boy was generally healthy except for that he was newly diagnosed with adhd and started the use of methylphenidate ( ritalin ) for the past three weeks at a dose, of 30 mg daily. We believe that the number of persons suffering from pancreatitis due to the use of ritalin is more than this published case. Physicians must pay attention regarding this possible complication and it should be taken into consideration in every patient with abdominal pain who started consuming ritalin.

Drug | Adverse Effects
Ritalin | pancreatitis, abdominal pain
Tylenol | nausea, upper stomach pain, itching, loss of appetite
Aspirin | rash, gastrointestinal ulcerations, abdominal pain, upset stomach, heartburn

Clinical Trials: Side Effects

Page 7: Extracting medical attributes and finding relations

Drug—Disease

• Off-Label Drug Uses

• Database completion

• Design of clinical trials

Relationship between metadata

• How does heart disease correlate with gender and age?

• Which universities have the most successful clinical trials for breast cancer?

• How are genes and phenotypes related?

• What dosage for ritalin was most effective in treating ADHD with least side effects?

Problems it Solves

Page 8: Extracting medical attributes and finding relations

8000 drug - disease treatment relationships from UMLS data

drug_name:’metipred|methylprednisolone|methylprednisolone preparation|methylprednisolonum|6alpha-methylprednisolone|6-alpha-methylprednisolone preparation|methylprednisolone|pregna-1,4-diene-3,20-dione, 11,17,21-trihydroxy-6-methyl-, (6alpha,11beta)-|(6alpha,11beta)-11,17,21-trihydroxy-6-methylpregna-1,4-diene-3,20-dione|methylprednisolone|meprdl|methylprednisolone|6-methylprednisolone|6 methylprednisolone'

disease_name: 'respiratory distress syndrome, acute|pulmonary capillary leak syndrome|wet lung syndrome|acute respiratory distress syndrome|shock lung|adult respiratory distress syndrome|shock lung|human ards|adult respiratory distress syndrome|wet lung|ards - adult respiratory distress syndrome|acquired respiratory distress syndrome|adult rds|ards|adult respiratory syndrome|a.r.d.s.|danang lung|danang lung|respiratory distress syndrome|adult respiratory distress syndrome, ards|shock lung|respiratory distress syndrome, adult|adult respiratory distress syndrome|vietnam lung|rds|lung, shock|adult hyaline membrane disease|ards, human|adult respiratory distress syndrome|adult hyaline membrane disease|ardss, human|a r d s|adult rds|congestive atelectasis|ards|respiratory distress syndrome|respiratory distress syndrome, adult|adult respiratory distress syndrome’

Training Data

Page 9: Extracting medical attributes and finding relations

Extract sentences that contain the specific attribute

POS tag and extract unigrams, bigrams and trigrams centered on nouns

Extract Features: words around nouns (bag of words/word vectors), position of the noun.

Train a Machine Learning model to predict which unigrams, bigrams or trigrams satisfy the specific relationship, for example the drug-disease treatment relationship.

Map training data to create a balanced positive and negative training set.

Course of Action
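The candidate-extraction step above can be sketched in a few lines. This is a minimal illustration, not the method used in the talk: it assumes the sentence has already been POS tagged with Penn Treebank style tags, and it interprets "centered on nouns" as n-grams that end on a noun and are preceded only by adjectives or nouns. The function name `noun_ngrams` and the hand-assigned tags are hypothetical.

```python
def noun_ngrams(tagged_tokens, max_n=3):
    """Return unigrams, bigrams and trigrams whose last token is a noun.

    tagged_tokens is a list of (word, tag) pairs. Tokens before the head
    noun must be nouns or adjectives (an assumed noun-phrase heuristic).
    """
    candidates = []
    for end, (_, tag) in enumerate(tagged_tokens):
        if not tag.startswith("NN"):
            continue  # only n-grams that end on a noun are candidates
        for n in range(1, max_n + 1):
            start = end - n + 1
            if start < 0:
                break
            span = tagged_tokens[start:end + 1]
            if all(t.startswith(("NN", "JJ")) for _, t in span[:-1]):
                candidates.append(" ".join(w for w, _ in span))
    return candidates


# Hand-tagged fragment of a lemmatized drug-label sentence (tags are assumed).
tagged = [
    ("maintenance", "NN"), ("therapy", "NN"), ("reduce", "VB"),
    ("the", "DT"), ("frequency", "NN"), ("of", "IN"),
    ("manic", "JJ"), ("episode", "NN"),
]
candidates = noun_ngrams(tagged)
print(candidates)
```

Each candidate would then be paired with the drug name and labelled 1 or 0 depending on whether it appears in the known list of diseases the drug treats.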

Page 10: Extracting medical attributes and finding relations

Creating Labelled Data

lemmatized_sentence: ['maintenance', 'therapy', 'reduce', 'the', 'frequency', 'of', 'manic', 'episode', 'and', 'diminish', 'the', 'intensity', 'of', 'those', 'episode', 'which', 'may', 'occur', '.']

Several Candidates

Typically one of them is the disease that the drug treats. For every drug we create training data. One line of text produces 5 lines of training data with one true positive.

Balancing the Training Data

Since the training data contains a higher percentage of zeros than ones, it is important to balance it before modeling; i.e., to build the model I choose an equal number of zeros and ones.

Candidate | Target | Rule-prediction
maintenance | 0 | 1
therapy | 0 | 1
manic episode | 1 | 1
intensity | 0 | 1
episode | 0 | 1
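The balancing step described on this slide (equal numbers of zeros and ones) can be sketched with random downsampling of the majority class. This is an assumed approach using only the standard library; the helper name `balance`, the `target` key and the fixed seed are illustrative choices, not taken from the talk.

```python
import random


def balance(rows, label_key="target", seed=0):
    """Downsample so that positives and negatives are equally represented."""
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    n = min(len(positives), len(negatives))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(balanced)
    return balanced


# The five candidates from the slide: one true positive, four negatives.
rows = [
    {"candidate": "maintenance", "target": 0},
    {"candidate": "therapy", "target": 0},
    {"candidate": "manic episode", "target": 1},
    {"candidate": "intensity", "target": 0},
    {"candidate": "episode", "target": 0},
]
balanced = balance(rows)
print(balanced)
```

With real data the same idea applies per drug or over the whole training set; downsampling discards negatives, so the unused rows can be kept aside for evaluation.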

Page 11: Extracting medical attributes and finding relations

Feature Extraction: Word Vectors, Disease Combinations

adhd + manic episode = bipolar disorder

respiratory disorder + allergy = common cold

coronary artery + heart disease = angina pectoris

high blood pressure + lipid = diabetes_management

Extract Features: Initialize vocabulary with pre-trained vectors

gensim: Train word2vec on medical corpus with unigrams, bi-grams and trigrams

Produce word vectors
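The "disease combinations" above rely on vector arithmetic: add the vectors of two terms and look for the nearest remaining term by cosine similarity. The sketch below shows only that arithmetic, with made-up 3-dimensional vectors chosen for illustration; real vectors would come from a word2vec model trained on a medical corpus, where gensim's `most_similar(positive=[...])` performs a comparable lookup.

```python
import math


def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query, vectors, exclude=()):
    """Return the term whose vector is most similar to the query vector."""
    terms = [t for t in vectors if t not in exclude]
    return max(terms, key=lambda t: cosine(query, vectors[t]))


# Toy vectors, invented for this example only.
vectors = {
    "adhd": [1.0, 0.1, 0.0],
    "manic episode": [0.1, 1.0, 0.0],
    "bipolar disorder": [0.9, 1.0, 0.1],
    "angina pectoris": [0.0, 0.1, 1.0],
    "common cold": [0.2, 0.0, 0.3],
}
query = add(vectors["adhd"], vectors["manic episode"])
result = nearest(query, vectors, exclude=("adhd", "manic episode"))
print(result)
```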

Page 12: Extracting medical attributes and finding relations

Pure Python stack

pandas

scikit-learn

gensim

stanford-nlp-parser

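from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# ItemSelector is not part of scikit-learn; it is assumed to be a user-defined column selector.
# ML_library stands in for the chosen classifier.
pipeline = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            # Pipeline for getting the position of the disease candidate
            ('position', Pipeline([
                ('selector', ItemSelector(column='candidate')),
                ('vect', DictVectorizer()),
            ])),
            # Pipeline for getting words around candidates
            ('words_around', Pipeline([
                ('selector', ItemSelector(column='words_around')),
                ('count', CountVectorizer()),
            ])),
        ])),
    ('clf', ML_library(penalty='l1')),
])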
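The pipeline on this slide depends on an `ItemSelector` transformer that the slide does not define. One plausible way to write it, and to run the whole pipeline end to end, is sketched below. Several things here are assumptions: the `ItemSelector` implementation, the shape of the `candidate` column (a dict with a `position` entry), the toy rows and labels, and the use of `LogisticRegression` in place of the `ML_library` placeholder (logistic regression is one of the models listed later in the talk).

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline


class ItemSelector(TransformerMixin, BaseEstimator):
    """Select one column from a dict of lists (a pandas DataFrame also works)."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        return X[self.column]


pipeline = Pipeline([
    ("union", FeatureUnion(transformer_list=[
        # Position of the disease candidate in the sentence
        ("position", Pipeline([
            ("selector", ItemSelector(column="candidate")),
            ("vect", DictVectorizer()),
        ])),
        # Bag of words around the candidate
        ("words_around", Pipeline([
            ("selector", ItemSelector(column="words_around")),
            ("count", CountVectorizer()),
        ])),
    ])),
    # liblinear is chosen because it supports the L1 penalty
    ("clf", LogisticRegression(penalty="l1", solver="liblinear")),
])

# Toy training rows, invented for illustration.
data = {
    "candidate": [{"position": 0}, {"position": 1}, {"position": 7}, {"position": 11}],
    "words_around": [
        "therapy reduce",
        "maintenance reduce the",
        "of manic and diminish",
        "the of those",
    ],
}
y = [0, 0, 1, 0]

pipeline.fit(data, y)
predictions = pipeline.predict(data)
print(predictions)
```

Keeping each feature family in its own sub-pipeline makes it straightforward to add further features, such as word vectors, as another entry in `transformer_list`.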

Page 13: Extracting medical attributes and finding relations

Data Cleaning and Tokenization

Machine Learning Workflow: Pure Python stack

pandas

scikit-learn

gensim

stanford-nlp-parser

Feature Extraction/Candidate Selection

Create Labelled Data

ML: Logistic Regression, …

Hyperparameter Tuning

Calculate Metrics: precision, recall, ROC curve, etc
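For the metrics step, precision, recall and F1 follow directly from the counts of true positives, false positives and false negatives. The sketch below computes them in plain Python; the label lists are invented for illustration and are not results from the talk.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical targets and model predictions for six candidates.
y_true = [1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

In practice scikit-learn's `precision_score`, `recall_score`, `f1_score` and `roc_auc_score` cover the same ground; the hand-written version only makes the definitions explicit.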

Page 14: Extracting medical attributes and finding relations

Results: Examples

Drug name | Disease candidate | Candidates | ML
Lithium Carbonate | bipolar disorder | 1 | 1
Lithium Carbonate | individual | 1 | 0
Lithium Carbonate | maintenance | 1 | 0
Lithium Carbonate | manic episode | 1 | 1

Page 15: Extracting medical attributes and finding relations

Drug | Candidate | Target | Predict
Silver Sulfadiazine | third degree | 0 | 0
Silver Sulfadiazine | sepsis | 0 | 1
Silver Sulfadiazine | burn | 0 | 1
Silver Sulfadiazine | cream | 0 | 0

Drug | Candidate | Target | Predict
Diltiazem Hydrochloride | spasm | 1 | 0
Diltiazem Hydrochloride | coronary artery | 1 | 0
Diltiazem Hydrochloride | stable angina | 0 | 0
Diltiazem Hydrochloride | angina | 0 | 0

'silver sulfadiazine cream usp 1 % be a topical antimicrobial drug indicate as a adjunct for the prevention and treatment of wound sepsis in patient with second and third degree burn .'

['Diltiazem', 'hydrochloride', 'tablet', 'USP', 'be', 'indicate', 'for', 'the', 'management', 'of', 'chronic', 'stable', 'angina', 'and', 'angina', 'due', 'to', 'coronary', 'artery', 'spasm', '.']

Cases where it does not work

Page 16: Extracting medical attributes and finding relations

Exploring Modeling Technique

Method | Precision | Recall | F1 | ROC Curve
Logistic Regression | 0.95 | 0.95 | 0.95 | 0.92
LR + word2vec | 0.94 | 0.94 | 0.94 | 0.9
SVM | 0.96 | 0.95 | 0.95 | 0.92
Random Forest | 0.96 | 0.96 | 0.96 | 0.9

Page 17: Extracting medical attributes and finding relations

Clinical Trials Data

We present a case of a 10-year-old boy who had severe relapsing pancreatitis three times in two months within 3 weeks after starting treatment with methylphenidate ( ritalin ) due to attention deficit hyperactivity disorder (adhd).

The boy was generally healthy except for that he was newly diagnosed with adhd and started the use of methylphenidate ( ritalin ) for the past three weeks at a dose, of 30 mg daily.

We believe that the number of persons suffering from pancreatitis due to the use of ritalin is more than this published case.

Physicians must pay attention regarding this possible complication and it should be taken into consideration in every patient with abdominal pain who started consuming ritalin.

Page 18: Extracting medical attributes and finding relations

Clinical Trials Data: Labelled Data

Data | Dosage | Drug | Treats Disease | Side Effects | Age | Gender | Ethnicity | Duration
10-year-old | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0
pancreatitis-ritalin | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0
adhd-ritalin | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0
ritalin | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0
30 mg | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
past three weeks | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1
boy | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0

Page 19: Extracting medical attributes and finding relations

Clinical Trials Data: Labelled Data Exist

Data | Dosage | Drug | Treats Disease | Side Effects | Age | Gender | Ethnicity | Duration
10-year-old | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0
pancreatitis-ritalin | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0
adhd-ritalin | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0
ritalin | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0
30 mg | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
past three weeks | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1
boy | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0

Page 20: Extracting medical attributes and finding relations

Creating Labeled Data

Hand-label about 100 examples that contain the specific attribute

Extract Candidates: POS tag and extract unigrams, bigrams and trigrams centered on nouns

Generate rules: automatically create labels that agree with the 100 hand-labelled examples

This process will create a smaller sample (say 5-10%) of data, which can be further crowdsourced to obtain a 100% accurate gold sample

Rule-Based Model: with 95% accuracy

Iterate: Repeat process a few times

Page 21: Extracting medical attributes and finding relations

Example of rules:

Dosage: (1) Sentence contains numbers (2) Distance between numbers and “mg”, “milligrams” < 5 characters (3) Contains the word “dose”

Age: (1) Sentence contains numbers (2) Contains the word “age”, “year-old” within 5 words of the candidate
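Rules like these translate naturally into small labelling functions. The sketch below is one possible reading, not the implementation from the talk: it assumes the numbered conditions must all hold, measures the dosage distance in characters between a number and the unit, and handles only single-token candidates for the age rule. The function names are hypothetical.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")
# A number followed, fewer than 5 characters later, by "mg" or "milligram(s)".
NUMBER_NEAR_MG = re.compile(
    r"\d+(?:\.\d+)?.{0,4}?(?:mg|milligrams?)\b", re.IGNORECASE
)
DOSE_WORD = re.compile(r"\bdose\b", re.IGNORECASE)


def is_dosage_sentence(sentence):
    """Dosage rule: number present, number close to a mg unit, word 'dose'."""
    return (NUMBER.search(sentence) is not None
            and NUMBER_NEAR_MG.search(sentence) is not None
            and DOSE_WORD.search(sentence) is not None)


def is_age_candidate(sentence, candidate, window=5):
    """Age rule: number present and 'age' or 'year-old' near the candidate."""
    if NUMBER.search(sentence) is None:
        return False
    tokens = sentence.lower().split()
    target = candidate.lower()
    for i, token in enumerate(tokens):
        if target not in token:
            continue
        # Look at tokens within `window` words on either side of the candidate.
        for nearby in tokens[max(0, i - window): i + window + 1]:
            if "year-old" in nearby or nearby.strip(".,;:()") == "age":
                return True
    return False


dose_text = "The boy started the use of methylphenidate at a dose, of 30 mg daily."
age_text = "We present a case of a 10-year-old boy who had severe relapsing pancreatitis."

print(is_dosage_sentence(dose_text))
print(is_age_candidate(age_text, "10-year-old"))
print(is_age_candidate(age_text, "pancreatitis"))
```

Labels produced this way are noisy, which is why the process on the previous slide compares them against the hand-labelled examples and iterates on the rules.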

Page 22: Extracting medical attributes and finding relations

Deepdive: Extracting relationships between entities

PDFs, text files, semi-structured JSON; for example, journals available at PubMed and ClinicalTrials.gov

Provide examples of data that need to be extracted

Structured data

Page 23: Extracting medical attributes and finding relations

Deepdive: Prototyping with ddlite

https://github.com/HazyResearch/ddlite

Page 24: Extracting medical attributes and finding relations

Deepdive: Prototyping with ddlite

Page 25: Extracting medical attributes and finding relations

Mind Tagger

Show ipython notebook

Page 26: Extracting medical attributes and finding relations

• NLP relationship extraction with ML techniques is very successful in the presence of gold-labeled data.

• It is very important to invest time and resources in harvesting good training data.

• There is an enormous amount of data in pharma (clinical trials, laboratory notes, doctors' notes, drug manufacturing documents, …). In order to pursue personalized medicine it is important to centralize this data and make joint inferences across all data sets.

Final Remarks

Page 27: Extracting medical attributes and finding relations

Thank You: We are hiring …

blog: https://medium.com/@sangha_deb, @sangha_deb