Extracting Medical Attributes of Published Clinical Trials


Upload: sanghamitra-deb

Post on 13-Apr-2017



Extracting Medical Attributes of Published Clinical Trials: Knowledge Representation of Textual Data

Sanghamitra Deb, [email protected], @sangha_deb

Accenture Technology Laboratory

Acknowledgements: I would like to thank Professor Chris Ré, Jaeho Shin, Jason Fries and Theodoros Rekatsinas for personally assisting me in using DeepDive for knowledge extraction. This work is funded and supported by Accenture Technology Laboratory. I would also like to thank Joshua Neyland and Hyon S Chu for supporting the project.

Motivation

➢ Extracting data from documents manually is a common practice in many domains, such as pharma, legal investigations and financial transactions. We concentrate on clinical trials in this study.

➢ Understanding relationships between attributes is an important part of clinical trials. Doing this by hand takes a skilled researcher several tens of hours. The goal of this research is to automate manual knowledge extraction for clinical trials.

➢ Examples of metadata extraction: “disease studied”, “drug X cures disease Y”, “drug X contains compound Y”, “age group of participants”, “dosage”, “gender”, …

DeepDive

❑ DeepDive creates structure (SQL databases) from unstructured data (text).

❑ It is a data management system that tackles extraction, integration, and prediction problems in a single system.

❑ DeepDive asks the developer to think about features, not algorithms. In DeepDive's joint-inference-based approach, the user only specifies the necessary signals or features.

❑ Users can write simple rules based on domain knowledge that inform the inference (learning) process and provide feedback to improve predictions.

❑ Tedious training for each prediction is not required; distant supervision with a small training set works well.

Currently Used Text Extraction Techniques

Model: Rule based + manual curation
Process: Parse and tokenize; composite rules; filtering rules; manual inspection.
Use case: Semi-structured reports and articles (financial reports, medication labels, etc.). Small number of documents (<100). Medium-level precision is enough.

Model: Supervised ML (logistic regression, SVM, naive Bayes, RF, classification, …)
Process: Transform the business problem into a prediction variable. Generate features (bag of words, n-grams, vectorization, WordNet, …). Get training data. Feed into the ML pipeline.
Use case: Converting unstructured text to structured features and prediction variables is simple. Training data is available. Example: sentiments associated with products from reviews, with training data from ratings.

Model: Unsupervised ML (clustering, topic modeling, word2vec)
Process: The parsed data is used as input to unsupervised techniques. Most unsupervised techniques are used to extract hidden facts in data.
Use case: Training data is not available. Exact precision measurements are not important. Results are coherent themes or synonymous ideas in the corpus.
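The "generate features" step of the supervised approach can be sketched in a few lines. This is only an illustration of bag-of-words and n-gram counting using the Python standard library; the function names `tokenize` and `ngram_features` are hypothetical and not taken from the poster.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_features(text, n_max=2):
    """Count every n-gram of length 1..n_max (bag of words plus bigrams by default)."""
    tokens = tokenize(text)
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


features = ngram_features(
    "Ibuprofen is used for treating pain, fever, and inflammation."
)
```

The resulting counts can be fed to any classifier from the table above once training labels are available.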

Figure: DeepDive pipeline for building the drug-disease knowledge base (panel labels and example data recovered from the poster).

Stages: Input Data (Unstructured Information) -> NLP Parsed Sentences (sentences, POS, NER, etc.) -> Candidate Tagging -> Supervision (Distant Supervision) -> Learning and Inference (Factor Graphs) -> Knowledgebase: Drugs - Disease, with Human Feedback.

Other panel labels: User Created Data Bases, Related Attributes.

Unstructured Information (example sentences):
Ibuprofen, from isobutylphenylpropanoic acid, is a nonsteroidal anti-inflammatory drug (NSAID) used for treating pain, fever, and inflammation.
Methylphenidate, sold under the trade names Ritalin …

Candidates (Rule Based: High Recall):
drug        disease
ibuprofen   pain
ibuprofen   fever
ibuprofen   inflammation
ibuprofen   renal failure
ritalin     cancer
ritalin     ADHD

Training Data from FDA and Manual Curation:
drug        disease
ibuprofen   pain
ibuprofen   fever
Ritalin     ADHD

Knowledgebase: Drugs - Disease (Drug Treats Disease):
drug        disease
Ritalin     ADHD
Tylenol     fever
Aspirin     migraine
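The candidate tagging and distant supervision stages can be illustrated with a small sketch. It is not DeepDive code: the dictionaries `DRUGS`, `DISEASES` and `KNOWN_TREATS` are hypothetical stand-ins for the high-recall rules and the FDA/manually curated training data, and the matching is a plain dictionary lookup.

```python
import re
from itertools import product

# Hypothetical dictionaries standing in for the poster's rule-based taggers.
DRUGS = {"ibuprofen", "ritalin"}
DISEASES = {"pain", "fever", "inflammation", "renal failure", "cancer", "adhd"}

# Hypothetical known-true pairs standing in for the curated training data.
KNOWN_TREATS = {("ibuprofen", "pain"), ("ibuprofen", "fever"), ("ritalin", "adhd")}


def find_mentions(sentence, vocabulary):
    """Return the vocabulary terms that occur as whole words in the sentence."""
    lowered = sentence.lower()
    return sorted(
        term for term in vocabulary
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    )


def candidates(sentence):
    """High-recall candidates: every (drug, disease) pair mentioned together."""
    return list(product(find_mentions(sentence, DRUGS),
                        find_mentions(sentence, DISEASES)))


def distant_label(pair):
    """True if the pair is in the known set, None (unlabeled) otherwise."""
    return True if pair in KNOWN_TREATS else None


sentence = ("Ibuprofen is a nonsteroidal anti-inflammatory drug (NSAID) "
            "used for treating pain, fever, and inflammation.")
labeled = [(pair, distant_label(pair)) for pair in candidates(sentence)]
```

Pairs that the known set does not cover stay unlabeled here; in the pipeline above, those are the ones whose truth value is left to learning and inference.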

Conclusions

➢ We have successfully extracted attributes and the relations between attributes.

➢ DeepDive is well suited for this purpose, since it learns by collecting evidence from the entire corpus and is able to infer complex relationships in the data.

➢ The final result is a structured database created from hundreds of gigabytes of text from journal articles.

Working with DeepDive

Mindbender: browses the input data loaded into DeepDive.

Mindtagger: assists data labeling tasks to quickly assess the precision and/or recall of the extraction.

➢ Extract data from sentences and create user-defined functions (rules, heuristic schemes) to extract mentions of drugs, compounds or diseases.

➢ Extract features from the data set based on domain knowledge and the generic feature set provided by DeepDive.

➢ Write inference rules, weights, calibration and holdout parameters.

➢ Provide feedback from calibration plots and Mindtagger outputs, and repeat the steps above.
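A user-defined feature function for a (drug, disease) mention pair might look like the sketch below. The feature names (`WORD_BETWEEN`, `NUM_WORDS_BETWEEN`, `DRUG_FIRST`) and the function `between_features` are hypothetical illustrations and do not reproduce DeepDive's own generic feature library.

```python
def between_features(tokens, drug_idx, disease_idx):
    """Describe a mention pair by the words that lie between the two mentions.

    tokens is the tokenized sentence; drug_idx and disease_idx are the token
    positions of the drug mention and the disease mention.
    """
    lo, hi = sorted((drug_idx, disease_idx))
    between = tokens[lo + 1:hi]
    feats = [f"WORD_BETWEEN[{word}]" for word in between]
    feats.append(f"NUM_WORDS_BETWEEN[{len(between)}]")
    feats.append("DRUG_FIRST" if drug_idx < disease_idx else "DISEASE_FIRST")
    return feats


tokens = "ibuprofen is used for treating pain".split()
feats = between_features(tokens, drug_idx=0, disease_idx=5)
```

Features of this kind are the "signals" the developer supplies; the weights attached to them are learned rather than set by hand.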

❑ DeepDive is a project led by Christopher Ré at Stanford University. Current group members include: Michael Cafarella, Xiao Cheng, Raphael Hoffman, Dan Iter, Thomas Palomares, Alex Ratner, Theodoros Rekatsinas, Zifei Shan, Jaeho Shin, Feiran Wang, Sen Wu, and Ce Zhang. All materials are available at http://deepdive.stanford.edu/

❑ The Artificial Intelligence group at Accenture Technology Laboratory is collaborating with Professor Chris Ré’s group to incorporate intelligent language understanding to facilitate client delivery. https://www.accenture.com/us-en/accenture-technology-labs-index

References

❑ Ce Zhang. DeepDive: A Data Management System for Automatic Knowledge Base Construction. Ph.D. Dissertation, University of Wisconsin-Madison, 2015.

❑ Sen Wu, Ce Zhang, Christopher De Sa, Jaeho Shin, Feiran Wang, and C. Ré. Incremental Knowledge Base Construction Using DeepDive. VLDB, 2015.

Results

Learning & Inference

drug        disease         related
ibuprofen   pain            T
ibuprofen   fever           T
ibuprofen   inflammation    T
ibuprofen   renal failure   F
ritalin     cancer          F

DeepDive builds a factor graph and performs inference using Gibbs sampling. Joint inference ensures that priors are not accepted as ground truth: the uncertainty of one event influences other events.
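To make the idea concrete, here is a minimal Gibbs sampler on a toy factor graph with two Boolean variables. It is a sketch under stated assumptions, not DeepDive's sampler: the weights `UNARY` and `COUPLING` are made up for illustration, and the graph is small enough that the exact marginals can be computed by enumeration for comparison.

```python
import math
import random

# Hypothetical weights: variable 0 has strong evidence, variable 1 has none.
UNARY = [2.0, 0.0]
# Hypothetical pairwise factor: reward when both variables agree.
COUPLING = 1.5


def score(x):
    """Unnormalized log-probability of an assignment x of two 0/1 values."""
    total = sum(w * xi for w, xi in zip(UNARY, x))
    if x[0] == x[1]:
        total += COUPLING
    return total


def exact_marginals():
    """P(variable = 1) for each variable, by enumerating all four states."""
    states = [(a, b) for a in (0, 1) for b in (0, 1)]
    weights = [math.exp(score(s)) for s in states]
    z = sum(weights)
    return [sum(w for s, w in zip(states, weights) if s[i] == 1) / z
            for i in range(2)]


def gibbs_marginals(n_samples=20000, burn_in=1000, seed=0):
    """Estimate the same marginals by resampling one variable at a time."""
    rng = random.Random(seed)
    x = [0, 0]
    counts = [0, 0]
    for step in range(n_samples + burn_in):
        for i in range(2):
            x[i] = 1
            s1 = score(x)
            x[i] = 0
            s0 = score(x)
            # Conditional probability that variable i is 1 given the other.
            p1 = 1.0 / (1.0 + math.exp(s0 - s1))
            x[i] = 1 if rng.random() < p1 else 0
        if step >= burn_in:
            for i in range(2):
                counts[i] += x[i]
    return [c / n_samples for c in counts]


exact = exact_marginals()
estimated = gibbs_marginals()
```

In this toy graph the second variable has no evidence of its own, yet its marginal rises above 0.5 because it is coupled to the first, which is the sense in which the uncertainty of one event influences another.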

A probability of 0.99 implies there is a 99% chance that the drug and compound are related.
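Whether predicted probabilities can be read this way is what a calibration check on held-out labels tests. The sketch below groups predictions into probability buckets and reports the fraction that are actually true in each bucket; the function `calibration_table` and the `holdout` values are hypothetical, made-up illustrations and not results from this work.

```python
def calibration_table(predictions, n_buckets=5):
    """Group (probability, is_true) pairs into equal-width probability buckets.

    Returns one row per non-empty bucket:
    (bucket_low, bucket_high, number_of_predictions, fraction_true).
    """
    buckets = [[0, 0] for _ in range(n_buckets)]
    for prob, is_true in predictions:
        idx = min(int(prob * n_buckets), n_buckets - 1)
        buckets[idx][0] += 1
        buckets[idx][1] += int(is_true)
    rows = []
    for i, (total, correct) in enumerate(buckets):
        if total:
            rows.append((i / n_buckets, (i + 1) / n_buckets, total, correct / total))
    return rows


# Made-up holdout predictions, for illustration only.
holdout = [
    (0.95, True), (0.99, True), (0.92, True), (0.90, False),
    (0.10, False), (0.15, False), (0.05, True),
    (0.55, True), (0.45, False),
]
rows = calibration_table(holdout)
```

For a well-calibrated model the fraction true in each bucket is close to the bucket's probability range; large gaps point at features or rules that need another iteration.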

Methylphenidate, sold under the trade names Ritalin among others, is a central nervous system (CNS) stimulant of the phenethylamine and piperidine classes that is used in the treatment of attention deficit hyperactivity disorder (ADHD) and narcolepsy. Methylphenidate has been studied and researched for over 50 years and has a very good efficacy and safety record for the treatment of ADHD.

Attribute types annotated in this passage: Brand name (e.g. Ritalin), Compound (e.g. methylphenidate), Disease (e.g. ADHD).