Intro Projects

Social Media Mining in the context of a pharmaceutical company, and other applications

Fabio Rinaldi
SUPSI/IDSIA and University of Zurich, Switzerland

Swiss Institute of Bioinformatics
Fondazione Bruno Kessler, Trento, Italy

February 4, 2020

Projects

• Extraction of Information from the Scientific Literature
  • PsyMine (COGITO)
  • MelanoBase (SNF)
  • Author Name Disambiguation (Roche)

• Clinical records: SwissMADE (SNF/NRP74)

• Social Media: MedMon (InnoSuisse)

• Veterinary reports: collaboration with VetSuisse

• Assisted curation (NIH), collaboration with RegulonDB

• Tools and Resources: BioTermHub, OGER

http://www.ontogene.org/

PsyMine

From disorders to etiological factors

Creation of a reference corpus

MelanoBase

• Most serious type of skin cancer

• Develops from the pigment-containing cells (melanocytes)

• Primary cause is UV exposure

MelanoBase

Author name disambiguation

SwissMADE: The challenge of clinical text

[http://carecentra.com/clinical-notes-mining/]

• SwissMADE (Monitoring of Adverse Drug Event)

• older patients (aged ≥ 65 years)

• antithrombotic drugs

• using structured and unstructured parts of the EHRs

• involves five hospitals

MedMon

• Bring patient insights into the lifecycle of pharmaceutical products: from development to surveillance (pharmacovigilance)

• Mining the web and social networks for mentions of Adverse Drug Reactions

• Collaboration with a major pharma company and another Swiss university

VetMine

Assisted curation

• The OntoGene/BioMeXT group has been active in assisted curation since 2010 with the SASEBio project (Semi-Automated Semantic Enrichment of the Biomedical Literature).

• Since 2013 we have been collaborating with the RegulonDB database in a project aimed at testing and gradually introducing assisted curation techniques into their curation pipeline.

• RegulonDB is a database of the regulatory network of Escherichia coli K-12.

Example

We additionally found that expression of the mntP gene is upregulated by manganese through MntR.

• Given: MntR [+] mntP

• To identify: condition [manganese]

OxyR experiment

• TOPIC: oxidative stress by OxyR

• CORPUS: 46 papers, curated in RegDB

• METHODS: automated annotations of entities via OntoGene, selection of sentences via ODIN filters, manual validation

• RESULTS: 100% of RIs retrieved, including TF, EFFECT and their TG

• Identified the growth conditions for 15 of the 20 RIs of OxyR by checking only a limited set of sentences (about 10% of each article is read)

[?]

MedMon - Monitoring Social Media Content for Patient-related Information: Approaches and Challenges

Tilia Ellendorff, Fabio Rinaldi

Universität Zürich
Institut für Computerlinguistik

February 4, 2020

Introduction

The MedMon Project: Monitoring of internet resources for pharmaceutical research and development

Innosuisse project together with a big pharmaceutical company and researchers from the University of Applied Sciences of the Grisons

Main motivations and goals:

• Discovery of unmet medical needs of rare disease patients

• Assessment of patients’ perception of their specific disease burden

• Early detection and tracking of epidemic outbreaks

• Optimization of patient recruitment for clinical studies (location, age groups, etc.)

• Monitoring of conversations on clinical trials and treatments


MedMon: Data Sources

Use-cases:
• Parkinson's
• Multiple Sclerosis
• Angelman's
• Viral Infections

Data Access via MedMonPortal:
• Twitter
• Reddit
• Patient Forums (e.g. parkinson.forum.org)

Languages:
• English, German, French, Spanish (but focus on English)


MedMon Processing Pipeline


Data Retrieval

Retrieval of relevant micro-posts from different data sources
• Twitter
• Reddit
• Medical Forums

So far: retrieval by keyword search: “Parkinson’s Disease”, “PD”, “Multiple Sclerosis”, “MS”

Future: disease-agnostic retrieval of micro-posts
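
The keyword retrieval described above could look roughly like the following sketch. The function name `retrieve` and the matching rules (full names matched case-insensitively, abbreviations matched case-sensitively as whole words) are assumptions made for illustration; the slides do not say how the MedMon portal performs the search.

```python
import re

# Keywords taken from the slide. Full disease names are matched
# case-insensitively; the abbreviations "PD" and "MS" are matched
# case-sensitively so that "ms." or "pd" in running text is ignored.
# This split is an assumption, not something the slides specify.
KEYWORD_PATTERN = re.compile(
    r"\b(?:(?i:parkinson['’]s disease|multiple sclerosis)|PD|MS)\b"
)


def retrieve(posts):
    """Return the posts that contain at least one disease keyword."""
    return [post for post in posts if KEYWORD_PATTERN.search(post)]


if __name__ == "__main__":
    sample = [
        "I was diagnosed with MS last year",
        "ms. smith called again",
        "Living with Parkinson's disease is hard",
    ]
    for post in retrieve(sample):
        print(post)
```

Even with word boundaries, short abbreviations such as "MS" remain ambiguous, which is one reason a later relevance filter is needed.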


Pre-processing and Removal of Duplicates

Domain-specific data pre-processing

Aims: reduce sparsity; remove bias introduced by duplicates

Examples:

• User names and numbers are replaced with placeholders
  “@Jah423” → “@user”, “70 kids” → “NUMBER kids”
• URLs are truncated to their domain names
  “http://tinyurl.com/yereux6” → “http://tinyurl.com”
• Hash symbols are stripped from hash tags
  “#flu” → “flu”
• Camel-cased expressions are split into their component words
  “SideEffects” → “Side Effects”
• Frequent colloquial abbreviations are resolved to their full version
  “w/” → “with”
• Repetitions of letters (> 3) are replaced with a single or double letter
  “greaaaaat” → “great”, “freeeeeze” → “freeze”

Identification and removal of duplicate micro-posts
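
A minimal sketch of these normalisation rules, using only the Python standard library. All names here (`normalize`, `deduplicate`, `ABBREVIATIONS`, `VOCABULARY`) are hypothetical, and the tiny vocabulary used to choose between a single and a double letter is an assumption; the slides do not state how that choice is made in MedMon.

```python
import re
from urllib.parse import urlsplit

# Hypothetical resources: a real system would use much larger lists.
ABBREVIATIONS = {"w/": "with"}
VOCABULARY = {"freeze"}  # words that legitimately keep a double letter

REPEAT = re.compile(r"([A-Za-z])\1{3,}")  # a letter repeated more than 3 times


def _truncate_url(match):
    parts = urlsplit(match.group(0))
    return f"{parts.scheme}://{parts.netloc}"


def _squeeze(match):
    word = match.group(0)
    double = REPEAT.sub(r"\1\1", word)
    # Keep the double letter only if that yields a known word.
    return double if double.lower() in VOCABULARY else REPEAT.sub(r"\1", word)


def normalize(post):
    post = re.sub(r"https?://\S+", _truncate_url, post)  # URL -> domain
    post = re.sub(r"@\w+", "@user", post)                # user names
    post = re.sub(r"#(\w+)", r"\1", post)                # strip hash symbol
    post = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", post)     # split camel case
    post = re.sub(r"\b\d+(?:[.,]\d+)*\b", "NUMBER", post)
    for short, full in ABBREVIATIONS.items():
        post = re.sub(r"(?<!\S)" + re.escape(short) + r"(?!\S)", full, post)
    return re.sub(r"[A-Za-z]+", _squeeze, post)          # letter repetitions


def deduplicate(posts):
    """Keep the first post for each normalised, lower-cased text."""
    seen, unique = set(), []
    for post in posts:
        key = normalize(post).lower()
        if key not in seen:
            seen.add(key)
            unique.append(post)
    return unique
```

Comparing normalised text catches exact and near-exact reposts only; detecting paraphrased duplicates would need a similarity measure, which this sketch does not attempt.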

Spam-filtering and Filtering by Disease Mentions

Identification of micro-post relevance

• Does it mention a disease?
  “Having a blast at Takin’ the Mic tonight with Karl Parkinson and Caelainn Hogan @CaelainnH https://t.co/L2jr4y5HSG.”

• Spam vs. relevant content
  “Seratopicin Healing Pain Relief this cream incorporates all-natural, vegan components. [...] This product can be used for maximum of your commonplace muscle and joint pains.”


Personal Health-Mention Identification

Identification of Personal Health Mentions (PHM)
• Is the micro-post about a specific patient?
• Is the micro-post authored by the patient, by a relative or a health-care provider?

Participation in Social Media Mining for Health Shared Task (SMM4H 2019)

• Task 4: Generalizable identification of personal health experience mentions
• Focus of shared task: generalize from two given health contexts/health conditions (influenza vaccination and infection) in the training set to three unknown health contexts in the test set


SMM4H 2019: Our Best-Performing Approach to PHM Identification

BERT Classifier: pre-trained Bidirectional Encoder Representations from Transformers, with a sequence classification head (tuned to the task)

Merging of all model parameters into a single model
• Average: pointwise average of all model parameters
• Weighted Average: models are weighted by their performance on the development fold (measured as F-score and transformed by Softmax).


SMM4H 2019: Our Best-Performing Approach to PHM Identification

Weighted average of model parameters:
• 1 x Merged Model from models trained on vaccination
• 9 x Merged Model from models trained on infection
  (Sub-models were averaged)
• all models trained for 4 epochs
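
The parameter merging can be illustrated with a small, framework-free sketch. Here a "model" is simply a dictionary from a parameter name to a flat list of floats; the helper names `softmax` and `merge_models` and the toy values are assumptions for illustration, not the code used for the shared task.

```python
import math


def softmax(scores):
    """Turn development-fold F-scores into weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def merge_models(models, weights=None):
    """Pointwise (optionally weighted) average of model parameters.

    Every model must have the same parameter names and shapes.
    With weights=None this is the plain average.
    """
    if weights is None:
        weights = [1.0 / len(models)] * len(models)
    merged = {}
    for name in models[0]:
        size = len(models[0][name])
        merged[name] = [
            sum(w * m[name][j] for w, m in zip(weights, models))
            for j in range(size)
        ]
    return merged


if __name__ == "__main__":
    # Toy stand-ins for the two merged sub-models.
    vaccination = {"w": [0.0, 2.0]}
    infection = {"w": [1.0, 4.0]}
    # 1 x vaccination + 9 x infection, normalised to weights 0.1 and 0.9.
    print(merge_models([vaccination, infection], weights=[0.1, 0.9]))
    # Weights derived from hypothetical development F-scores.
    print(softmax([0.80, 0.85]))
```

With a real BERT classifier the same arithmetic would be applied tensor by tensor to the models' parameter dictionaries.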


SMM4H 2019: Results

Results Task 4 (PHM Identification)

Identification of Disease Mentions and Symptoms

Span-level annotation of Diseases/Disorders/Health Conditions (+ normalization)

• Recognition of layman terminology
  “Angelman Syndrome”, “Angelmans”, “Angel”, “Angels”, “AS”

Span-level recognition of Symptoms (+ normalization)

• Recognition of layman terminology
• Recognition of multi-word expressions
  “I’ve been having troubles falling asleep at night”
• Distinction from adverse drug reaction mentions (ADRs)
  “One woman new to our support group never had a tremor until recently when she was given pain killer containing epinephrine and she has not stopped having tremors since that dental appointment.”
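
As a rough illustration of normalising layman variants to one concept, the sketch below maps the surface forms listed above to "Angelman Syndrome". The function name `find_mentions` and the decision to match the abbreviation "AS" case-sensitively are assumptions; the slides do not describe the actual recogniser.

```python
import re

CONCEPT = "Angelman Syndrome"

# Surface variants from the slide. Longer alternatives come first so that
# "Angelmans" is not cut short to "Angel". "AS" is matched case-sensitively
# (an assumption) to avoid matching the ordinary word "as".
VARIANT_PATTERN = re.compile(
    r"\b(?:(?i:angelman syndrome|angelmans|angels|angel)|AS)\b"
)


def find_mentions(post):
    """Return (surface form, normalised concept) pairs found in a post."""
    return [(m.group(0), CONCEPT) for m in VARIANT_PATTERN.finditer(post)]


if __name__ == "__main__":
    print(find_mentions("My son has Angelmans and goes to an AS clinic"))
```

Variants such as "Angel" are highly ambiguous in general text, so a lookup like this only makes sense on posts already known to come from the relevant patient community.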


Identification of Patients’ Attributes

Span-level recognition of Medication Intake

Span-level recognition of Patients’ Attributes

• Location: where is the patient located
• Age: age at post and current age; estimated year of birth
• Gender: female vs. male vs. third gender

Currently: dictionary look-up and regular expressions at micro-post level

Future: patients’ attributes and medication intake should be recognized in relation to mention of patient
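
A hedged sketch of what post-level dictionary look-up and regular expressions for age and gender might look like. The pattern, the small term dictionary, and the function names are invented for illustration and cover only two gender values; they are not the MedMon resources.

```python
import re

# Hypothetical age pattern: "67 years old", "52 yo", "52 y/o".
AGE_PATTERN = re.compile(r"\b(\d{1,3})\s*(?:years?\s+old|y/?o)\b", re.IGNORECASE)

# Hypothetical gender cue dictionary (deliberately tiny).
GENDER_TERMS = {
    "she": "female", "her": "female", "woman": "female", "wife": "female",
    "he": "male", "his": "male", "man": "male", "husband": "male",
}


def extract_age(post):
    """Return the first age mentioned in the post, or None."""
    match = AGE_PATTERN.search(post)
    return int(match.group(1)) if match else None


def estimate_birth_year(age, post_year):
    """Estimated year of birth from the age at post time."""
    return post_year - age


def extract_gender(post):
    """Return the gender of the first cue word found, or None."""
    for token in re.findall(r"[a-z]+", post.lower()):
        if token in GENDER_TERMS:
            return GENDER_TERMS[token]
    return None


if __name__ == "__main__":
    post = "My husband is 67 years old and was diagnosed last year"
    age = extract_age(post)
    print(age, estimate_birth_year(age, 2020), extract_gender(post))
```

Because the look-up works on the whole post, an age or gender cue may belong to someone other than the patient, which is the limitation the "Future" point addresses.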


Conclusions: (Current) Challenges

• Disease-agnostic retrieval of micro-posts
• Identification of micro-post relevance: filtering of spam, identification of personal health mentions
• Span-level recognition of Symptom Mentions: Symptoms vs. ADRs
• Span-level recognition of Patients’ Attributes and Medication (Intake): recognition in relation to patient


Sequence Tagging for Concept Recognition
OntoGene Tools and CRAFT shared task 2019

Lenz Furrer, Joseph Cornelius, Fabio Rinaldi

Institute of Computational Linguistics
University of Zurich

February 4, 2020

The CRAFT Shared Task 2019

CRAFT: the Colorado Richly Annotated Full-Text corpus
• 67 articles for training (released 2012)
• 30 articles for testing (released now)
• annotated with bio entities, dependencies and coreferences
• one sub-task each in the competition

Task (CA)
• Named Entity Recognition (NER) and Normalisation (NEN)
• 10 entity types, plain + extended
  → 20 separate evaluations

Team: Lenz Furrer, Joseph Cornelius

Traditional Approach to NER and NEN: Pipeline

A common human skin tumour is caused by activating mutations in beta-catenin.

NER (one tag per token):
O O O B-dis E-dis O O O O O O S-chem

NEN (look up the recognised span "skin tumour" in the terminology):
...
MESH:D012876 Skin Disease, Parasitic
MESH:D012877 Manifestation, Skin
MESH:D012877 Skin Manifestation
MESH:D012878 Neoplasm, Skin
MESH:D012878 Skin Neoplasm
MESH:D012878 Cancer of the Skin
MESH:D012878 Cancer, Skin
MESH:D012878 Skin Cancer
MESH:D012883 Skin Ulcer
MESH:D012883 Ulcer, Skin
MESH:D012887 Fractures, Skull
...
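
The two pipeline steps can be sketched as follows: first decode the BIOES tag sequence into spans, then look each span up in a terminology. The tokens, tags and MeSH names come from the slide; the function names and the exact-match look-up are assumptions made for illustration.

```python
TOKENS = ("A common human skin tumour is caused by activating "
          "mutations in beta-catenin").split()
TAGS = ["O", "O", "O", "B-dis", "E-dis", "O", "O", "O", "O", "O", "O", "S-chem"]

# A few of the MeSH names shown on the slide, lower-cased for look-up.
TERMINOLOGY = {
    "skin manifestation": "MESH:D012877",
    "skin neoplasm": "MESH:D012878",
    "cancer of the skin": "MESH:D012878",
    "skin cancer": "MESH:D012878",
    "skin ulcer": "MESH:D012883",
}


def decode_bioes(tags):
    """Convert BIOES tags into (label, start, end) spans, end exclusive."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "O":
            start = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "S":                      # single-token entity
            spans.append((label, i, i + 1))
            start = None
        elif prefix == "B":                    # entity begins
            start = i
        elif prefix == "E" and start is not None:
            spans.append((label, start, i + 1))
            start = None
        # prefix "I" simply continues the open span
    return spans


def normalize_span(text):
    """Exact, case-insensitive dictionary look-up; None if not found."""
    return TERMINOLOGY.get(text.lower())


if __name__ == "__main__":
    for label, start, end in decode_bioes(TAGS):
        surface = " ".join(TOKENS[start:end])
        print(label, surface, normalize_span(surface))
```

The exact look-up finds nothing for "skin tumour", although the intended concept is among the listed MeSH names; this gap between surface forms and terminology entries, together with error propagation from the NER step, is what the later slides try to address.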

NEN: Sequence-to-Sequence

Negacy Degefa Hailu (2019). “Investigation of traditional and deep neural sequence models for biomedical concept recognition”. PhD thesis. University of Colorado at Denver, Anschutz Medical Campus

Joint Tackling of NER and NEN

• avoid error propagation
• mutual benefit of feedback across tasks

[Figure: two diagrams. Pipeline: NER feeds into NEN. Joint training: NER and NEN are trained together.]

Multi-Task Learning for Joint Neural NER+NEN

[Figure: multi-task tagger. Input: "Takotsubo syndrome secondary to Zolmitriptan", passed through an embedding layer (word + character). Per-token outputs (MER tag / MEN tag): Takotsubo → B-DISEASE / D054549-P; syndrome → I-DISEASE / D054549-P; secondary → O / NULL; to → O / NULL; Zolmitriptan → B-CHEMICAL / C089750. The matrices U and V carry the feedback between the two outputs.]

Figure 3: The main architecture of our neural multi-task learning model with two explicit feedback strategies for MER and MEN. The character embedding is computed by CNN in Figure 2. Then the character representation vector is concatenated with the word embedding before feeding into the Bi-LSTM. Dashed arrows from the left to the right is the feedback from MER to MEN. Dashed arrows from the right to the left is the feedback from MEN to MER. Orange arrows indicate dropout layers applied on both the input and output vectors of Bi-LSTM.

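For a k-layer Bi-LSTM tagger for MER and MEN we get:

\[
\begin{aligned}
\mathrm{MER}(w_{1:n}, i) &= y^{i}_{\mathrm{MER}} = \arg\max \hat{y}^{i}_{\mathrm{MER}} \\
\hat{y}^{i}_{\mathrm{MER}} &= f_{\mathrm{MER}}(v^{k}_{i}) \\
\mathrm{MEN}(w_{1:n}, i) &= y^{i}_{\mathrm{MEN}} = \arg\max \hat{y}^{i}_{\mathrm{MEN}} \\
\hat{y}^{i}_{\mathrm{MEN}} &= f_{\mathrm{MEN}}(v^{k}_{i}) \\
v^{k}_{i} &= F^{k}_{\theta}(x_{1:n}, i) \\
x_{1:n} &= E(w_1), E(w_2), \ldots, E(w_n)
\end{aligned}
\]

where E as an embedding function mapping each word in the vocabulary into a d-dimensional vector, \(\hat{y}^{i}_{\mathrm{MER}}\) is the log-probabilities vector with the length of MER tag space, \(y^{i}_{\mathrm{MER}}\) is the output tag of MER, \(\hat{y}^{i}_{\mathrm{MEN}}\) is the log-probabilities vector with the length of MEN tag space, \(y^{i}_{\mathrm{MEN}}\) is the output tag of MEN, and \(v^{k}_{i}\) is the output of the kth Bi-LSTM layer as defined above. All the parameters are trained separately for MER and MEN because we model MER and MEN as different sequence labeling tasks.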

Multi-task Mode with Explicit Feedback Strategies

The dependencies between MER and MEN inspire us to explore their potential mutual benefits. In order to make the most of the mutual benefits between MER and MEN, we propose to feed the above mentioned Bi-LSTM and its variants into multi-task learning framework with two explicit feedback strategies, as shown in Figure 3. This method (1) is able to convert hierarchical tasks into parallel multi-task mode while maintaining mutual supports between tasks; (2) benefits from general representations of both tasks provided by multi-task learning; (3) is effective in determining boundaries of medical named entities through explicit feedback strategies thus improves the performance of both MER and MEN.

We experiment with a multi-task learning architecture based on stacked Bi-LSTM, CNNs and CRF. Multi-task learning can be seen as a way of regularizing model induction by sharing representations with other inductions.

We use stacked Bi-LSTM-CNNs-CRF with task supervision from multiple tasks, sharing Bi-LSTM-CNNs layers among the tasks.

MER and MEN are hierarchical tasks and their outputs potentially have mutual benefits for each other as well. It means MEN can take MER results as input, while the results of MEN can be also useful for MER. However, MER and MEN can be implemented independently as different sequence tagging tasks. Therefore, we 1) follow the popular strategy of multi-task learning to share representations between MER and MEN; and 2) propose to use mutual feedback between MER and MEN, i.e., the result of MER is fed into the MEN as part of the input and the result of MEN is fed into the MER as part of the input. The multi-task learning with two explicit feedback strategies for MER and MEN is defined as:

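\[
\begin{aligned}
\mathrm{MER}(w_{1:n}, i) &= y^{i}_{\mathrm{MER}} = \arg\max \hat{y}^{i}_{\mathrm{MER}} \\
\hat{y}^{i}_{\mathrm{MER}} &= f_{\mathrm{MER}}(v^{\mathrm{MER}}_{i}) \\
\mathrm{MEN}(w_{1:n}, i) &= y^{i}_{\mathrm{MEN}} = \arg\max \hat{y}^{i}_{\mathrm{MEN}} \\
\hat{y}^{i}_{\mathrm{MEN}} &= f_{\mathrm{MEN}}(v^{\mathrm{MEN}}_{i}) \\
v^{\mathrm{MER}}_{i} &= v^{k}_{i} \circ (v^{k}_{i} + \hat{y}^{i}_{\mathrm{MEN}} U) \\
v^{\mathrm{MEN}}_{i} &= v^{k}_{i} \circ (v^{k}_{i} + \hat{y}^{i}_{\mathrm{MER}} V) \\
v^{k}_{i} &= F^{k}_{\theta}(x_{1:n}, i) \\
x_{1:n} &= E(w_1), E(w_2), \ldots, E(w_n)
\end{aligned}
\]

where \(f_{\mathrm{MER}}(v^{\mathrm{MER}}_{i})\) is the MER multi-class classification function and \(f_{\mathrm{MEN}}(v^{\mathrm{MEN}}_{i})\) the MEN multi-class classification function. \(v^{\mathrm{MER}}_{i}\) is the input of MER multi-class classification function, which combines the output of the shared stacked Bi-LSTM-CNNs and the explicit feedback from MEN. \(v^{\mathrm{MEN}}_{i}\) is the input of MEN multi-class classification function, which combines the output of the shared Bi-LSTM-CNNs and the explicit feedback from MER. U is the matrix to map the feedback from MEN to MER, V maps the feedback from MER to MEN. You can consider [...]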

Sendong Zhao et al. (2019). “A Neural Multi-Task Learning Framework to Jointly Model Medical Named Entity Recognition and Normalization”. In: Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), pp. 817–824. DOI: 10.1609/aaai.v33i01.3301817

https://doi.org/10.1609/aaai.v33i01.3301817

BiLSTM with Synchronous NER and NEN

[Figure: BiLSTM tagger that predicts NER and NEN labels synchronously for each token. Layers, bottom to top: embedding, BiLSTM, NER, NEN. Example (token → NER tag / NEN label): RanBP2 → S / 13712; modulates → O / NIL; Hexokinase → B / 8608; I → E / 8608; activities → O / NIL.]

Fine-Tuning BioBERT

BioBERT

[Figure: BERT further pre-trained on PubMed (1M) gives BioBERT. Input tokens [CLS], Tok1, Tok2, ..., TokN are embedded as E[CLS], E1, E2, ..., EN and encoded as C, T1, T2, ..., TN; each token output is tagged, e.g. NIL, CHEBI:8608, NIL.]

Train two independent models:
• span detector (NER)
• ID tagger (NEN)

Unseen IDs: Pretraining (BiLSTM)

Pretraining
• pretrain on ontology names
  → top 1000 concepts only
  → copy output label to feature input
  → 20 epochs
• then continue training on corpus sentences (6-fold cross-validation, early stopping)

Unseen IDs: Back-Off (BERT)

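[Figure: back-off example for the sentence "RanBP2 modulates Hexokinase I activities". Rows: IDs (ID tagger output), spans (span tagger output, e.g. S, B, E, O), OGER (dictionary annotations), predictions (combined output). Identifiers shown include PR:13712, PR:8608 and PR:29546, with NIL for tokens without a concept.]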
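
One plausible reading of the back-off idea is sketched below: trust the ID tagger where it predicts a concept, and otherwise fall back to OGER's identifier wherever the span tagger sees an entity. This rule, the function name `back_off`, and the per-token inputs are assumptions for illustration only; the slide's figure does not spell out the exact combination logic.

```python
NIL = "NIL"


def back_off(id_tags, span_tags, oger_ids):
    """Combine three per-token predictions into one ID sequence.

    id_tags   -- concept IDs from the ID tagger (NIL if none)
    span_tags -- BIOES tags from the span tagger ("O" if outside)
    oger_ids  -- concept IDs from the dictionary annotator (NIL if none)
    """
    merged = []
    for id_tag, span_tag, oger_id in zip(id_tags, span_tags, oger_ids):
        if id_tag != NIL:
            merged.append(id_tag)        # ID tagger is confident
        elif span_tag != "O" and oger_id != NIL:
            merged.append(oger_id)       # back off to the dictionary
        else:
            merged.append(NIL)
    return merged


if __name__ == "__main__":
    # Hypothetical predictions for:
    # RanBP2 modulates Hexokinase I activities
    id_tags = ["NIL", "NIL", "PR:8608", "PR:8608", "NIL"]
    span_tags = ["S", "O", "B", "E", "O"]
    oger_ids = ["PR:13712", "NIL", "PR:8608", "PR:8608", "NIL"]
    print(back_off(id_tags, span_tags, oger_ids))
```

The appeal of such a scheme is that a dictionary can supply identifiers that never occurred in the training data, which a tagger with a fixed label set cannot predict.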

OGER

[http://www.ontogene.org/resources/oger]

OGER: annotation service

The OntoGene’s Biomedical Entity Recogniser (OGER)
• RESTful web service, using BTH terminologies
• Allows annotation of a collection of documents.
• Evaluated in the Bio Text Mining services challenge BioCreative/TIPS
• best results according to several of the evaluation metrics.

[http://www.ontogene.org/resources/oger]

BioCreative V.5 / TIPS

Bio Term Hub

The Bio Term Hub is an aggregator of biomedical terminologies, which is kept up-to-date by automatically integrating content from manually curated databases.
[http://www.ontogene.org/resources/termdb]

Bio Term Hub

[http://www.ontogene.org/resources/termdb]

Metrics

Slot Error Rate (Bossy et al. 2013)
• count matches (M), insertions (I), deletions (D), substitutions (S)
  → substitutions: penalty for incorrect boundaries and distance to correct ontology entry
  → find optimal alignment

\[ \mathrm{SER} = \frac{S + I + D}{N} \]

Precision/Recall/F1
• count matches (M) and partial matches (M_p = 1 − S)

\[ \mathrm{Recall} = \frac{M + M_p}{N}, \qquad \mathrm{Precision} = \frac{M + M_p}{P} \]

N: # of ground-truth annotations, P: # of predicted annotations
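
The formulas can be made concrete with a small sketch. It assumes that the alignment has already been computed, that each substitution carries a penalty between 0 and 1, and that a substitution with penalty s earns partial-match credit 1 − s; these are interpretations of the slide, and the function name and example numbers are invented.

```python
def evaluate(matches, substitution_penalties, insertions, deletions):
    """Compute SER, precision, recall and F1 from alignment counts.

    matches                -- number of exact matches (M)
    substitution_penalties -- one penalty in [0, 1] per substitution
    insertions             -- predictions with no ground-truth partner (I)
    deletions              -- ground-truth annotations that were missed (D)
    Assumes at least one ground-truth and one predicted annotation.
    """
    n_subs = len(substitution_penalties)
    n_gold = matches + n_subs + deletions    # N: ground-truth annotations
    n_pred = matches + n_subs + insertions   # P: predicted annotations
    s_total = sum(substitution_penalties)    # S: summed penalties
    partial = n_subs - s_total               # M_p: summed (1 - penalty)

    ser = (s_total + insertions + deletions) / n_gold
    recall = (matches + partial) / n_gold
    precision = (matches + partial) / n_pred
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"SER": ser, "precision": precision, "recall": recall, "F1": f1}


if __name__ == "__main__":
    # 6 exact matches, 2 substitutions with penalty 0.5 each,
    # 1 insertion and 2 deletions.
    print(evaluate(6, [0.5, 0.5], insertions=1, deletions=2))
```

Unlike F1, SER is an error measure: lower is better, and it can exceed 1 when a system produces many insertions.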

Competing Systems

• Baseline: plain dictionary-based system (OGER)

• BiLSTM
  • 6-fold CV, early stopping
  • + pretraining
  • + multiple runs per fold, pick best

• BioBERT fine-tuned for 55 epochs
  • ID tagger
  • span tagger combined with OGER
  • combination of the previous two (back-off)

Results: Legend

[Figure legend. Axes: F1 and SER, each ranging from 0.0 to 1.0. Systems shown:]
• OGER (baseline)
• BiLSTM
• BiLSTM, pretrained
• BiLSTM, pretrained, pick-best
• BERT-IDs
• BERT-spans+OGER
• BERT-IDs+BERT-spans+OGER

Results

[Figure: F1 and SER (each 0.0 to 1.0) per system for every annotation set. Plain sets: CHEBI, CL, GO_BP, GO_CC, GO_MF, MOP, NCBITaxon, PR, SO, UBERON. Extended sets: CHEBI_EXT, CL_EXT, GO_BP_EXT, GO_CC_EXT, GO_MF_EXT, MOP_EXT, NCBITaxon_EXT, PR_EXT, SO_EXT, UBERON_EXT.]

Examples: Correct Unseen IDs

BERT-IDs+BERT-spans+OGER predicts CHEBI_PR_EXT:somatostatin (twice):
However, the somatostatin receptor 2 (SSTR-2) antagonist PRL-2903 does not interfere with the ability of glucose (at 3 and 7 mM) to inhibit glucagon secretion from mouse islets [47].

BERT-IDs+BERT-spans+OGER predicts CHEBI:60004:
Adult mouse testes were homogenized in a buffer containing 20 mM Tris, pH 7.5, 100 mM KCl, 5 mM MgCl2, 0.3% NP-40, 40 U/ml of Rnasin ribonuclase inhibitor (Promega, Madison, WI), and a mixture of 10 protease inhibitors provided [...]

BiLSTM pick-best predicts PR:000008373:
Decreased Osteogenic Differentiation Correlates with Abnormal Distribution of Cx43

Unseen IDs: Precision/Recall

(each system cell: precision / recall)

                unique   occ.   OGER          pretraining   BERT spans    BERT back-off
CHEBI              110    447   0.33 / 0.65   1.00 / 0.00   0.74 / 0.47   0.70 / 0.11
CHEBI_EXT          134    538   0.37 / 0.71   1.00 / 0.00   0.62 / 0.49   0.76 / 0.09
CL                  52    484   0.72 / 0.31   1.00 / 0.00   0.88 / 0.22   0.59 / 0.04
CL_EXT              52    484   0.72 / 0.31   1.00 / 0.00   0.71 / 0.25   0.71 / 0.11
GO_BP              120    484   0.21 / 0.25   1.00 / 0.00   0.56 / 0.12   0.66 / 0.06
GO_BP_EXT          126    508   0.22 / 0.28   1.00 / 0.00   0.29 / 0.18   0.62 / 0.07
GO_CC               32    184   0.19 / 0.35   1.00 / 0.00   0.50 / 0.17   0.49 / 0.06
GO_CC_EXT           36    231   0.28 / 0.47   1.00 / 0.00   0.58 / 0.19   0.60 / 0.07
GO_MF                1      1   0.10 / 0.50   1.00 / 0.00   1.00 / 0.00   1.00 / 0.00
GO_MF_EXT           73    416   0.38 / 0.22   1.00 / 0.00   0.57 / 0.15   0.54 / 0.04
NCBITaxon           40     87   0.02 / 0.50   1.00 / 0.00   0.40 / 0.34   0.75 / 0.22
NCBITaxon_EXT       44     95   0.02 / 0.54   1.00 / 0.00   0.43 / 0.35   0.85 / 0.25
PR                 278   4782   0.26 / 0.86   0.63 / 0.00   0.81 / 0.74   0.69 / 0.15
PR_EXT             309   5156   0.27 / 0.84   0.34 / 0.01   0.84 / 0.73   0.65 / 0.20
SO                  16    101   0.04 / 0.87   1.00 / 0.00   0.10 / 0.06   0.52 / 0.02
SO_EXT              25    123   0.05 / 0.78   1.00 / 0.00   0.28 / 0.47   0.85 / 0.41
UBERON             203   1297   0.47 / 0.33   0.69 / 0.00   0.74 / 0.25   0.59 / 0.06
UBERON_EXT         207   1308   0.47 / 0.33   0.87 / 0.00   0.78 / 0.27   0.60 / 0.06

Conclusions

• Solid, easy-to-use, efficient dictionary-based solution with constantly up-to-date resources

• Bio Term Hub: a one-stop site for obtaining up-to-date biomedical terminological resources. http://www.ontogene.org/resources/termdb

• OGER: an efficient text annotation tool using the BTH terminologies. Provides spans and IDs (NER and CR). http://www.ontogene.org/resources/oger

• Integration with state-of-the-art disambiguation approaches for specificapplications

• Applications over several text types: literature, clinical records, social media

http://www.ontogene.org/
https://github.com/OntoGene/craft-st
