Intro Projects
Social Media Mining in the context of a pharmaceutical company, and other applications
Fabio Rinaldi
SUPSI/IDSIA and University of Zurich, Switzerland
Swiss Institute of Bioinformatics
Fondazione Bruno Kessler, Trento, Italy
February 4, 2020
Projects
• Extraction of Information from the Scientific Literature
  • PsyMine (COGITO)
  • MelanoBase (SNF)
  • Author Name Disambiguation (Roche)
• Clinical records: SwissMADE (SNF/NRP74)
• Social Media: MedMon (InnoSuisse)
• Veterinary reports: collaboration with VetSuisse
• Assisted curation (NIH), collaboration with RegulonDB
• Tools and Resources: BioTermHub, OGER
http://www.ontogene.org/
PsyMine
From disorders to etiological factors
Creation of a reference corpus
MelanoBase
• Most serious type of skin cancer
• Develops from the pigment-containing cells (melanocytes)
• Primary cause is UV exposure
Author name disambiguation
SwissMADE: The challenge of clinical text
[http://carecentra.com/clinical-notes-mining/]
• SwissMADE (Monitoring of Adverse Drug Events)
• older patients (aged ≥ 65 years)
• antithrombotic drugs
• using structured and unstructured parts of the EHRs
• involves five hospitals
MedMon
• Bring patient insights into the lifecycle of pharmaceutical products: from development to surveillance (pharmacovigilance)
• Mining the web and social networks for mentions of Adverse Drug Reactions
• Collaboration with a major pharma company and another Swiss university
VetMine
Assisted curation
• The OntoGene/BioMeXT group has been active in assisted curation since 2010 with the SASEBio project (Semi-Automated Semantic Enrichment of the Biomedical Literature).
• Since 2013 we have been collaborating with the RegulonDB database in a project aimed at testing and gradually introducing assisted curation techniques in their curation pipeline.
• RegulonDB is a database of the regulatory network of Escherichia coli K-12.
Example
We additionally found that expression of the mntP gene is upregulated by manganese through MntR.
• Given: MntR [+] mntP
• To identify: condition [manganese]
OxyR experiment
• TOPIC: oxidative stress by OxyR
• CORPUS: 46 papers, curated in RegDB
• METHODS: automated annotations of entities via OntoGene, selection of sentences via ODIN filters, manual validation
• RESULTS: 100% of RIs retrieved, including TF, EFFECT and their TG
• Identified the growth conditions for 15 of the 20 RIs of OxyR by checking only a limited set of sentences (about 10% of the article is read)
MedMon - Monitoring Social Media Content for Patient-related Information: Approaches and Challenges
Tilia Ellendorff, Fabio Rinaldi
Universität Zürich, Institut für Computerlinguistik
February 4, 2020
MedMon February 4, 2020 1 / 15
Introduction
The MedMon Project: Monitoring of internet resources for pharmaceutical research and development
• Innosuisse project together with a big pharmaceutical company and researchers from the University of Applied Sciences of the Grisons
• Main motivations and goals:
  • Discovery of unmet medical needs of rare disease patients
  • Assessment of patients’ perception of their specific disease burden
  • Early detection and tracking of epidemic outbreaks
  • Optimization of patient recruitment for clinical studies (location, age groups, etc.)
  • Monitoring of conversations on clinical trials and treatments
MedMon: Data Sources
• Use-cases:
  • Parkinson’s disease
  • Multiple sclerosis
  • Angelman syndrome
  • Viral infections
• Data access via MedMonPortal:
  • Twitter
  • Reddit
  • Patient forums (e.g. parkinson.forum.org)
• Languages: English, German, French, Spanish (but focus on English)
MedMon Processing Pipeline
Data Retrieval
• Retrieval of relevant micro-posts from different data sources
  • Twitter
  • Reddit
  • Medical forums
• So far retrieval by keyword search: “Parkinson’s Disease”, “PD”, “Multiple Sclerosis”, “MS”
• Future: disease-agnostic retrieval of micro-posts
Pre-processing and Removal of Duplicates
• Domain-specific data pre-processing
  • Aims: reduce sparsity; remove bias introduced by duplicates
  • Examples:
    • User names and numbers are replaced with placeholders: “@Jah423” → “@user”, “70 kids” → “NUMBER kids”
    • URLs are truncated to their domain names: “http://tinyurl.com/yereux6” → “http://tinyurl.com”
    • Hash symbols are stripped from hash tags: “#flu” → “flu”
    • Camel-cased expressions are split into their component words: “SideEffects” → “Side Effects”
    • Frequent colloquial abbreviations are resolved to their full version: “w/” → “with”
    • Repetitions of letters (> 3) are replaced with a single or double letter: “greaaaaat” → “great”, “freeeeeze” → “freeze”
• Identification and removal of duplicate micro-posts
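The normalisation rules listed on this slide can be sketched with regular expressions. This is a minimal illustration, not the MedMon implementation: the function names, the abbreviation table (only “w/” comes from the slide), the tiny vocabulary used to choose between a single and a double letter, and the exact-match duplicate criterion are all assumptions.

```python
import re

# Hypothetical abbreviation table; only "w/" is taken from the slide.
ABBREVIATIONS = {"w/": "with"}
# A letter followed by at least three more copies of itself (i.e. a run > 3).
ELONGATION = re.compile(r"([A-Za-z])\1{3,}")
# Stand-alone numbers, not digits embedded in a word.
NUMBER = re.compile(r"(?<![A-Za-z\d])\d+(?:[.,]\d+)*(?![A-Za-z\d])")


def collapse_elongation(token, vocab):
    """Collapse a letter run to two letters if that yields a known word, else to one."""
    doubled = ELONGATION.sub(r"\1\1", token)
    if doubled.lower() in vocab:
        return doubled
    return ELONGATION.sub(r"\1", token)


def normalize_token(token, vocab):
    if re.match(r"https?://", token):
        # Keep only scheme and domain of a URL.
        return re.sub(r"^(https?://[^/]+).*$", r"\1", token)
    if token.lower() in ABBREVIATIONS:
        return ABBREVIATIONS[token.lower()]
    token = re.sub(r"@\w+", "@user", token)              # user names
    token = token.lstrip("#")                             # hash tags
    token = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token)    # camel case
    token = NUMBER.sub("NUMBER", token)                   # numbers
    return collapse_elongation(token, vocab)


def normalize(post, vocab=frozenset()):
    return " ".join(normalize_token(t, vocab) for t in post.split())


def remove_duplicates(posts, vocab=frozenset()):
    """Assumed criterion: posts are duplicates if equal after normalisation."""
    seen, unique = set(), []
    for post in posts:
        key = normalize(post, vocab).lower()
        if key not in seen:
            seen.add(key)
            unique.append(post)
    return unique
```

Deciding between “great” and “freeze” needs some lexical knowledge; here that is a plain set of known words passed as `vocab`, which a real system would replace with a proper lexicon.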
Spam-filtering and Filtering by Disease Mentions
• Identification of micro-post relevance
  • Does it mention a disease?
    “Having a blast at Takin’ the Mic tonight with Karl Parkinson and Caelainn Hogan @CaelainnH https://t.co/L2jr4y5HSG. ”
  • Spam vs. relevant content
    “Seratopicin Healing Pain Relief this cream incorporates all-natural, vegan components. [...] This product can be used for maximum of your commonplace muscle and joint pains.”
Personal Health-Mention Identification
• Identification of Personal Health Mentions (PHM)
  • Is the micro-post about a specific patient?
  • Is the micro-post authored by the patient, by a relative or a health-care provider?
• Participation in Social Media Mining for Health Shared Task (SMM4H 2019)
  • Task 4: Generalizable identification of personal health experience mentions
  • Focus of shared task: generalize from two given health contexts/health conditions (influenza vaccination and infection) in the training set to three unknown health contexts in the test set
SMM4H 2019: Our Best-Performing Approach to PHM Identification
• BERT classifier: pre-trained Bidirectional Encoder Representations from Transformers, with a sequence classification head (tuned to the task)
• Merging of all model parameters into a single model
  • Average: pointwise average of all model parameters
  • Weighted average: models are weighted by their performance on the development fold (measured as F-score and transformed by softmax)
SMM4H 2019: Our Best-Performing Approach to PHM Identification
• Weighted average of model parameters:
  • 1 x merged model from models trained on vaccination
  • 9 x merged model from models trained on infection
  • (sub-models were averaged)
  • all models trained for 4 epochs
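The parameter merging described on the two slides above can be illustrated as follows. This is a sketch under assumptions: models are represented as plain dictionaries mapping a parameter name to a flat list of floats, and the function names are hypothetical; the actual system merges BERT checkpoints.

```python
import math


def softmax(scores):
    """Turn development-fold F-scores into weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def merge_models(models, weights=None):
    """Pointwise (optionally weighted) average of all model parameters.

    models: list of dicts {parameter name: list of floats}, same shapes in all.
    weights: one weight per model; uniform if omitted.
    """
    if weights is None:
        weights = [1.0] * len(models)
    total = sum(weights)
    merged = {}
    for name in models[0]:
        size = len(models[0][name])
        merged[name] = [
            sum(w * m[name][k] for w, m in zip(weights, models)) / total
            for k in range(size)
        ]
    return merged
```

With this sketch, the fold models of one health context would be merged with `merge_models(fold_models, softmax(f_scores))` or with a plain average, and the final 1 : 9 combination corresponds to `merge_models([vaccination_model, infection_model], weights=[1, 9])`.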
SMM4H 2019: Results
Results Task 4 (PHM Identification)
Identification of Disease Mentions and Symptoms
• Span-level annotation of Diseases/Disorders/Health Conditions (+ normalization)
  • Recognition of layman terminology: “Angelman Syndrome”, “Angelmans”, “Angel”, “Angels”, “AS”
• Span-level recognition of Symptoms (+ normalization)
  • Recognition of layman terminology
  • Recognition of multi-word expressions: “I’ve been having troubles falling asleep at night”
  • Distinction from adverse drug reaction mentions (ADRs): “One woman new to our support group never had a tremor until recently when she was given pain killer containing epinephrine and she has not stopped having tremors since that dental appointment.”
Identification of Patients’ Attributes
• Span-level recognition of Medication Intake
• Span-level recognition of Patients’ Attributes
  • Location: where is the patient located
  • Age: age at post and current age; estimated year of birth
  • Gender: female vs. male vs. third gender
• Currently: dictionary look-up and regular expressions at micro-post level
• Future: patients’ attributes and medication intake should be recognized in relation to mention of patient
Conclusions: (Current) Challenges
• Disease-agnostic retrieval of micro-posts
• Identification of micro-post relevance: filtering of spam, identification of personal health mentions
• Span-level recognition of Symptom Mentions: Symptoms vs. ADRs
• Span-level recognition of Patients’ Attributes and Medication (Intake): recognition in relation to patient
Sequence Tagging for Concept Recognition
OntoGene Tools and CRAFT shared task 2019
Lenz Furrer, Joseph Cornelius, Fabio Rinaldi
Institute of Computational Linguistics
University of Zurich
February 4, 2020
The CRAFT Shared Task 2019
CRAFT: the Colorado Richly Annotated Full-Text corpus
• 67 articles for training (released 2012)
• 30 articles for testing (released now)
• annotated with bio entities, dependencies and coreferences
• one sub-task each in the competition

Task (CA)
• Named Entity Recognition (NER) and Normalisation (NEN)
• 10 entity types, plain + extended → 20 separate evaluations
Team: Lenz Furrer, Joseph Cornelius
Traditional Approach to NER and NEN: Pipeline
Sentence: A common human skin tumour is caused by activating mutations in beta-catenin.
NER: O O O B-dis E-dis O O O O O O S-chem
NEN: skin tumour
Terminology excerpt (MeSH):
...
MESH:D012876 Skin Disease, Parasitic
MESH:D012877 Manifestation, Skin
MESH:D012877 Skin Manifestation
MESH:D012878 Neoplasm, Skin
MESH:D012878 Skin Neoplasm
MESH:D012878 Cancer of the Skin
MESH:D012878 Cancer, Skin
MESH:D012878 Skin Cancer
MESH:D012883 Skin Ulcer
MESH:D012883 Ulcer, Skin
MESH:D012887 Fractures, Skull
...
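The NER output on this slide uses a B/I/E/S/O tagging scheme (begin, inside, end, single, outside). A minimal sketch of how such a tag sequence can be decoded into entity spans before the NEN step is given here; the function name and the tuple layout are illustrative assumptions, not OntoGene code.

```python
def decode_bioes(tokens, tags):
    """Return (start, end, type, text) for every well-formed entity span.

    start/end are token offsets, end exclusive.
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "S":
            # Single-token entity.
            spans.append((i, i + 1, entity_type, tokens[i]))
            start = None
        elif prefix == "B":
            # Open a multi-token entity.
            start, label = i, entity_type
        elif prefix == "I" and start is not None and entity_type == label:
            # Continue the open entity.
            continue
        elif prefix == "E" and start is not None and entity_type == label:
            # Close the open entity.
            spans.append((start, i + 1, entity_type, " ".join(tokens[start:i + 1])))
            start = None
        else:
            # "O" or an inconsistent tag: discard any open entity.
            start = None
    return spans
```

Applied to the example sentence, this yields the spans “skin tumour” (dis) and “beta-catenin” (chem), which a pipeline system would then look up in the terminology.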
NEN: Sequence-to-Sequence
Negacy Degefa Hailu (2019). “Investigation of traditional and deep neural sequence models for biomedical concept recognition”. PhD thesis. University of Colorado at Denver, Anschutz Medical Campus
Joint Tackling of NER and NEN
• avoid error propagation
• mutual benefit of feedback across tasks
[Figure: pipeline (NER followed by NEN) vs. joint training (NER and NEN together)]
Multi-Task Learning for Joint Neural NER+NEN

[Figure: example sentence “Takotsubo syndrome secondary to Zolmitriptan” passed through an embedding layer (word + character); per-token outputs: Takotsubo → B-DISEASE / D054549-P, syndrome → I-DISEASE / D054549-P, secondary → O / NULL, to → O / NULL, Zolmitriptan → B-CHEMICAL / C089750; feedback matrices U and V]

Figure 3: The main architecture of our neural multi-task learning model with two explicit feedback strategies for MER and MEN. The character embedding is computed by CNN in Figure 2. Then the character representation vector is concatenated with the word embedding before feeding into the Bi-LSTM. Dashed arrows from the left to the right is the feedback from MER to MEN. Dashed arrows from the right to the left is the feedback from MEN to MER. Orange arrows indicate dropout layers applied on both the input and output vectors of Bi-LSTM.
[Excerpt from Zhao et al. (2019), p. 820]

For a k-layer Bi-LSTM tagger for MER and MEN we get:

\[
\begin{aligned}
\mathrm{MER}(w_{1:n}, i) &= \hat{y}_i^{\mathrm{MER}} = \arg\max y_i^{\mathrm{MER}}, \qquad y_i^{\mathrm{MER}} = f_{\mathrm{MER}}(v_i^k) \\
\mathrm{MEN}(w_{1:n}, i) &= \hat{y}_i^{\mathrm{MEN}} = \arg\max y_i^{\mathrm{MEN}}, \qquad y_i^{\mathrm{MEN}} = f_{\mathrm{MEN}}(v_i^k) \\
v_i^k &= F_\theta^k(x_{1:n}, i) \\
x_{1:n} &= E(w_1), E(w_2), \ldots, E(w_n)
\end{aligned}
\]

where $E$ is an embedding function mapping each word in the vocabulary into a d-dimensional vector, $y_i^{\mathrm{MER}}$ is the log-probabilities vector with the length of MER tag space, $\hat{y}_i^{\mathrm{MER}}$ is the output tag of MER, $y_i^{\mathrm{MEN}}$ is the log-probabilities vector with the length of MEN tag space, $\hat{y}_i^{\mathrm{MEN}}$ is the output tag of MEN, and $v_i^k$ is the output of the kth Bi-LSTM layer as defined above. All the parameters are trained separately for MER and MEN because we model MER and MEN as different sequence labeling tasks.

Multi-task Mode with Explicit Feedback Strategies

The dependencies between MER and MEN inspire us to explore their potential mutual benefits. In order to make the most of the mutual benefits between MER and MEN, we propose to feed the above mentioned Bi-LSTM and its variants into multi-task learning framework with two explicit feedback strategies, as shown in Figure 3. This method (1) is able to convert hierarchical tasks into parallel multi-task mode while maintaining mutual supports between tasks; (2) benefits from general representations of both tasks provided by multi-task learning; (3) is effective in determining boundaries of medical named entities through explicit feedback strategies thus improves the performance of both MER and MEN.

We experiment with a multi-task learning architecture based on stacked Bi-LSTM, CNNs and CRF. Multi-task learning can be seen as a way of regularizing model induction by sharing representations with other inductions. We use stacked Bi-LSTM-CNNs-CRF with task supervision from multiple tasks, sharing Bi-LSTM-CNNs layers among the tasks.

MER and MEN are hierarchical tasks and their outputs potentially have mutual benefits for each other as well. It means MEN can take MER results as input, while the results of MEN can be also useful for MER. However, MER and MEN can be implemented independently as different sequence tagging tasks. Therefore, we 1) follow the popular strategy of multi-task learning to share representations between MER and MEN; and 2) propose to use mutual feedback between MER and MEN, i.e., the result of MER is fed into the MEN as part of the input and the result of MEN is fed into the MER as part of the input. The multi-task learning with two explicit feedback strategies for MER and MEN is defined as:

\[
\begin{aligned}
\mathrm{MER}(w_{1:n}, i) &= \hat{y}_i^{\mathrm{MER}} = \arg\max y_i^{\mathrm{MER}}, \qquad y_i^{\mathrm{MER}} = f_{\mathrm{MER}}(v_i^{\mathrm{MER}}) \\
\mathrm{MEN}(w_{1:n}, i) &= \hat{y}_i^{\mathrm{MEN}} = \arg\max y_i^{\mathrm{MEN}}, \qquad y_i^{\mathrm{MEN}} = f_{\mathrm{MEN}}(v_i^{\mathrm{MEN}}) \\
v_i^{\mathrm{MER}} &= v_i^k \circ (v_i^k + y_i^{\mathrm{MEN}} U) \\
v_i^{\mathrm{MEN}} &= v_i^k \circ (v_i^k + y_i^{\mathrm{MER}} V) \\
v_i^k &= F_\theta^k(x_{1:n}, i) \\
x_{1:n} &= E(w_1), E(w_2), \ldots, E(w_n)
\end{aligned}
\]

where $f_{\mathrm{MER}}(v_i^{\mathrm{MER}})$ is the MER multi-class classification function and $f_{\mathrm{MEN}}(v_i^{\mathrm{MEN}})$ the MEN multi-class classification function. $v_i^{\mathrm{MER}}$ is the input of MER multi-class classification function, which combines the output of the shared stacked Bi-LSTM-CNNs and the explicit feedback from MEN. $v_i^{\mathrm{MEN}}$ is the input of MEN multi-class classification function, which combines the output of the shared Bi-LSTM-CNNs and the explicit feedback from MER. $U$ is the matrix to map the feedback from MEN to MER, $V$ maps the feedback from MER to MEN. You can consider [...]
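To make the shapes in the feedback equations concrete, the combination v ∘ (v + yU) can be sketched in plain Python. Two assumptions are made that the excerpt does not state: ∘ is read as the element-wise (Hadamard) product, and U (respectively V) has one row per tag of the other task and one column per dimension of the Bi-LSTM output. The function names are hypothetical.

```python
def vec_mat(y, matrix):
    """Row vector y (length t) times matrix (t x d) gives a vector of length d."""
    return [
        sum(y[r] * matrix[r][c] for r in range(len(y)))
        for c in range(len(matrix[0]))
    ]


def feedback_input(v, y_other, matrix):
    """Compute v * (v + y_other @ matrix), element-wise.

    v: shared Bi-LSTM output for one token (length d).
    y_other: log-probability vector of the other task (length t).
    matrix: U when feeding MEN into MER, V when feeding MER into MEN.
    """
    projected = vec_mat(y_other, matrix)
    return [vi * (vi + pi) for vi, pi in zip(v, projected)]
```

For example, with `v = [1.0, 2.0]`, `y_other = [0.5, 0.5, 0.0]` and `matrix = [[1, 0], [0, 1], [1, 1]]`, the projected feedback is `[0.5, 0.5]` and the classifier input becomes `[1.5, 5.0]`.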
Sendong Zhao et al. (2019). “A Neural Multi-Task Learning Framework to Jointly Model Medical Named Entity Recognition and Normalization”. In: Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), pp. 817–824. DOI: 10.1609/aaai.v33i01.3301817
BiLSTM with Synchronous NER and NEN
[Figure: network layers embedding, BiLSTM, NER, NEN. Example with per-token NER tag and NEN ID: RanBP2 → S / 13712, modulates → O / NIL, Hexokinase → B / 8608, I → E / 8608, activities → O / NIL]
Fine-Tuning BioBERT
[Figure: BioBERT token tagging. Inputs [CLS], Tok1, Tok2, ..., TokN; embeddings E[CLS], E1, E2, ..., EN; outputs C, T1, T2, ..., TN; example output labels NIL, CHEBI:8608, NIL. Figure labels: BERT, PubMed (1M)]

Train two independent models:
• span detector (NER)
• ID tagger (NEN)
Unseen IDs: Pretraining (BiLSTM)
Pretraining
• pretrain on ontology names
  → top 1000 concepts only
  → copy output label to feature input
  → 20 epochs
• then continue training on corpus sentences (6-fold cross-validation, early stopping)
Unseen IDs: Back-Off (BERT)
[Figure: back-off example on “RanBP2 modulates Hexokinase I activities”. Rows: IDs, spans (S O B E O), OGER, predictions. IDs shown include PR:13712, PR:8608, PR:29546 and NIL]
OGER
[http://www.ontogene.org/resources/oger]
OGER: annotation service
The OntoGene’s Biomedical Entity Recogniser (OGER)
• RESTful web service, using BTH terminologies
• Allows annotation of a collection of documents.
• Evaluated in the Bio Text Mining services challenge BioCreative/TIPS
• best results according to several of the evaluation metrics.
[http://www.ontogene.org/resources/oger]
BioCreative V.5 / TIPS
Bio Term Hub
The Bio Term Hub is an aggregator of biomedical terminologies, which is kept up-to-date by automatically integrating content from manually curated databases.
[http://www.ontogene.org/resources/termdb]
Metrics
Slot Error Rate (Bossy et al. 2013)
• count matches (M), insertions (I), deletions (D), substitutions (S)
  → substitutions: penalty for incorrect boundaries and distance to correct ontology entry
  → find optimal alignment

\[ \mathrm{SER} = \frac{S + I + D}{N} \]

Precision/Recall/F1
• count matches (M) and partial matches ($M_p = 1 - S$)

\[ \mathrm{Recall} = \frac{M + M_p}{N}, \qquad \mathrm{Precision} = \frac{M + M_p}{P} \]

N: number of ground-truth annotations, P: number of predicted annotations
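The two metric families on this slide can be computed from alignment counts as sketched here. This is an illustration under assumptions, not the official CRAFT evaluation code: the alignment is taken as given, each substitution carries a penalty between 0 and 1, S is the sum of those penalties, M_p is the sum of (1 − penalty), and N and P are derived as matches + substitutions + deletions and matches + substitutions + insertions respectively.

```python
def slot_error_rate(matches, sub_penalties, insertions, deletions):
    """SER = (S + I + D) / N, with S the summed substitution penalties."""
    n_ref = matches + len(sub_penalties) + deletions
    return (sum(sub_penalties) + insertions + deletions) / n_ref


def precision_recall_f1(matches, sub_penalties, insertions, deletions):
    """Precision, recall and F1 with partial credit for substitutions."""
    n_ref = matches + len(sub_penalties) + deletions    # N: ground truth
    n_pred = matches + len(sub_penalties) + insertions  # P: predictions
    partial = sum(1.0 - s for s in sub_penalties)       # M_p
    recall = (matches + partial) / n_ref
    precision = (matches + partial) / n_pred
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For instance, 6 exact matches, two substitutions with penalty 0.5 each, one insertion and one deletion give N = P = 9, SER = 3/9 and precision = recall = F1 = 7/9. Unlike F1, SER is an error measure (lower is better) and can exceed 1 when there are many insertions.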
Competing Systems
• Baseline: plain dictionary-based system (OGER)
• BiLSTM
  • 6-fold CV, early stopping
  • + pretraining
  • + multiple runs per fold, pick best
• BioBERT fine-tuned for 55 epochs
  • ID tagger
  • span tagger combined with OGER
  • combination of the previous two (back-off)
Results: Legend
[Figure legend: axes F1 and SER, each from 0.0 to 1.0. Systems: OGER (baseline); BiLSTM; BiLSTM, pretrained; BiLSTM, pretrained, pick-best; BERT-IDs; BERT-spans+OGER; BERT-IDs+BERT-spans+OGER]
Results
[Figure: F1 and SER (0.0 to 1.0) per annotation set: CHEBI, CL, GO_BP, GO_CC, GO_MF, MOP, NCBITaxon, PR, SO, UBERON, and the corresponding extended (EXT) sets]
Examples: Correct Unseen IDs
BERT-IDs+BERT-spans+OGER predicts CHEBI_PR_EXT:somatostatin (twice):
However, the somatostatin receptor 2 (SSTR-2) antagonist PRL-2903 does not interfere with the ability of glucose (at 3 and 7 mM) to inhibit glucagon secretion from mouse islets [47].

BERT-IDs+BERT-spans+OGER predicts CHEBI:60004:
Adult mouse testes were homogenized in a buffer containing 20 mM Tris, pH 7.5, 100 mM KCl, 5 mM MgCl2, 0.3% NP-40, 40 U/ml of Rnasin ribonuclease inhibitor (Promega, Madison, WI), and a mixture of 10 protease inhibitors provided [...]

BiLSTM pick-best predicts PR:000008373:
Decreased Osteogenic Differentiation Correlates with Abnormal Distribution of Cx43
Unseen IDs: Precision/Recall
(cells: precision / recall)

Annotation set   unique   occ.   OGER          pretraining   BERT spans    BERT back-off
CHEBI            110      447    0.33 / 0.65   1.00 / 0.00   0.74 / 0.47   0.70 / 0.11
CHEBI_EXT        134      538    0.37 / 0.71   1.00 / 0.00   0.62 / 0.49   0.76 / 0.09
CL               52       484    0.72 / 0.31   1.00 / 0.00   0.88 / 0.22   0.59 / 0.04
CL_EXT           52       484    0.72 / 0.31   1.00 / 0.00   0.71 / 0.25   0.71 / 0.11
GO_BP            120      484    0.21 / 0.25   1.00 / 0.00   0.56 / 0.12   0.66 / 0.06
GO_BP_EXT        126      508    0.22 / 0.28   1.00 / 0.00   0.29 / 0.18   0.62 / 0.07
GO_CC            32       184    0.19 / 0.35   1.00 / 0.00   0.50 / 0.17   0.49 / 0.06
GO_CC_EXT        36       231    0.28 / 0.47   1.00 / 0.00   0.58 / 0.19   0.60 / 0.07
GO_MF            1        1      0.10 / 0.50   1.00 / 0.00   1.00 / 0.00   1.00 / 0.00
GO_MF_EXT        73       416    0.38 / 0.22   1.00 / 0.00   0.57 / 0.15   0.54 / 0.04
NCBITaxon        40       87     0.02 / 0.50   1.00 / 0.00   0.40 / 0.34   0.75 / 0.22
NCBITaxon_EXT    44       95     0.02 / 0.54   1.00 / 0.00   0.43 / 0.35   0.85 / 0.25
PR               278      4782   0.26 / 0.86   0.63 / 0.00   0.81 / 0.74   0.69 / 0.15
PR_EXT           309      5156   0.27 / 0.84   0.34 / 0.01   0.84 / 0.73   0.65 / 0.20
SO               16       101    0.04 / 0.87   1.00 / 0.00   0.10 / 0.06   0.52 / 0.02
SO_EXT           25       123    0.05 / 0.78   1.00 / 0.00   0.28 / 0.47   0.85 / 0.41
UBERON           203      1297   0.47 / 0.33   0.69 / 0.00   0.74 / 0.25   0.59 / 0.06
UBERON_EXT       207      1308   0.47 / 0.33   0.87 / 0.00   0.78 / 0.27   0.60 / 0.06
Conclusions
• Solid, easy-to-use, efficient dictionary-based solution with constantly up-to-date resources
• Bio Term Hub: a one-stop site for obtaining up-to-date biomedical terminological resources. http://www.ontogene.org/resources/termdb
• OGER: an efficient text annotation tool using the BTH terminologies. Provides spans and IDs (NER and CR). http://www.ontogene.org/resources/oger
• Integration with state-of-the-art disambiguation approaches for specific applications
• Applications over several text types: literature, clinical records, social media
http://www.ontogene.org/
https://github.com/OntoGene/craft-st