![Page 1: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/1.jpg)
Open Science enabled by Text Mining
Sophia Ananiadou
National Centre for Text Mining
www.nactem.ac.uk
School of Computer Science
The University of Manchester
![Page 2: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/2.jpg)
Open Science and Text Mining
“Words hide invisible things from us more than they reveal visible ones” (Camus)
“[…] instruments are at hand which, if properly developed, will give access to and command over the inherited knowledge of the ages.” (Preface to Vannevar Bush’s article “As We May Think”, The Atlantic, July 1945)
![Page 3: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/3.jpg)
What hinders Open Science?
• Silos and fragmentation; the literature grows daily
• We cannot read faster, so we read more narrowly
• Reductionist approaches, incomplete and uncertain information, complex problem space
• We cannot make links between disciplines
Scientific discovery by serendipity?
Ananiadou 3
![Page 4: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/4.jpg)
Big Textual Data
[Diagram labels: Big Text Data; Unstructured Data; Text Mining (organise, analyse); Semantics; Structured Data; Knowledge; OPEN DATA, OPEN CONTENT; OPEN SCIENCE]
![Page 5: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/5.jpg)
Text Mining
[Diagram labels: Other data; Applications; Semantic search; Ontologies]
![Page 6: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/6.jpg)
UK National Centre for Text Mining
• 1st publicly funded national text mining centre
• Location: Manchester Institute of Biotechnology
• Phase I - Biology (2005-2008)
• Phase II - Biology, Medicine, Social Sciences (2008-2011)
• Phase III - Biology, Medicine, Health, Biodiversity, Humanities, Social Sciences; fully sustainable centre (2011-)
www.nactem.ac.uk
![Page 7: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/7.jpg)
Text Mining for Open Science
Tools, Resources, Infrastructure, Services
• Tools for extracting concepts, relations, events, epistemic knowledge
• Resources: lexical repositories, annotated data stores
• Supporting semantic search, developing models, experiments, knowledge discovery, hypothesis generation
• Interoperable infrastructure
Communities
• Biology
• Medicine
• Digital Humanities
• Computational Social Science
• Biodiversity
• Chemistry
• Public Health
• Pharmacovigilance
• …
![Page 8: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/8.jpg)
Thalia: Text Mining for Highlighting, Aggregating, and Linking Information in Articles
Semantic search from PubMed
• Eight different entity types: chemicals, diseases, drugs, genes, metabolites, proteins, species and anatomical entities
• Covers entire PubMed
– Over 26M abstracts
– Automatic daily updates
Nactem-copious.man.ac.uk
Soto, Przybyla, Ananiadou (2018), Bioinformatics
![Page 9: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/9.jpg)
Beyond keyword search
This search strategy returns 125,679 abstracts
Unnecessary workload
![Page 10: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/10.jpg)
Search based on semantic types
Faceted search: specify the semantic category of query terms
Search results narrowed down to 18,457
Using faceted search, we reduced the number of papers that we need to read by 85%
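A minimal sketch of the idea behind faceted search, assuming each abstract carries a map from recognised terms to semantic types. The `Abstract` class, the function names and the toy corpus are hypothetical and are not Thalia's API; only the two result counts (125,679 and 18,457) come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Abstract:
    doc_id: str  # hypothetical identifier
    # Maps each recognised term (lower-cased) to the semantic types assigned to it.
    annotations: dict = field(default_factory=dict)


def keyword_search(abstracts, term):
    """Return every abstract that mentions the term, whatever its type."""
    return [a for a in abstracts if term.lower() in a.annotations]


def faceted_search(abstracts, term, semantic_type):
    """Return only abstracts where the term was tagged with the given type."""
    return [
        a for a in abstracts
        if semantic_type in a.annotations.get(term.lower(), set())
    ]


def reduction(before, after):
    """Fraction of results removed by narrowing the search."""
    return (before - after) / before


# Invented toy corpus: the same string tagged with different semantic types.
corpus = [
    Abstract("doc1", {"insulin": {"drug"}}),
    Abstract("doc2", {"insulin": {"protein"}}),
    Abstract("doc3", {"insulin": {"protein"}}),
]

print(len(keyword_search(corpus, "insulin")))
print([a.doc_id for a in faceted_search(corpus, "insulin", "drug")])
print(f"{reduction(125_679, 18_457):.1%}")
```

The last line applies the same ratio to the counts reported on the slides, which is where the figure of roughly 85% comes from.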
![Page 11: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/11.jpg)
Disambiguation
We used the acronym ‘GAD’ to search for ‘Generalised Anxiety Disorder’
Search engine returned 7,387 documents
![Page 12: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/12.jpg)
Dealing with Acronyms
Thalia allows acronym disambiguation
‘GAD’ has two main possible full forms, i.e., ‘glutamic acid decarboxylase’ (gene) and ‘generalized anxiety disorder’ (disease)
![Page 13: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/13.jpg)
Dealing with Acronyms
By disambiguating the acronym in our search strategy we reduced the number of articles from 7,387 to 1,674 (increased specificity)
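One way to picture acronym disambiguation is a context-overlap heuristic: pick the full form whose cue words appear most often around the acronym. This is only an illustrative sketch; the slides do not describe how Thalia resolves acronyms, and the cue words in `SENSE_CUES` are invented for the example.

```python
import re

# Hypothetical cue words per full form; a real system would learn these from data.
SENSE_CUES = {
    "generalized anxiety disorder (disease)": {"anxiety", "patients", "symptoms", "disorder"},
    "glutamic acid decarboxylase (gene)": {"enzyme", "gaba", "expression", "neurons"},
}


def disambiguate(sentence):
    """Return the full form whose cues overlap most with the sentence, or None."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(tokens & cues) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(disambiguate("GAD patients reported fewer anxiety symptoms"))
print(disambiguate("GAD expression in GABA neurons"))
print(disambiguate("GAD was measured"))
```

Returning `None` when no cue matches keeps the sketch from guessing; a search interface could then fall back to showing both full forms.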
![Page 14: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/14.jpg)
Thalia: document view
• Each named entity is color-coded
• Multiple-typed entities are visualised with a white rectangle
• Entities are linked to the corresponding URI in the ontology
![Page 15: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/15.jpg)
Understanding Mechanisms for Pathway Construction and Modelling
• Is it enough to mine just correlations from big data?
• Understanding not only the components but also the system
mTOR pathway: 964 entities, 777 reactions, 519 papers
Caron, et al. Mol Syst Biol., 6(1)
A MANUAL PROCESS
Inevitable gaps building models
![Page 16: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/16.jpg)
Evidence-based science: linking text with knowledge
TEXT MINING
KNOWLEDGE
![Page 17: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/17.jpg)
Linking Knowledge with Text Curation
Pathways
• Key to understanding biological systems
• Models need verification and maintenance (i.e., annotation/curation)
• Scale and speed of literature challenging
• Annotation/curation remains largely a manual task of incorporating knowledge from scientific publications
![Page 18: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/18.jpg)
The Big Mechanism: reading, assembly, experiments
http://nactem.ac.uk/big_mechanism/
![Page 19: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/19.jpg)
From concepts to events
1 Concept recognition
2 Interaction recognition
3 Concept and interaction identification
DrugBank:DB06712 DrugBank:DB00682 DrugBank:DB04610
![Page 20: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/20.jpg)
Event Extraction
Event E1
Type: Gene_expression
Trigger: Expression
Theme: aurora B
Event E2
Type: Phosphorylation
Trigger: phosphorylation
Theme: S6K1
Event E3
Type: Phosphorylation
Trigger: phosphorylation
Theme: 4E-BP1
Event E4
Type: Positive_regulation
Trigger: enhances
Cause: E1
Theme: E2
Event E5
Type: Positive_regulation
Trigger: enhances
Cause: E1
Theme: E3
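The five events above form a small nested structure: E4 and E5 take other events as their Cause and Theme. The sketch below encodes them as plain Python objects to make that nesting explicit. The `Event` class and `render` helper are illustrative names, not part of any NaCTeM tool; the field values are copied from the slide.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Event:
    id: str
    type: str
    trigger: str
    theme: Union[str, "Event"]                   # an entity name or another event
    cause: Optional[Union[str, "Event"]] = None  # optional; entity name or event

    @property
    def is_complex(self) -> bool:
        """True when at least one argument is itself an event."""
        return any(isinstance(arg, Event) for arg in (self.theme, self.cause))


def render(arg) -> str:
    """Spell out an argument, expanding nested events recursively."""
    if not isinstance(arg, Event):
        return str(arg)
    parts = [f"Theme={render(arg.theme)}"]
    if arg.cause is not None:
        parts.insert(0, f"Cause={render(arg.cause)}")
    return f"{arg.type}({', '.join(parts)})"


e1 = Event("E1", "Gene_expression", "Expression", "aurora B")
e2 = Event("E2", "Phosphorylation", "phosphorylation", "S6K1")
e3 = Event("E3", "Phosphorylation", "phosphorylation", "4E-BP1")
e4 = Event("E4", "Positive_regulation", "enhances", theme=e2, cause=e1)
e5 = Event("E5", "Positive_regulation", "enhances", theme=e3, cause=e1)

for event in (e1, e2, e3, e4, e5):
    print(event.id, "complex" if event.is_complex else "simple", render(event))
```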
![Page 21: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/21.jpg)
Uncertainty
Examples from Big Mechanism data
These results indicate that FLCN can interact directly with RagA via its GTPase domain.
Altogether, these results show that cobalt could affect both p53 and HIPK2 activity.
To test if endogenous hPGAM5 interacts with hPINK1, we first generated an anti-hPGAM5 antibody.
We hypothesize that unphosphorylated cdr2 interacts with c-myc to prevent c-myc degradation.
Therefore, AFP may interact with STAT3 in the signal pathway for chemotherapeutic efficiency of agents on AFPGC.
These data suggest that PI3K and βARK1 form a macromolecular complex within the cell.
Therefore, LiCl might inhibit GSK3β in different ways.
We then examined whether netrin-2 enhances the interaction between Cdo and Stim1.
![Page 22: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/22.jpg)
Uncertainty cues
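A minimal sketch of cue-based detection, assuming a small hand-written lexicon. The patterns below are drawn from the example sentences on the previous slide; the slides do not show the actual cue list or the method used for the Big Mechanism data, so treat `CUE_PATTERNS` and `find_cues` as illustrative only.

```python
import re

# Hand-picked cue patterns taken from the example sentences (an assumption,
# not the published cue list).
CUE_PATTERNS = [
    r"indicat\w*",
    r"could",
    r"to test if",
    r"hypothesi[sz]\w*",
    r"may",
    r"suggest\w*",
    r"might",
    r"whether",
]
CUE_REGEX = re.compile(r"\b(?:" + "|".join(CUE_PATTERNS) + r")\b", re.IGNORECASE)


def find_cues(sentence):
    """Return the uncertainty cues found in a sentence, in order of appearance."""
    return [match.group(0).lower() for match in CUE_REGEX.finditer(sentence)]


examples = [
    "These data suggest that PI3K and βARK1 form a macromolecular complex within the cell.",
    "Therefore, LiCl might inhibit GSK3β in different ways.",
    "We then examined whether netrin-2 enhances the interaction between Cdo and Stim1.",
]
for sentence in examples:
    print(find_cues(sentence), sentence)
```

A lexicon match only flags a sentence; deciding which event the cue actually scopes over needs the event structure as well.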
![Page 23: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/23.jpg)
Events and their interpretation
[Figure: annotated example sentences contrasting a SIMPLE EVENT and a COMPLEX EVENT. Labels: event trigger, Theme, Theme 1, Theme 2, Cause, entity argument, event argument; event types Binding and Regulation; entity types Protein and Chemical; entities MUC1, PKM2, RAS, BRAF.]
*Complex events have at least one argument that is an event on its own
![Page 24: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/24.jpg)
Uncertainty for model curation and experiments
• Uncertainty scoring as an expressive confidence measure
• Value for each event mentioned in a sentence
– Consolidated uncertainty values from different papers
• Aim: reduce manual effort and select more certain events
Zerva, C., Batista-Navarro, R., Day, P. and S. Ananiadou (2017) Using uncertainty to link and rank evidence from biomedical literature for model reconstruction, Bioinformatics
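To illustrate what consolidating uncertainty values from different papers could look like, the sketch below averages per-mention certainty scores for each event and ranks events by the result. The averaging rule, the score range and the toy values are assumptions made for this example; the slides do not state how the cited work combines values.

```python
from collections import defaultdict
from statistics import mean


def consolidate(mentions):
    """Average certainty per event.

    mentions: iterable of (event_key, paper_id, certainty) with certainty in [0, 1].
    """
    by_event = defaultdict(list)
    for event_key, _paper_id, certainty in mentions:
        by_event[event_key].append(certainty)
    return {event_key: mean(values) for event_key, values in by_event.items()}


def rank(consolidated):
    """Event keys ordered from most to least certain."""
    return sorted(consolidated, key=consolidated.get, reverse=True)


# Invented scores for two events mentioned across three hypothetical papers.
mentions = [
    (("Phosphorylation", "S6K1"), "paper1", 1.0),
    (("Phosphorylation", "S6K1"), "paper2", 0.5),
    (("Phosphorylation", "4E-BP1"), "paper3", 0.25),
]

scores = consolidate(mentions)
for event_key in rank(scores):
    print(event_key, scores[event_key])
```

A curator could then review events from the top of the ranking first, which is one reading of "select more certain events".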
![Page 25: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/25.jpg)
LitPathExplorer: a confidence-based tool for exploring pathway models
1. Enable flexible search and exploration of biomolecular pathway networks
– different views of the data
– various interactive functionalities
2. Provide a means for making existing evidence in the scientific literature available to support corroboration
3. Facilitate the discovery of new interactions that are not yet part of a given model
4. Allow the user to become an active participant in the analytical process
Quantify confidence in the events
Video: http://nactem.ac.uk/LitPathExplorer_BI/LitPathExplorer.mp4
![Page 26: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/26.jpg)
Search
• A pathway model can be searched by providing:
– event types,
– entities,
– and/or roles for each entity in the reaction
• Multiple queries can be combined in a Boolean search
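The search criteria above can be sketched as small predicates that are combined with Boolean operators. This is a hypothetical illustration, not LitPathExplorer's query interface: `Reaction`, `has_type`, `has_entity` and the combinators are names made up here, and the toy model reuses the events from the Event Extraction slide.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Reaction:
    event_type: str
    participants: Tuple[Tuple[str, str], ...]  # (role, entity) pairs


Query = Callable[[Reaction], bool]


def has_type(event_type: str) -> Query:
    return lambda r: r.event_type == event_type


def has_entity(entity: str, role: Optional[str] = None) -> Query:
    """Match an entity, optionally restricted to a given role."""
    return lambda r: any(
        e == entity and (role is None or ro == role) for ro, e in r.participants
    )


def all_of(*queries: Query) -> Query:   # Boolean AND
    return lambda r: all(q(r) for q in queries)


def any_of(*queries: Query) -> Query:   # Boolean OR
    return lambda r: any(q(r) for q in queries)


def negate(query: Query) -> Query:      # Boolean NOT
    return lambda r: not query(r)


def search(model, query: Query):
    return [r for r in model if query(r)]


model = [
    Reaction("Gene_expression", (("Theme", "aurora B"),)),
    Reaction("Phosphorylation", (("Theme", "S6K1"),)),
    Reaction("Phosphorylation", (("Theme", "4E-BP1"),)),
]

query = all_of(has_type("Phosphorylation"), negate(has_entity("S6K1", role="Theme")))
print(search(model, query))
```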
![Page 27: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/27.jpg)
Inspector, quantifying the confidence
Confidence breakdown
![Page 28: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/28.jpg)
Word tree visualisation: Contrast event mentions across the corpus
Sentences can be inspected further upon interaction
Vertical arrangement and gray scale denote event confidence
![Page 29: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/29.jpg)
Network Viewer: Discovery mode
Extending the model with events found in the literature
![Page 30: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/30.jpg)
Discovery mode
Difficult to explore when too many candidate events are found
![Page 31: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/31.jpg)
History of Medicine
• http://nactem.ac.uk/hom
• Archives
– British Medical Journal articles (380,000)
– London Medical Officer of Health reports (5,000)
![Page 32: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/32.jpg)
Related terms
Current search criteria
Frequency of “pulmonary consumption” over time
![Page 33: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/33.jpg)
“pulmonary tuberculosis” is added to query builder
Terms related to “pulmonary consumption” and “pulmonary tuberculosis”
![Page 34: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/34.jpg)
Number of “Environmental” entities in the document
“Environmental” entity instances in the document
“Environmental” entities are highlighted
![Page 35: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/35.jpg)
Cases where specific causes of tuberculosis are mentioned in the document
Phrases identifying Causality events are highlighted
![Page 36: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/36.jpg)
Supporting hypothesis generation
Data-driven methods complementing human hypothesis generation
Rapid mining of candidate hypotheses from text, validated against experimental data
• Migraine and magnesium deficiency
• Indomethacin and Alzheimer’s disease
• Using thalidomide for treating a series of diseases such as acute pancreatitis and chronic hepatitis C
![Page 37: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/37.jpg)
FACTA+
![Page 38: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/38.jpg)
… However, further decreases in branched-chain amino acid levels indicate that caffeine might promote deeper fatigue than placebo
![Page 39: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/39.jpg)
FACTA+
E-cadherin is associated with Parkinson’s disease via CASS4, SNAIL3, transcription factor EB, etc.
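An association "via" intermediate concepts can be pictured as two concepts that never co-occur directly but share neighbours in a co-occurrence graph. The sketch below shows that general idea only; it is not FACTA+'s algorithm, which the slides do not describe, and the four toy "documents" are invented to mirror the example on the slide.

```python
from collections import defaultdict


def build_cooccurrence(documents):
    """Map each concept to the set of concepts it appears with.

    documents: iterable of collections of concepts mentioned together.
    """
    neighbours = defaultdict(set)
    for concepts in documents:
        concepts = set(concepts)
        for concept in concepts:
            neighbours[concept] |= concepts - {concept}
    return neighbours


def indirect_links(neighbours, source, target):
    """Concepts that co-occur with both source and target, sorted by name."""
    shared = neighbours.get(source, set()) & neighbours.get(target, set())
    return sorted(shared)


# Invented toy input: each set stands for concepts found in one document.
documents = [
    {"E-cadherin", "CASS4"},
    {"CASS4", "Parkinson's disease"},
    {"E-cadherin", "SNAIL3"},
    {"SNAIL3", "Parkinson's disease"},
]

graph = build_cooccurrence(documents)
print("Parkinson's disease" in graph["E-cadherin"])
print(indirect_links(graph, "E-cadherin", "Parkinson's disease"))
```

In this toy graph the two end concepts are never mentioned together, so the shared neighbours are the only evidence linking them, which is what makes them candidate hypotheses rather than known facts.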
![Page 40: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/40.jpg)
Text mining workflows
• A pipeline that executes particular tools and resources in order; text mining is not monolithic
• Semantic search
• Different workflows can be created, compared and evaluated thanks to the ability to seamlessly “mix and match” various versions of components
• I know I can build the workflow from the components, but which one is preferred for my task?
[Workflow diagram components: PoS Tagger; Dictionary Lookup; NE Extraction; Chunking; Parsing; Semantic Query]
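The "mix and match" idea can be sketched as a list of interchangeable components, each taking a document and returning an enriched copy. Everything here is a toy stand-in: the whitespace tokeniser and dictionary lookup are simplified placeholders, and none of the names correspond to a real workflow platform's API.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]
Component = Callable[[Document], Document]


def tokenise(doc: Document) -> Document:
    """Toy tokeniser: split on whitespace."""
    return {**doc, "tokens": str(doc["text"]).split()}


def make_dictionary_lookup(lexicon: Dict[str, str]) -> Component:
    """Build a lookup component from a given (hypothetical) lexicon version."""
    def lookup(doc: Document) -> Document:
        entities = [(tok, lexicon[tok]) for tok in doc["tokens"] if tok in lexicon]
        return {**doc, "entities": entities}
    return lookup


def run_workflow(components: List[Component], text: str) -> Document:
    """Run the components in order, passing the document along."""
    doc: Document = {"text": text}
    for component in components:
        doc = component(doc)
    return doc


# Two workflows that differ only in the version of one component.
workflow_a = [tokenise, make_dictionary_lookup({"S6K1": "Protein"})]
workflow_b = [tokenise, make_dictionary_lookup({"S6K1": "Protein", "4E-BP1": "Protein"})]

text = "phosphorylation of S6K1 and 4E-BP1"
print(run_workflow(workflow_a, text)["entities"])
print(run_workflow(workflow_b, text)["entities"])
```

Because both workflows share the same interface, their outputs on the same text can be compared directly, which is the basis for deciding which component version suits a task.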
![Page 41: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/41.jpg)
Text mining workflow for uncertainty
Detection of negated events
Detection of uncertain events
![Page 43: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/43.jpg)
National Plan for Open Science
• What is the critical point in the plan?
– Open data, open content
– Linking data with content
– Reproducibility
– Research objects
– Code, experiments, annotations, databases
– Common Infrastructure for sharing and customisation
– Standards: FAIR
![Page 44: Open Science enabled by Text Mining - Sciencesconf.org · Science Ouverte et Fouille de Textes ^Les mots nous cachent davantage les choses invisibles qu’ils ne nous révèlent les](https://reader033.vdocuments.us/reader033/viewer/2022042220/5ec6701563d9f923c008978b/html5/thumbnails/44.jpg)
What action to prioritise?
• Access to data and privacy protection
– difficulty gaining access to EHRs (electronic health records)
– cross-linked with other clinical, public health, biology data and the literature
– better ways to anonymise data
– TDM (text and data mining) would contribute massively to personalised medicine
– security
• Training: lack of personnel trained in TDM
• Lack of knowledge of existing tools and resources