A few contributions of the SIFR project
Semantic Indexing of French Biomedical Resources
Data seminar, December 10th 2015, LIRMM, University of Montpellier
Clement Jonquet, Mathieu Roche, Sandra Bringay et al.
Biologists have adopted ontologies
To provide canonical representation of scientific knowledge
To annotate experimental data to enable interpretation, comparison, and discovery across databases
To facilitate knowledge-based applications for decision support, natural language processing, and data integration
But ontologies are spread out, in different formats, of different sizes, with different structures
UMLS (163 databases):
~ 9 000 000 terms in English / ~ 1 000 000 terms in Spanish
~ 330 000 terms in French
Comparison of the approaches [IWBBIO'14]
Annotation challenge
Explosion of biomedical data: diverse, distributed, unstructured… and not linked to ontologies
Hard for biomedical researchers to find the data they need
Data integration problem
Translational discoveries are prevented
Good examples
GO annotations
PubMed (biomedical literature) indexed with MeSH headings
[Figure labels: ontologies, resources]
Semantic Indexing of French Biomedical Data Resources project
… in collaboration with…
Use of biomedical ontology-based annotations in end-user applications
[Figure: project overview diagram. Labels: ontologies (MeSH 2015, SNOMED, …), researchers produce data, annotation, enrichment, A, B]
SIFR axes of research (1/8): Design of the SIFR (French) Annotator service
Deployment of a local instance of BioPortal at LIRMM
16 French terminologies imported from UMLS, EHTOP & BioPortal
http://bioportal.lirmm.fr/annotator
New improvements to the annotation workflow
Automatic term extraction measures (C-value, LIDF-value, etc.)
Scoring of annotations & representation in RDF using the AO [SWAT4LS 2014]
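C-value is one of the term extraction measures listed above. The following is a minimal sketch, assuming the commonly cited definition (log2 of the term length times its frequency, discounted by the mean frequency of the longer candidates that contain it). The function names and toy frequencies are illustrative and are not taken from the SIFR or BioTex code.

```python
import math
from collections import defaultdict


def is_nested(a, b):
    """True if word tuple a occurs as a contiguous sub-sequence of b."""
    n = len(a)
    return any(b[i:i + n] == a for i in range(len(b) - n + 1))


def c_value(freq):
    """freq maps a candidate term (tuple of words) to its corpus frequency."""
    terms = list(freq)
    # For each candidate, collect the longer candidates that contain it.
    containers = defaultdict(list)
    for a in terms:
        for b in terms:
            if len(b) > len(a) and is_nested(a, b):
                containers[a].append(b)
    scores = {}
    for a in terms:
        f = freq[a]
        if containers[a]:
            # Discount occurrences explained by longer terms.
            f -= sum(freq[b] for b in containers[a]) / len(containers[a])
        # log2(1) = 0, so single-word candidates score 0 in this variant.
        scores[a] = math.log2(len(a)) * f
    return scores


# Hypothetical frequencies, for illustration only.
toy = {("soft", "contact", "lens"): 5, ("contact", "lens"): 8}
scores = c_value(toy)
```

Here "contact lens" is penalised because 5 of its 8 occurrences sit inside the longer candidate "soft contact lens".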
SIFR axes of research (2/8): Dealing with multilingualism within BioPortal
Status of multilingualism in BioPortal: quite negative
Set of propositions [MSW 2014]
Representation of natural language property for an ontology
Representation of the distinction between ontologies
Representation of relation between ontologies
Representation of multilingual translation mappings
Reconciliation of multilingual mappings
Currently being tested/implemented within our local instance
SIFR axes of research (3/8): Automatic extraction of biomedical terminology from text
Context of the PhD of Juan Antonio Lossio
[TALN 2014][PolTAL 2014][IRJ 2015]
BioTex software: http://tubo.lirmm.fr/biotex [ISWC 2014]
Works in French, English, and Spanish
Motivations for automatic terminology extraction
Experiment with and validate approaches for French data
Contribute to the ontology enrichment process
Acquire some NLP expertise for the annotation workflow
SIFR axes of research (4/8): Semantic distance framework
Automatically compute existing semantic similarity measures (Rada, Wu & Palmer, Resnik) over BioPortal ontologies
For a given concept, get all semantically close concepts
Get the semantic distance between 2 concepts
Collaboration with LGI2P to reuse Semantic Measure Library (SML) within BioPortal
1st prototype: http://tubo.lirmm.fr/BioMedicalSemantic/web/app_dev.php
Goal: include SML within the BioPortal backend to bring semantic distance services to the ontologies and the annotated data
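The three measures named on this slide can be sketched on a toy is-a hierarchy. This is only an illustration under stated assumptions: the hierarchy, the concept names, and the probability table are hypothetical, the root is given depth 1, and the code does not use the SML or BioPortal APIs.

```python
import math

# Hypothetical toy is-a hierarchy: child -> parent (None marks the root).
PARENT = {
    "disease": None,
    "neoplasm": "disease",
    "infection": "disease",
    "carcinoma": "neoplasm",
    "sarcoma": "neoplasm",
}


def ancestors(c):
    """Path from concept c up to the root, c included."""
    path = []
    while c is not None:
        path.append(c)
        c = PARENT[c]
    return path


def depth(c):
    """Number of nodes on the path to the root (root has depth 1)."""
    return len(ancestors(c))


def lcs(a, b):
    """Least common subsumer: the deepest ancestor shared by a and b."""
    anc_a = set(ancestors(a))
    for c in ancestors(b):
        if c in anc_a:
            return c
    return None


def rada(a, b):
    """Rada distance: number of edges on the shortest is-a path."""
    c = lcs(a, b)
    return (depth(a) - depth(c)) + (depth(b) - depth(c))


def wu_palmer(a, b):
    """Wu & Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    return 2 * depth(lcs(a, b)) / (depth(a) + depth(b))


def resnik(a, b, prob):
    """Resnik similarity: information content of the lcs.

    prob maps a concept to its (assumed) corpus probability.
    """
    return -math.log(prob[lcs(a, b)])
```

Listing the semantically close concepts for a given concept then amounts to ranking all other concepts by one of these functions.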
SIFR axes of research (5/8): Informal patient data analysis
Dealing with public patient data on blogs, forums and tweets
Detection of emotion [EGC 2014][eTELEMED 2014]
Patient vocabulary (crabe vs. cancer)
Project “Parlons de nous” (www.lirmm.fr/patient-mind)
MSH-M
A vocabulary currently being constructed
Hosted and available in our local instance of BioPortal
Used for annotations, indexing
SIFR axes of research (6/8): Semantic indexing of semantic Web data and social Web data - Viewpoint project
Graph-based knowledge representation formalism
Collaboration with P. Lemoisson (CIRAD)
PhD project of Guillaume Surroca
First prototype for semantic search over HAL-LIRMM publications [IC2014]
Toward a model for Serendipity and collective intelligence [KEOD2015]
SIFR axes of research (7/8): Pharmacogenomics use case
PGx studies how individual gene variations cause variability in drug responses
Validation of state-of-the-art pharmacogenomics knowledge on the basis of practice-based evidence
Compare pharmacogenomics literature (in English) and electronic health records (in French)
EHRs from Paris (HEGP) & St Etienne hospitals
Improvement of the Annotator in order to handle clinical data: negation, disambiguation, modularity, temporality
ANR
Collaborative action led by Adrien Coulet (LORIA)
Stanford is in the loop (Russ, Mark, Michel, Nigam)
SIFR axes of research (8/8): AgroPortal project
In collaboration with the Institute of Computational Biology of Montpellier
Design of a semantic annotation workflow for plant data - collaboration with IBC project [CO-PDI 2014]
AgroLD: to build an RDF knowledge base to house plant data resources: SouthGreen, Gramene, OryGeneDB… [RDA 2014]
In collaboration with CIRAD/IRD, INRA, and Bioversity International
Experiment with NCBO technologies for the plant community
Help with the design and evolution of Cropontology.org
1-year postdoc starting in June
Interactions with NSF Planteome project (P. Jaiswal, L. Cooper)
Terminology extraction in Biomedicine (step 3)
[Figure: term extraction workflow over candidate terms term1, term2, …, termn. Panel labels: Linguistic, Statistic, Graph, Web. Steps: 1 (Ranking), 2 (Re-ranking). Measures shown: LIDF-value; TF-IDF and Okapi BM25 (keywords).]
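TF-IDF and Okapi BM25 are named in the workflow above as statistical measures. A minimal sketch of both follows, assuming tokenised documents and the usual default parameters k1 = 1.2 and b = 0.75. The IDF in the BM25 function is the non-negative "+1" variant, one of several in use; none of this is taken from BioTex.

```python
import math
from collections import Counter


def tf_idf_scores(docs):
    """docs: list of token lists. Returns one {term: tf * idf} dict per doc."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [
        {t: f * math.log(n / df[t]) for t, f in Counter(d).items()}
        for d in docs
    ]


def bm25_scores(docs, k1=1.2, b=0.75):
    """docs: list of token lists. Returns one {term: BM25 weight} dict per doc."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n  # average document length
    df = Counter(t for d in docs for t in set(d))
    out = []
    for d in docs:
        scores = {}
        for t, f in Counter(d).items():
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            # Term frequency saturates with k1; b controls length normalisation.
            tf = f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
            scores[t] = idf * tf
        out.append(scores)
    return out


# Hypothetical two-document corpus, for illustration only.
corpus = [["tumor", "tumor", "skin"], ["skin", "allergy"]]
weights = bm25_scores(corpus)
```

In this toy corpus "tumor" outranks "skin" in the first document because it is both more frequent there and rarer across documents.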
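Web-based: WAHI
Example term: « buschke lowenstein tumor »
Queries: the exact phrase « buschke lowenstein tumor », the words together (buschke lowenstein tumor), and each word alone (buschke, lowenstein, tumor)
nb = number of hits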
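A web-based score of this kind can be sketched from the hit counts of the queries on this slide. The combination below (a Dice-style ratio multiplied by the share of exact-phrase hits among co-occurrence hits) is an assumption made for illustration; the exact WAHI definition is given in the cited papers and may differ. The hit counts are hypothetical, and no search engine API is called.

```python
def dice_hits(phrase_hits, word_hits):
    """Dice-style ratio: exact-phrase hits against the hits of each word alone."""
    total = sum(word_hits)
    if total == 0:
        return 0.0
    return len(word_hits) * phrase_hits / total


def web_association(phrase_hits, and_hits, word_hits):
    """Assumed combination: Dice-style ratio times (phrase hits / AND hits).

    phrase_hits: nb of hits for the quoted phrase
    and_hits:    nb of hits for pages containing all the words, in any order
    word_hits:   nb of hits for each word queried alone
    """
    if and_hits == 0:
        return 0.0
    return dice_hits(phrase_hits, word_hits) * phrase_hits / and_hits


# Hypothetical hit counts for a three-word term, for illustration only.
score = web_association(phrase_hits=50, and_hits=100, word_hits=[1000, 2000, 3000])
```

A term whose words appear together mostly as the exact phrase gets a higher score, which is the intuition behind using hit counts for re-ranking.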
Experiments (quantitative evaluation)
Precision@k, with k = 100, 500, …, 20000
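Precision@k is the fraction of the k top-ranked candidate terms that are found in a reference terminology. A minimal sketch, with a hypothetical ranked list and reference set:

```python
def precision_at_k(ranked_terms, reference, k):
    """Share of the first k ranked terms that belong to the reference set."""
    if k <= 0:
        return 0.0
    hits = sum(1 for term in ranked_terms[:k] if term in reference)
    return hits / k


# Hypothetical ranking and reference terminology, for illustration only.
ranked = ["term a", "term b", "term c", "term d"]
reference = {"term a", "term c"}
p_at_2 = precision_at_k(ranked, reference, 2)
```

Dividing by k rather than by the number of returned terms means a ranking shorter than k is penalised, which is the usual convention.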
Experiments (qualitative evaluation)
Ontologos (http://www.ontologos-corp.com/corporate/index.php), VARAPP (http://www.varapp.org/)
500 candidate terms extracted from their documents
Objective: extraction of relevant biomedical terms (i.e. those which can be added to a biomedical terminology)
Precision
True Biomedical Terms 74.6 %
False Biomedical Terms 25.4 %
http://tubo.lirmm.fr:8080/ontologos/
bêta-2 mimétiques
bêta-2 agonistes
dosage des ige spécifiques
suivi des maladies allergiques
A few conclusions
Future work
Continue to move different prototypes into production
Release of the French Annotator
Find more use cases
Collaboration with the plant/agro community
Continue reusing and contributing to NCBO technology
Online resources
Web page: www.lirmm.fr/sifr
To be turned into a real small web site
Task & team: https://www.researchgate.net/projects
Feature removed by RG in February (to be replaced)
Code repository: https://github.com/sifrproject
13 developers
10 repositories
Publications: http://bit.ly/194ImnR
Direct link to the HAL-LIRMM platform with advanced search features