Improving Interoperability of Text Mining Tools with BioC
Ritu Khare, Chih-Hsuan Wei, Yuqing Mao, Robert Leaman, Zhiyong Lu
National Center for Biotechnology Information (NCBI), National Institutes of Health
1
¡ Motivation ¡ Our Text Mining Tools ¡ Building BioC Compatible Tools ¡ Results and Conclusions
2
¡ Building complex text mining applications requires combining different tools developed by different groups
¡ Each tool is developed independently § Group conventions: data representation, programming, execution environments
¡ Heterogeneity in data/text representations limits and slows down § tool interoperability, application development, and research and innovation.
3
EXISTING SOLUTIONS
¡ Unstructured Information Management Architecture (UIMA), 2004
¡ General Architecture for Text Engineering (GATE), 2009
¡ Steep learning curve
¡ Substantial development and re-development time
BIOC
¡ Minimal changes required to existing applications and datasets
¡ BioC family
§ XML formats to present text documents and annotations
§ Functions (C++, Java) to read/write documents in BioC format
4
¡ Motivation ¡ Our Text Mining Tools ¡ Building BioC Compatible Tools ¡ Results and Conclusions
5
6
[Figure: a PubMed abstract is processed by five tools, producing annotations for various bioconcepts]
DNorm: Disease Mentions with MEDIC IDs
tmVar: Mutation Mentions
SR4GN: Species Mentions with Taxonomy IDs
tmChem: Chemical Mentions
GenNorm: Gene Mentions with Entrez IDs
Concept Recognition and Annotation Toolkit
Input: PubMed Abstracts or Full-Text Articles
DNorm: Disease Mentions with MEDIC IDs (F-measure = 80.90%)
tmVar: Mutation Mentions (F-measure = 91.39%)
SR4GN: Species Mentions with Taxonomy IDs (F-measure = 85.42%)
tmChem: Chemical Mentions (F-measure = 88.27%)
GenNorm: Gene Mentions with Entrez IDs (F-measure = 92.89%)
Annotations with various BioConcepts
NER tool            Programming language   Method        Formats supported (out of PubMed/PMC XML, free text, PubTator format, GenNorm format)
tmChem (Chemical)   Java, Perl, C++        CRF           √ √
DNorm (Disease)     Java                   CRF           √ √
tmVar (Mutation)    Perl, C++              CRF           √ √ √
SR4GN (Species)     Perl                   Rule-based    √ √ √
GenNorm (Gene)      Perl                   Statistical   √ √ √
PubTator            Perl, JavaScript       Web server    √ √
7
8
¡ Official corpus for the BioCreative IV GO Task
¡ 200 full-text articles along with their Gene Ontology (GO) annotations
§ evidence sentences
§ gene/protein entities, GO terms, GO evidence codes
¡ Developed by expert GO curators via a web-based annotation tool
9
¡ Motivation ¡ The NCBI Text Mining Toolkit ¡ Building BioC Compatible Tools ¡ Results and Conclusions
10
¡ The BioC family
§ XML DTD
▪ how to present text documents and annotations (higher-level semantics)
§ C++ and Java libraries
▪ functions/classes to read/write documents in BioC format
¡ BioC recommendations
§ Full-text articles and annotations
▪ Present in BioC XML format
▪ Keep in separate files
§ Key file
▪ describes how data should be interpreted in the annotation file (lower-level semantics)
▪ needs to be created for a specific type of data
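A minimal sketch of what reading a BioC document can look like. The talk refers to the C++ and Java libraries; this sketch instead uses Python's standard xml.etree.ElementTree, which is a substitution made here for brevity. The sample XML follows the general BioC layout (collection, document, passage, annotation), but its text, IDs, and key file name (example.key) are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the BioC layout. The content, IDs, and the
# key file name are made up for illustration only.
SAMPLE = """<collection>
  <source>PubMed</source>
  <date>20130101</date>
  <key>example.key</key>
  <document>
    <id>123456</id>
    <passage>
      <infon key="type">title</infon>
      <offset>0</offset>
      <text>Aspirin and asthma.</text>
      <annotation id="0">
        <infon key="type">Chemical</infon>
        <location offset="0" length="7"/>
        <text>Aspirin</text>
      </annotation>
      <annotation id="1">
        <infon key="type">Disease</infon>
        <location offset="12" length="6"/>
        <text>asthma</text>
      </annotation>
    </passage>
  </document>
</collection>"""


def read_annotations(xml_text):
    """Return (document id, type, offset, length, mention) tuples."""
    root = ET.fromstring(xml_text)
    rows = []
    for doc in root.iter("document"):
        doc_id = doc.findtext("id")
        for ann in doc.iter("annotation"):
            # Infons are key/value pairs; the key file says how to read them.
            infons = {i.get("key"): i.text for i in ann.findall("infon")}
            loc = ann.find("location")
            rows.append((doc_id, infons.get("type"),
                         int(loc.get("offset")), int(loc.get("length")),
                         ann.findtext("text")))
    return rows


for row in read_annotations(SAMPLE):
    print(row)
```

The parser only knows the generic structure; what "type" or any other infon key means is left to the key file, which is the split between higher-level and lower-level semantics described above.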
11
¡ Steps taken to make our tools BioC-compliant
§ Created the key file
§ Modified the input/output formats of the tools
▪ Added the BioC format as a new option for input/output
¡ Challenges
§ Defining an appropriate key file
§ Offset calculation
§ Translating the web-based annotation file to a BioC annotation file (Unicode to ASCII conversion)
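The two computational challenges above can be sketched as follows. This is an illustration under stated assumptions, not the conversion the tools actually use: offsets are counted in characters from the start of the document, passages are separated by one character, and the Unicode to ASCII step replaces each character with exactly one ASCII character so that offsets stay aligned. The function names (to_ascii, passage_offsets, locate) are hypothetical.

```python
import unicodedata


def to_ascii(text):
    """Map text to ASCII one character at a time, keeping the length fixed.

    Accented letters lose their diacritics. Any character without a
    single-character ASCII base becomes '?', so offsets computed on the
    result still line up with the input.
    """
    out = []
    for ch in text:
        base = (unicodedata.normalize("NFKD", ch)
                .encode("ascii", "ignore").decode("ascii"))
        out.append(base if len(base) == 1 else "?")
    return "".join(out)


def passage_offsets(passages, separator_length=1):
    """Return the document-level start offset of each passage.

    Assumes passages are laid end to end with separator_length
    characters between them (an assumption of this sketch).
    """
    offsets, position = [], 0
    for text in passages:
        offsets.append(position)
        position += len(text) + separator_length
    return offsets


def locate(mention, passage_text, passage_offset):
    """Return (document-level offset, length) of the first match, or None."""
    index = passage_text.find(mention)
    if index < 0:
        return None
    return passage_offset + index, len(mention)


# "\u00e9" is a precomposed e with acute accent.
title = to_ascii("Caf\u00e9-au-lait spots in neurofibromatosis")
abstract = to_ascii("Neurofibromatosis is a genetic disorder.")
starts = passage_offsets([title, abstract])
print(title)
print(starts)
print(locate("genetic", abstract, starts[1]))
```

Keeping the converted text the same length as the input is a design choice of this sketch: it lets offsets recorded against the Unicode text be reused against the ASCII text without a mapping table.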
12
¡ Motivation ¡ Our Text Mining Tools ¡ Building BioC Compatible Tools ¡ Results and Conclusions
13
¡ Common key file for all tools since they are designed for similar types of data
14
id: PubMed id.
Passage: e.g., title, abstract
Offset of the passage
Id of the bioconcept
Offset of the bioconcept
Length of the bioconcept
Mention of the bioconcept
date: the time the annotation was created
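The fields listed above can be assembled into a BioC-style XML file roughly as follows. This is a sketch using Python's standard xml.etree.ElementTree rather than the BioC libraries; the function name build_collection, the dictionary keys, the default key file name, and the choice to store the bioconcept id as the annotation's id attribute are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def build_collection(pmid, date, passages, key_file="example.key"):
    """Build a BioC-style collection holding one document.

    passages is a list of (passage_type, offset, text, annotations),
    where annotations is a list of dicts with the keys
    id, type, offset, length, and mention (names chosen for this sketch).
    """
    collection = ET.Element("collection")
    ET.SubElement(collection, "source").text = "PubMed"
    ET.SubElement(collection, "date").text = date
    ET.SubElement(collection, "key").text = key_file
    document = ET.SubElement(collection, "document")
    ET.SubElement(document, "id").text = pmid
    for p_type, p_offset, p_text, annotations in passages:
        passage = ET.SubElement(document, "passage")
        ET.SubElement(passage, "infon", key="type").text = p_type
        ET.SubElement(passage, "offset").text = str(p_offset)
        ET.SubElement(passage, "text").text = p_text
        for ann in annotations:
            node = ET.SubElement(passage, "annotation", id=ann["id"])
            ET.SubElement(node, "infon", key="type").text = ann["type"]
            ET.SubElement(node, "location",
                          offset=str(ann["offset"]),
                          length=str(ann["length"]))
            ET.SubElement(node, "text").text = ann["mention"]
    return collection


example = build_collection(
    "123456", "20130101",
    [("title", 0, "Aspirin and asthma.",
      [{"id": "0", "type": "Chemical", "offset": 0,
        "length": 7, "mention": "Aspirin"}])])
print(ET.tostring(example, encoding="unicode"))
```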
NER tool   Bioconcept   Formats supported (out of PubMed/PMC XML, BioC, free text, PubTator, GenNorm)
tmChem     Chemical     √ √ √
DNorm      Disease      √ √ √
tmVar      Mutation     √ √ √ √
SR4GN      Species      √ √ √ √
GenNorm    Gene         √ √ √ √
PubTator   N/A          √ √ √
15
Our Text Mining Toolkit available for public access: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/
16
[Figure: a BioC article file is processed by the five tools, producing a BioC annotation file]
DNorm: Identifying Disease
tmVar: Identifying Mutation
tmChem: Identifying Chemical
SR4GN: Identifying Species
GenNorm: Identifying Gene
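The flow in this figure, where every tool reads the same document representation and adds its own annotations, can be sketched as below. The taggers are toy dictionary matchers that merely stand in for DNorm, tmChem, and the other tools; the class names, function names, and lexicons are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    concept_type: str
    offset: int
    length: int
    mention: str


@dataclass
class Document:
    doc_id: str
    text: str
    annotations: list = field(default_factory=list)


def make_tagger(concept_type, lexicon):
    """Build a toy, case-sensitive dictionary tagger (a stand-in for a real tool)."""
    def tag(document):
        for term in lexicon:
            start = document.text.find(term)
            while start >= 0:
                document.annotations.append(
                    Annotation(concept_type, start, len(term), term))
                start = document.text.find(term, start + len(term))
        return document
    return tag


def run_pipeline(document, taggers):
    """Pass one shared document through each tagger in turn."""
    for tag in taggers:
        document = tag(document)
    document.annotations.sort(key=lambda a: a.offset)
    return document


doc = Document("123456", "Aspirin can trigger asthma in mice.")
taggers = [
    make_tagger("Disease", ["asthma"]),
    make_tagger("Chemical", ["Aspirin"]),
    make_tagger("Species", ["mice"]),
]
for ann in run_pipeline(doc, taggers).annotations:
    print(ann)
```

Because every tagger accepts and returns the same Document type, tools can be added, removed, or reordered without format conversion between steps, which is the interoperability benefit the shared format is meant to provide.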
17
id: PubMed id.
passage: title
date: the time the file was downloaded
passage: abstract
18
Id of the bioconcept
Offset of the bioconcept
Length of the bioconcept
Mention of the bioconcept
Type of the bioconcept
Time: the time the annotation was created
ID: PMID of the article.
GO term: e.g., receptor-‐mediated endocytosis
GO evidence code: e.g., Inferred from Mutant Phenotype (IMP)
Curatable entity: i.e., gene or gene product
Text: GO evidence text
¡ Our experience with BioC § Minimal changes required to prepare BioC versions § Easy to learn and use § Improved interoperability within the toolkit
¡ Implications § Improved interoperability ▪ With other tools to build sophisticated applications
§ The key file could evolve into a standard for concept recognition and normalization tasks
§ Anticipate broader usage of our tools as BioC gains popularity
20
¡ BioC Developers § W. John Wilbur § Rezarta Islamaj Doğan § Donald Comeau
¡ Intramural Research Program of the NIH, National Library of Medicine
21
¡ Chih-Hsuan Wei § [email protected] § +1 301-594-5290
22