OpenAIRE-COAR Conference 2014: Argo - a platform for interoperable and customisable text analytics,...
Presentation at the OpenAIRE-COAR Conference: "Open Access Movement to Reality: Putting the Pieces Together", Athens, May 21-22, 2014. Argo: a platform for interoperable and customisable text analytics, by Sophia Ananiadou, Director, National Centre for Text Mining, School of Computer Science, University of Manchester.
Argo: a platform for interoperable and customisable text mining
Sophia Ananiadou
National Centre for Text Mining
School of Computer Science
The University of Manchester
Overview
• Sharing tools, resources and text mining workflows
• Challenges
• Interoperable infrastructure for processing and annotation
2 Open AIRE-COAR Conference Ananiadou
NaCTeM
• 1st publicly funded national text mining centre
• Location: Manchester Institute of Biotechnology
• Phase I - Biology (2004-2008)
• Phase II - Biology, Medicine, Social Sciences (2008-2011)
• Phase III – Biology, Medicine, Humanities, Social Sciences; Fully sustainable centre (2011-)
www.nactem.ac.uk
Challenges
Language Technology
Languages: English, French, German, Spanish, Portuguese, Italian, Polish, …, Chinese, Hindi, Urdu, Japanese, Korean, …
Tasks: Translation, Information Extraction, Semantic Search, Question Answering, Sentiment Analysis, Summarization, Knowledge Discovery, …
Domains: Finance/Business, Health, Biology, Social Sciences, Humanities, …
Text Types: Newswire, Scientific Literature, Full papers/abstracts, Twitter, Patents, Clinical records, EMR, Textbooks, monographs, Online forums, …
Technology: Sentence Splitter, Paragraph Splitter, NP Chunkers, C-parser, D-parser, Semantic parser, NE recognizers, Relation recognizers, …
Diversity of Languages
Diversity of Contexts
Diversity of Applications
TM Workflows
TM Modules
Shared!
Metadata
Languages: English, French, German, Spanish, Portuguese, Italian, Polish, …, Chinese, Hindi, Urdu, Japanese, Korean, …
Tasks: Translation, Information Extraction, Semantic Search, Question Answering, Sentiment Analysis, Summarization, Knowledge Discovery, …
Language Technology
Linguistic Resources, Knowledge Resources
Resource-Rich
Big Data, Big Text
Cloud Computing, Crowd Sourcing
Big Ontology
Text Types: Newswire, Scientific Literature, Full papers/abstracts, Twitter, Patents, Clinical records, EMR, Textbooks, monographs, Online forums, …
Domains: Finance/Business, Health, Biology, Social Sciences, Humanities, …
OPEN SCIENCE
Requirements from TM infrastructure
• Modularity of TM modules
• Interoperability among TM modules and resources
• Generic across different languages, domains, and text types
– Adaptability
Interoperability and Adaptability
[Diagram: Modules (POS Tagger, Dependency Parser, Named Entity); Resources (Dictionaries, Ontologies); Adaptation (Rule Writing, (Annotated) Text); Languages (English, French, German, Japanese, Greek); Text Types; Domains]
Interoperability and Adaptability in Resource-rich TM INFRASTRUCTURES!
Example: extracting proteins, annotations
• Corpora: GENIA, PennBioIE, AIMed, GENETAG
• Incompatibility: type definitions, texts
• Problem: Inconsistency
The problem with incompatibility
• Difficult to evaluate NERs
• Which NER is best for my task?
• Two NERs (NER A, NER B) evaluated on two corpora (Corpus C, Corpus D):
– On one corpus: A: 93%, B: 36%. A is better than B.
– On the other: A: 63%, B: 90%. B is better than A.
• Why do results differ so much across corpora and NERs?
Text mining workflows
• A pipeline that executes particular tools and resources in order
• Example: semantic search
• Various versions (language- or domain-specific) of basic components needed for different applications and tasks
• Different workflows can be created, compared and evaluated by the ability to seamlessly “mix and match” various versions of components
[Diagram: PoS Tagger, Dictionary Lookup, NE Extraction, Chunking, Parsing, Semantic Query]
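The notion of a workflow as a pipeline of interchangeable components run in order can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Argo or U-Compare code: the component functions (`split_sentences`, `tokenize`, `lookup_dictionary`), the `run_workflow` helper and the two-entry protein dictionary are all hypothetical.

```python
from typing import Callable, Dict, List

# A document is a dict of annotation layers; each component adds one layer.
Document = Dict[str, object]
Component = Callable[[Document], Document]

def split_sentences(doc: Document) -> Document:
    # Naive splitter: break on full stops (illustration only).
    text = str(doc["text"])
    doc["sentences"] = [s.strip() for s in text.split(".") if s.strip()]
    return doc

def tokenize(doc: Document) -> Document:
    # Whitespace tokenisation of each sentence.
    doc["tokens"] = [s.split() for s in doc["sentences"]]
    return doc

def lookup_dictionary(doc: Document) -> Document:
    # Toy dictionary lookup standing in for a domain-specific NE component.
    proteins = {"p53", "BRCA1"}
    doc["entities"] = [t for sent in doc["tokens"] for t in sent if t in proteins]
    return doc

def run_workflow(components: List[Component], text: str) -> Document:
    # Execute the components in the given order on a fresh document.
    doc: Document = {"text": text}
    for component in components:
        doc = component(doc)
    return doc

workflow = [split_sentences, tokenize, lookup_dictionary]
result = run_workflow(workflow, "p53 binds DNA. BRCA1 repairs DNA.")
print(result["entities"])
```

Swapping `lookup_dictionary` for another function with the same signature gives a different workflow without touching the other steps, which is the "mix and match" property in miniature.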
Text mining workflows
Interoperability
Common Data Representation and Types
Kano, Y., Miwa, M., Cohen, K. B., Hunter, L., Ananiadou, S. and Tsujii, J. (2011). U-Compare: a modular NLP workflow construction and evaluation system. IBM Journal of Research and Development.
Common Type System
• A common type system is required for complete interoperability
• A single common type system is almost impossible to impose on all developers
• Solution: Maintain local type systems and bridge them via a sharable type system
[Diagram: in U-Compare, Local Type System A and Local Type System B are each bridged to the Sharable Type System]
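Bridging local type systems through a sharable one can be illustrated with a simple mapping. The sketch below is an assumption-laden toy, not the U-Compare type system: the type names (`ProteinMention`, `protein_molecule`, `Protein`, and so on) and the `to_sharable` helper are hypothetical.

```python
from typing import Dict, List, Tuple

# An annotation is (start offset, end offset, type name).
Annotation = Tuple[int, int, str]

# Hypothetical bridges from two local type systems to one sharable type system.
BRIDGE_A: Dict[str, str] = {"ProteinMention": "Protein", "Tok": "Token"}
BRIDGE_B: Dict[str, str] = {"protein_molecule": "Protein", "word": "Token"}

def to_sharable(annotations: List[Annotation], bridge: Dict[str, str]) -> List[Annotation]:
    """Map local type names to sharable ones, dropping types with no bridge."""
    return [(start, end, bridge[t]) for (start, end, t) in annotations if t in bridge]

# The same protein mention, annotated by tools using different local types.
from_a = to_sharable([(0, 3, "ProteinMention")], BRIDGE_A)
from_b = to_sharable([(0, 3, "protein_molecule"), (4, 9, "LocalOnly")], BRIDGE_B)

# After bridging, both outputs can be compared directly.
print(from_a == from_b)
```

Each developer keeps a local type system and only maintains one bridge to the sharable types, rather than a bridge to every other tool.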
U-Compare Type System
Syntactic Level
Document Level
Semantic Level
U-Compare: Evaluate and Compare TM Workflows
[Diagram: a library of components (Sentence Splitter A, Sentence Splitter B, POS tagger A, POS tagger B, NER) is combined into Workflow A, Workflow B and Workflow C, which are compared by F-Score A, F-Score B and F-Score C]
Example components:
• SD: UIMA SD, OpenNLP SD, GENIA SD
• Tokenizers: UIMA Tokenizer, OpenNLP Tokenizer, GENIA Tagger as Tokenizer
• Taggers: GENIA Tagger, Stepp Tagger, OpenNLP Tagger
• NER: ABNER, MedT-NER, GENIA Tagger as NER
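Comparing every combination of interchangeable components by F-score can be sketched as below. This is a toy under stated assumptions, not U-Compare itself: the two splitters, the two rule-based recognisers, the one-sentence text and its gold set are all hypothetical stand-ins for real components and corpora.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

def splitter_a(text: str) -> List[str]:
    # Splits on whitespace only.
    return text.split()

def splitter_b(text: str) -> List[str]:
    # Also splits hyphenated tokens.
    return text.replace("-", " ").split()

def ner_upper(tokens: List[str]) -> Set[str]:
    # Toy rule: tokens whose letters are all upper case.
    return {t for t in tokens if t.isupper()}

def ner_digit(tokens: List[str]) -> Set[str]:
    # Toy rule: tokens containing a digit.
    return {t for t in tokens if any(c.isdigit() for c in t)}

def f_score(predicted: Set[str], gold: Set[str]) -> float:
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

text = "BRCA1-p53 interaction affects DNA repair"
gold = {"BRCA1", "p53"}

# Build one workflow per (splitter, recogniser) pair and score it against gold.
scores: Dict[Tuple[str, str], float] = {}
for split, ner in product([splitter_a, splitter_b], [ner_upper, ner_digit]):
    scores[(split.__name__, ner.__name__)] = f_score(ner(split(text)), gold)

best = max(scores, key=scores.get)
print(best, scores[best])
```

Because the upstream splitter changes what the recogniser sees, the same recogniser scores differently in different workflows, which is why whole workflows are compared rather than single components.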
• Web-based application
• Interactive creation of workflows
• Cloud and high-performance computing
• Integrated TM/NLP processing system
• GUI for workflow creation
• Library of ready-to-use processing components
• Statistics, visualizations, developer APIs
• Supports UIMA
• http://argo.nactem.ac.uk
Rak, R., Rowley, A., Black, W.J. and Ananiadou, S. (2012). Argo: an integrative, interactive, text mining-based workbench supporting curation. Database: The Journal of Biological Databases and Curation.
[Diagram: Structured Data; Remote Processing; Workflow Diagramming (Workflow Designer); Manual Editing (Annotator/Curator); Processing Components (Developers); UIMA Compliance]
Processing Components
• Approaching 100 components (U-Compare)
– Additional 50 will be added soon
• META-NET
• Developed or co-developed by NaCTeM
– Planned: Make the library open to others to contribute
• Generic Listener component
– Developers can plug in their own locally run UIMA component to a workflow in Argo
Remote Processing
• Single machine execution
– In-house high-performance machines
• Distributed processing
– HTCondor
– VMware vCloud (EBI) EUPMC
– Planned: EC2, Azure, …
Workflows
• Users create workflows as block diagrams
• Workflows can be shared among users
– Read only
– Planned: Read & write
– Planned: downloadable workflows
• Workflows can be deployed as web services
– Plain text (input only), XMI, RDF, BioC
Workflows view
Workflow Editor
Sample Use Cases
1 Recognition of chemical entities (chemical NER)
2 Semi-automatic curation of metabolic pathways
3 Evaluation of inter-annotator agreement
4 Information extraction as a Web service
Use Case 1: Chemical NER
• Supplies gold standard corpus
• Removes gold annotations so that they can be created automatically
• Combinations of syntactic and semantic components create annotations
• Compares and reports precision, recall and F1 of the different branches against the gold standard corpus
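A branching workflow of this kind can be modelled as a directed acyclic graph of blocks, from which a valid execution order follows. The sketch below uses Python's standard `graphlib` module; the block names and the edges between them are assumptions made for illustration and do not reflect how Argo stores or schedules workflows.

```python
from graphlib import TopologicalSorter

# Each block maps to the set of blocks whose output it consumes.
# Block names are hypothetical labels for the four steps of the use case.
workflow = {
    "gold_corpus_reader": set(),
    "annotation_remover": {"gold_corpus_reader"},
    "branch_1_recogniser": {"annotation_remover"},
    "branch_2_recogniser": {"annotation_remover"},
    # The evaluator needs the gold annotations and the output of every branch.
    "evaluator": {"gold_corpus_reader", "branch_1_recogniser", "branch_2_recogniser"},
}

# static_order() yields the blocks so that every block follows its inputs.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

The two branches have no dependency on each other, so they may appear in either order and could in principle run in parallel.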
Chemical Entity Recogniser
• Chemical model evaluated at BioCreative IV CHEMDNER challenge
• The challenge
– Data: 10,000 manually annotated PubMed abstracts
– Task: automatically recognise names of chemical entities in text
Chemical Entity Recogniser
• Our solution
– Ranked unique mentions: ranked 1st out of 18 groups
– All mentions: ranked 3rd out of 19 groups
Subtask                  Precision %   Recall %   F-score %
Ranked unique mentions   91            85         88
All mentions             93            81         87
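Assuming the F-score here is the usual balanced F1, the harmonic mean of precision and recall, the reported figures can be checked with a short calculation. The `f1` helper is a name introduced only for this sketch.

```python
def f1(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Precision and recall (in %) as reported for the two subtasks.
rows = {
    "Ranked unique mentions": (91, 85),
    "All mentions": (93, 81),
}

rounded = {name: round(f1(p, r)) for name, (p, r) in rows.items()}
for name, value in rounded.items():
    print(name, value)
```

The rounded values agree with the F-score column (88 and 87).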
Use Case 2: Semi-automatic Curation – Metabolic Pathways
• Search for relevant documents
• Manual correction of automatic annotations
• NER for chemicals, genes, process indicators
• Linking to ontologies: CTD, ChEBI, UniProt
• Save results in various formats, e.g., RDF for querying and incorporation into databases
Manual Annotation Editor
• Create new annotations by selecting text
• Create, modify or delete annotations
• Edit details of annotations
• Open a graphical interface to link annotations to ontologies
Filtering and converting annotations
Manual Annotation Editor: linking to ontologies
• Automatic pre-selection can be modified by the user
• Details show ontology entry webpage
Use Case 4: Information extraction as a Web service
Web service-enabled reader
Web service-enabled writer
Language Universal
• Reusable modules
• Generic TM modules: Competence
• Annotated Text, corpora: Performance
• Standards of Data Representation and Types for Resources: Competence
• Dictionaries, Thesauri, Ontologies: Performance