Automatic Term Recognition with Apache Solr
• There is a real lack of open-source tools that facilitate the development of downstream applications, encourage code reuse, enable comparative studies, and foster further research
• Existing tools are developed under different scenarios and evaluated in different domains using proprietary language resources, making comparison difficult.
• It is unclear whether and how well these tools can adapt to different domain tasks and scale up to large datasets.
• Automatic Term Extraction, also known as Automatic Term Recognition (ATE/ATR), is an important Natural Language Processing (NLP) task that deals with the extraction of terminology from domain-specific textual corpora.
• ATE is widely used by both industry and researchers in many complex tasks, such as Information Retrieval (IR), machine translation, ontology engineering and text summarization (Bowker, 2003; Brewster et al., 2007; Maynard et al., 2007).
JATE2.0 Architecture
Ziqi Zhang and Jie Gao
1. JATE2.0 is an open-source library (under the LGPLv3 license), available to download via https://github.com/ziqizhang/jate
2. For more examples of JATE2.0 usage scenarios and ideas, please refer to the JATE2.0 wiki.
Contact: Jie Gao [email protected]; Ziqi Zhang [email protected]
OAK Group, Department of Computer Science, University of Sheffield, Sheffield, S1 4DP, United Kingdom
Example setting of Part-of-Speech (PoS) pattern based candidate extraction
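The poster shows a configuration figure at this point. As an illustration of the underlying idea, the sketch below extracts candidate terms whose tag sequence matches a noun-phrase-like PoS pattern ("runs of adjectives/nouns ending in a noun"). The pattern, tag set, and function names are illustrative assumptions, not JATE2.0's actual configuration format.

```python
# A minimal, hypothetical sketch of PoS-pattern-based candidate extraction;
# the tag sets below (Penn Treebank style) are assumptions for illustration.

def extract_candidates(tagged_tokens,
                       pattern_tags=frozenset({"JJ", "NN", "NNS"}),
                       head_tags=frozenset({"NN", "NNS"})):
    """Collect maximal runs of pattern_tags that end with a head (noun) tag.

    tagged_tokens: list of (word, PoS-tag) pairs for one sentence.
    """
    candidates, run = [], []
    for word, tag in list(tagged_tokens) + [("", "<END>")]:  # sentinel flushes the last run
        if tag in pattern_tags:
            run.append((word, tag))
        else:
            # Trim trailing tokens that cannot head a term (e.g. a dangling adjective).
            while run and run[-1][1] not in head_tags:
                run.pop()
            if run:
                candidates.append(" ".join(w for w, _ in run))
            run = []
    return candidates
```

On the tagged sentence "the/DT automatic/JJ term/NN recognition/NN uses/VBZ apache/NN solr/NN" this yields ["automatic term recognition", "apache solr"].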
Acknowledgements
Unique Features
Use cases
Usage Modes
ATE algorithms in JATE2.0 (beta)
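The transcript does not list the algorithms themselves (see the project wiki for the full set). As one representative example of the family of measures involved, below is a minimal sketch of the well-known C-value score (Frantzi et al., 2000), which weights a candidate's frequency by its length and discounts occurrences nested inside longer candidates. Variable names and the +1 length smoothing (so unigrams score above zero) are my own choices, not necessarily JATE2.0's implementation.

```python
import math
from collections import defaultdict

def c_value(freq):
    """freq: dict mapping a candidate term (tuple of words) -> corpus frequency.
    Returns dict: term -> C-value score (Frantzi et al., 2000)."""
    # For each term, collect frequencies of longer candidates containing it
    # as a contiguous subsequence.
    nested_in = defaultdict(list)
    terms = list(freq)
    for a in terms:
        for b in terms:
            if len(b) > len(a) and any(
                    b[i:i + len(a)] == a for i in range(len(b) - len(a) + 1)):
                nested_in[a].append(freq[b])
    scores = {}
    for a in terms:
        length_weight = math.log2(len(a) + 1)  # +1 smoothing: a variant choice
        if nested_in[a]:
            scores[a] = length_weight * (freq[a] - sum(nested_in[a]) / len(nested_in[a]))
        else:
            scores[a] = length_weight * freq[a]
    return scores
```

For example, "automatic term extraction" (non-nested, frequency 3) outranks "term" (frequency 10 but heavily nested in longer candidates), which is exactly the behaviour C-value is designed for.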
Evaluation
• Two datasets: the GENIA dataset (Kim et al., 2003), a corpus of 1,999 Medline abstracts for bio-text mining previously used by (Zhang et al., 2008); and the ACL RD-TEC dataset (Zadeh and Handschuh, 2014), containing over 10,900 publications in the domain of computational linguistics
• 3 types of candidate extractors are tested (NP, N-gram, POS pattern)
• Overall recall, precision at Top K, and CPU time are measured
Figure 5: Comparison of Top K precisions on ACL RD-TEC
Part of this research has been sponsored by the EU-funded project WeSenseIt under grant agreement number 308429, and the SPEEAK-PC collaboration agreement 101947 of Innovate UK.
Terminology-driven Faceted Search for interactive cause analysis
ATE in combination with sentiment analysis
• ATE used to improve sentiment analysis used by homeland security forces (both English and Italian)
• Training corpus collection and annotation based on distant supervision
• ATE for text normalization & standardization, key term extraction (uni-/bi-gram) from the corpus
• Key terms used as features to train sentiment classifiers (SVM, Naïve Bayes, logistic regression)
JATE2.0 for Translation
ATE is a very useful starting point for a human terminologist or translator. JATE2.0 works efficiently with very large corpora, and is easy to use and highly configurable for different domains and languages. With more than 10 algorithms, JATE2.0 can simply take a large corpus as input; important, domain-specific terms are then identified, extracted, normalised, ranked and exported with scores to an external file.
JATE2.0 for knowledge engineering
JATE2.0 can be used as a concept extraction tool to support the creation of a domain ontology or a terminology base directly from a text corpus. Users can take a domain-specific corpus as input and use JATE2.0 to generate normalised candidate terms/concepts as a starting point for further ontology engineering. Future versions will support importing the output into Protégé, or running as a Protégé plugin.
To bring both academia and industry under a uniform development and benchmark framework that addresses:
• Adaptability
• Scalability
• High configurability and extensibility
Solution: JATE2.0 integrates with the Apache Solr framework to benefit from its extensive, extensible and flexible text processing libraries; it can be used either as a separate module, or as a Solr plugin applied during document processing to enrich the indexed documents with candidate terms.
• Expands JATE 1.0's collection of state-of-the-art algorithms, which are not available in any other tool;
• Linguistic processors (candidate term extraction) are highly customizable and developed as Solr plugins, hence making JATE2.0 adaptable to many different domains and languages;
• Two usage modes cover various usage scenarios, and can be applied directly to digital archives (both indexed and not yet indexed) in industry;
Embedded mode: runs as a standalone application from the command line. This mode is recommended when users need a list of candidate terms extracted from a corpus to support a subsequent knowledge engineering task.
Plugin mode: works as a Solr plugin. This mode is recommended when users need to index new documents, or enrich an existing index, with candidate terms, which can e.g. support faceted search and query boosting (implemented as a custom request handler that processes term extraction via a simple HTTP request)
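To make the plugin mode concrete, the sketch below composes the kind of HTTP GET request such a custom Solr request handler would receive. The handler path `/termRecognise`, the core name, and the parameters are hypothetical placeholders, not JATE2.0's documented API; consult the project wiki for the actual endpoint.

```python
from urllib.parse import urlencode

def term_extraction_url(solr_base, core, algorithm):
    """Build a GET URL for a hypothetical term-extraction request handler.

    The path and parameter names are illustrative assumptions only.
    """
    params = urlencode({"algorithm": algorithm, "wt": "json"})
    return f"{solr_base}/solr/{core}/termRecognise?{params}"

url = term_extraction_url("http://localhost:8983", "jateCore", "CValue")
```

The point is simply that plugin mode needs nothing beyond a plain HTTP client (curl, a browser, or any language's HTTP library) once the handler is registered in Solr's configuration.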
Introduction
Objective
Photo credit to K-NOW
1. Parses ingested documents into raw text content and performs character-level normalisation
2. The 'cleansed' text is then passed through the candidate extraction component (as a Solr analyzer chain)
3. Candidate terms are loaded from the Solr index and processed by the subsequent filtering component, where different ATE algorithms can be configured
4. Candidate terms can be indexed or exported to support specific use cases (e.g., faceted query, knowledge base construction)
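The four steps above can be sketched end-to-end. This is a minimal, self-contained illustration of the data flow (normalise → extract candidates → score/filter → export), not JATE2.0's actual Java/Solr implementation: the whitespace bigram extractor and plain frequency score are stand-ins for the real analyzer chains and ATE algorithms.

```python
import re
from collections import Counter

def normalise(raw):
    """Step 1: strip markup-like noise, lowercase, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", text).lower().strip()

def extract_ngrams(text, n=2):
    """Step 2: stand-in extractor -- word n-grams (JATE2.0 uses Solr analyzer chains)."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def score(candidates, min_freq=2):
    """Step 3: stand-in scoring -- raw frequency with a cutoff (real ATE algorithms go here)."""
    counts = Counter(candidates)
    return {t: c for t, c in counts.items() if c >= min_freq}

def export(scores):
    """Step 4: export ranked (term, score) rows for indexing or file output."""
    return sorted(scores.items(), key=lambda kv: -kv[1])

doc = "<p>Term extraction helps term recognition. Term extraction scales.</p>"
ranked = export(score(extract_ngrams(normalise(doc))))
```

Each stage is a pluggable function, mirroring how JATE2.0 lets the candidate extractor and the scoring algorithm be swapped independently via configuration.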
Figure 4: Comparison of Top K precisions on GENIA
• TATA Steel scenario: cause analysis via text analytics
• To understand the types of potential factors and actions that lead to product failures
• Users (domain experts) collect and select unstructured documentation (e.g., Lotus Notes) from various data sources
• JATE 2.0 applied to the documents to extract industrial terms for analyzing and linking domain-relevant concepts from textual data
• Terms used to enable dynamic faceted search/navigation for concept-driven text analytics