Automatic Term Recognition with Apache Solr


Uploaded by jie-gao on 21-Mar-2017. Category: Engineering
TRANSCRIPT

Page 1: Automatic Term Recognition with Apache Solr

•  There is a real lack of open-source ATE tools that facilitate the development of downstream applications, encourage code reuse, enable comparative studies, and foster further research

•  Existing tools are developed under different scenarios and evaluated in different domains using proprietary language resources, making direct comparison difficult.

•  It is unclear whether, and how well, these tools can adapt to different domain tasks and scale to large datasets.

•  Automatic Term Extraction (also Automatic Term Recognition, ATE/ATR) is an important Natural Language Processing (NLP) task concerned with extracting terminology from domain-specific textual corpora.

•  ATE is widely used by both industry and researchers in many complex tasks, such as Information Retrieval (IR), machine translation, ontology engineering and text summarization (Bowker, 2003; Brewster et al., 2007; Maynard et al., 2007).

JATE2.0 Architecture

Automatic Term Recognition with Apache Solr Ziqi Zhang and Jie Gao

1.  JATE2.0 is an open-source library (under LGPLv3 licence), available to download via https://github.com/ziqizhang/jate
2.  For more examples of JATE2.0 usage scenarios and ideas, please refer to the JATE2.0 wiki.

Contact: Jie Gao [email protected]; Ziqi Zhang [email protected]

OAK Group, Department of Computer Science, University of Sheffield, Sheffield, S1 4DP, United Kingdom

Example setting of Part-of-Speech (PoS) pattern based candidate extraction
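The example configuration itself did not survive extraction, but the idea behind PoS-pattern candidate extraction can be sketched in Python (an illustrative stand-in, not JATE2.0's actual Solr analyzer configuration): keep maximal runs of adjective/noun tokens, trimmed so each candidate ends in a noun, mimicking a pattern such as (JJ|NN)* NN.

```python
# Illustrative tag sets; a real configuration would cover the full Penn tagset.
NOUN_TAGS = {"NN", "NNS", "NNP"}
ADJ_TAGS = {"JJ"}

def extract_candidates(tagged, min_words=2):
    """Extract multi-word candidates whose tags match (JJ|NN)* NN.

    tagged: list of (word, tag) pairs from any PoS tagger.
    """
    candidates, run = [], []

    def flush():
        # Trim trailing adjectives so the candidate ends in a noun.
        r = list(run)
        while r and r[-1][1] in ADJ_TAGS:
            r.pop()
        if len(r) >= min_words:
            candidates.append(" ".join(w for w, _ in r))

    for word, tag in tagged:
        if tag in NOUN_TAGS or tag in ADJ_TAGS:
            run.append((word, tag))
        else:
            flush()
            run.clear()
    flush()
    return candidates
```

In JATE2.0 itself this logic lives in the configurable Solr analyzer chain rather than standalone code.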

Acknowledgements

Unique Features

Use cases

Usage Modes

ATE algorithms in JATE2.0 (beta)
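The algorithm list itself appears in a poster figure that did not survive extraction; among the classic termhood measures JATE implements is C-value (Frantzi et al.). A simplified, illustrative sketch of the C-value idea, scoring longer terms higher while penalising candidates nested inside longer candidates:

```python
from math import log2

def c_value(freq):
    """Simplified C-value over {candidate term: corpus frequency}.

    A term nested inside longer candidates is penalised by the average
    frequency of those longer candidates, following Frantzi et al.'s measure.
    """
    scores = {}
    for a in freq:
        # Longer candidates containing `a` as a contiguous word sequence.
        nests = [b for b in freq if b != a and f" {a} " in f" {b} "]
        n_words = len(a.split())
        # log2 length weight; the unigram constant is a smoothing choice
        # added here for illustration, not part of the original formula.
        base = log2(n_words) if n_words > 1 else 0.1
        if nests:
            scores[a] = base * (freq[a] - sum(freq[b] for b in nests) / len(nests))
        else:
            scores[a] = base * freq[a]
    return scores
```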

Evaluation

•  Two datasets: the GENIA dataset (Kim et al., 2003), a corpus of 1,999 MEDLINE abstracts for bio-text mining previously used by Zhang et al. (2008); and the ACL RD-TEC dataset (Zadeh and Handschuh, 2014), containing over 10,900 publications in the domain of computational linguistics

•  Three types of candidate extractors are tested: noun phrase (NP), N-gram, and PoS pattern

•  Overall recall, precision at Top K, and CPU time are measured
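The two quality metrics above are standard; a minimal sketch of how they are computed, assuming a ranked list of extracted terms and a gold-standard term set:

```python
def precision_at_k(ranked_terms, gold, k):
    """Fraction of the top-k ranked candidates found in the gold standard."""
    top = ranked_terms[:k]
    return sum(1 for t in top if t in gold) / len(top)

def overall_recall(extracted, gold):
    """Fraction of gold-standard terms recovered anywhere in the output."""
    return len(set(extracted) & set(gold)) / len(gold)
```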

Figure 5: Comparison of Top K precisions on ACL RD-TEC

Part of this research has been sponsored by the EU-funded project WeSenseIt under grant agreement number 308429, and by the SPEEAK-PC collaboration agreement 101947 of Innovate UK.

Terminology-driven Faceted Search for interactive cause analysis

ATE in combination with sentiment analysis

•  ATE used to improve sentiment analysis deployed by homeland security forces (both English and Italian)
•  Training corpus collection and annotation based on distant supervision
•  ATE for text normalization and standardization, and key term extraction (uni-/bi-grams) from the corpus
•  Key terms used as features to train sentiment classifiers (SVM, Naïve Bayes, logistic regression)
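The final step above, turning extracted key terms into classifier features, might look like the following sketch (an illustration of the idea, not the project's actual feature pipeline):

```python
def term_features(text, term_vocab):
    """Binary feature vector: 1 if an extracted key term occurs in the text.

    Vectors like these can be fed to any off-the-shelf classifier
    (SVM, Naïve Bayes, logistic regression) as in the use case above.
    """
    padded = f" {text.lower()} "
    return [1 if f" {term} " in padded else 0 for term in term_vocab]
```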

JATE2.0 for Translation

ATE is a very useful starting point for a human terminologist or translator. JATE2.0 can work with very large corpora efficiently, and it is easy to use and highly configurable for different domains and languages. With more than 10 algorithms, JATE2.0 can simply be run over a large corpus as input: important, domain-specific terms will be identified, extracted, normalised, ranked and exported with scores into an external file.
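The export step described above might be sketched like this (the CSV layout is a hypothetical example; JATE2.0's actual output format may differ):

```python
import csv

def export_terms(scores, path):
    """Write ranked terms and their scores to a CSV file, highest score first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["term", "score"])
        writer.writerows(ranked)
    return ranked
```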

JATE2.0 for knowledge engineering

JATE2.0 can be used as a concept extraction tool to support the creation of a domain ontology or terminology base directly from a text corpus. Users can take a domain-specific corpus as input and use JATE2.0 to generate normalised candidate terms/concepts as a starting point for further ontology engineering. A future version will support importing the output into Protégé, or working as a Protégé plugin.

To bring both academia and industry under a uniform development and benchmark framework that addresses:

•  Adaptability
•  Scalability
•  High configurability and extensibility

Solution: JATE2.0 integrates with the Apache Solr framework to benefit from its extensive, extensible and flexible text processing libraries; it can be used either as a separate module, or as a Solr plugin applied during document processing to enrich the indexed documents with candidate terms.

•  Expands JATE 1.0's collection of state-of-the-art algorithms, which are not available in any other tool;

•  Linguistic processors (candidate term extraction) are highly customizable and developed as Solr plugins, making JATE2.0 adaptable to many different domains and languages;

•  Two usage modes suit various usage scenarios and can be applied directly to digital archives (both indexed and not yet indexed) in industry;

Embedded mode: runs as a standalone application from the command line. This mode is recommended when users need a list of candidate terms extracted from a corpus to support subsequent knowledge engineering tasks.

Plugin mode: works as a Solr plugin. This mode is recommended when users need to index new documents, or enrich an existing index, with candidate terms, which can, e.g., support faceted search and query boosting (implemented as a custom request handler that triggers term extraction via a simple HTTP request)
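Triggering extraction in plugin mode amounts to a plain HTTP request to the custom request handler. A minimal sketch of building such a request; the base URL, core name, handler path and parameter names here are all illustrative placeholders, not JATE2.0's documented API:

```python
from urllib.parse import urlencode

# Assumed local Solr base URL and core name; adjust for your deployment.
SOLR_BASE = "http://localhost:8983/solr/jate"

def build_termextract_url(handler="/termextract", **params):
    """Build the URL for a custom term-extraction request handler.

    The handler path and parameter names are hypothetical; the real
    handler registered in solrconfig.xml may be named differently.
    """
    query = urlencode(params)
    return f"{SOLR_BASE}{handler}?{query}" if query else f"{SOLR_BASE}{handler}"
```

Fetching the resulting URL (e.g. with `urllib.request.urlopen`) would then run extraction over the indexed documents, assuming such a handler is registered.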

Introduction

Objective


1. Parses ingested documents to raw text content and performs character-level normalisation

2. 'Cleansed' text is then passed through the candidate extraction component (a Solr analyzer chain)

3. Candidate terms are loaded from the Solr index and processed by the subsequent filtering component, where different ATE algorithms can be configured

4. Candidate terms can be indexed or exported to support specific use cases (e.g., faceted query, knowledge base construction)
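The four steps above can be sketched end-to-end in a few lines (a conceptual outline only; in JATE2.0 extraction and scoring are Solr components, represented here as injected functions):

```python
import unicodedata

def normalise(raw):
    """Step 1: character-level normalisation of parsed document text."""
    text = unicodedata.normalize("NFKC", raw)
    return " ".join(text.split()).lower()

def run_pipeline(documents, extract, score, min_score=0.0):
    """Steps 2-4: extract candidates, score them, keep those above a threshold.

    `extract` stands in for the configurable candidate extraction component
    (a Solr analyzer chain in JATE2.0) and `score` for the chosen ATE
    algorithm; both are injected to keep the sketch self-contained.
    """
    candidates = {}
    for doc in documents:
        for term in extract(normalise(doc)):
            candidates[term] = candidates.get(term, 0) + 1
    scored = {t: score(t, f) for t, f in candidates.items()}
    return sorted(((t, s) for t, s in scored.items() if s >= min_score),
                  key=lambda ts: ts[1], reverse=True)
```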

Figure 4: Comparison of Top K precisions on GENIA

•  TATA Steel scenario: cause analysis via text analytics
•  To understand the types of potential factors and actions that lead to product failures
•  Users (domain experts) collect and select unstructured documentation (e.g., Lotus Notes) from various data sources
•  JATE2.0 applied to the documents to extract industrial terms for analyzing and linking domain-relevant concepts from textual data
•  Terms used to enable dynamic faceted search/navigation for concept-driven text analytics