Automatic Term Recognition with Apache Solr


Uploaded by jie-gao on 21-Mar-2017. Category: Engineering
TRANSCRIPT

Page 1: Automatic Term Recognition with Apache Solr

•  There is a real lack of open-source ATE tools that facilitate the development of downstream applications, encourage code reuse, enable comparative studies, and foster further research

•  Existing tools are developed under different scenarios and evaluated in different domains using proprietary language resources, making direct comparison difficult.

•  It is unclear whether, and how well, these tools can adapt to different domain tasks and scale to large datasets.

•  Automatic Term Extraction (also Automatic Term Recognition, ATE/ATR) is an important Natural Language Processing (NLP) task concerned with extracting terminology from domain-specific textual corpora.

•  ATE is widely used by both industry and researchers in many complex tasks, such as Information Retrieval (IR), machine translation, ontology engineering and text summarization (Bowker, 2003; Brewster et al., 2007; Maynard et al., 2007).

JATE2.0 Architecture

Automatic Term Recognition with Apache Solr Ziqi Zhang and Jie Gao

1.  JATE2.0 is an open-source library (under LGPLv3 licence), available to download via https://github.com/ziqizhang/jate
2.  For more examples of JATE2.0 usage scenarios and ideas, please refer to the JATE2.0 wiki.

Contact: Jie Gao [email protected]; Ziqi Zhang [email protected]

OAK Group, Department of Computer Science, University of Sheffield, Sheffield, S1 4DP, United Kingdom

Example setting of Part-of-Speech (PoS) pattern based candidate extraction
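The example configuration itself did not survive extraction, but the idea behind PoS-pattern candidate extraction can be sketched in Python (an illustrative stand-in, not JATE2.0's actual Solr analyzer configuration): keep maximal runs of adjective/noun tokens, trimmed so each candidate ends in a noun, mimicking a pattern such as (JJ|NN)* NN.

```python
# Illustrative tag sets; a real configuration would cover the full Penn tagset.
NOUN_TAGS = {"NN", "NNS", "NNP"}
ADJ_TAGS = {"JJ"}

def extract_candidates(tagged, min_words=2):
    """Extract multi-word candidates whose tags match (JJ|NN)* NN.

    tagged: list of (word, tag) pairs from any PoS tagger.
    """
    candidates, run = [], []

    def flush():
        # Trim trailing adjectives so the candidate ends in a noun.
        r = list(run)
        while r and r[-1][1] in ADJ_TAGS:
            r.pop()
        if len(r) >= min_words:
            candidates.append(" ".join(w for w, _ in r))

    for word, tag in tagged:
        if tag in NOUN_TAGS or tag in ADJ_TAGS:
            run.append((word, tag))
        else:
            flush()
            run.clear()
    flush()
    return candidates
```

In JATE2.0 itself this logic lives in the configurable Solr analyzer chain rather than standalone code.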

Acknowledgements

Unique Features

Use cases

Usage Modes

ATE algorithms in JATE2.0 (beta)
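The algorithm list itself appears in a poster figure that did not survive extraction; among the classic termhood measures JATE implements is C-value (Frantzi et al.). A simplified, illustrative sketch of the C-value idea, scoring longer terms higher while penalising candidates nested inside longer candidates:

```python
from math import log2

def c_value(freq):
    """Simplified C-value over {candidate term: corpus frequency}.

    A term nested inside longer candidates is penalised by the average
    frequency of those longer candidates, following Frantzi et al.'s measure.
    """
    scores = {}
    for a in freq:
        # Longer candidates containing `a` as a contiguous word sequence.
        nests = [b for b in freq if b != a and f" {a} " in f" {b} "]
        n_words = len(a.split())
        # log2 length weight; the unigram constant is a smoothing choice
        # added here for illustration, not part of the original formula.
        base = log2(n_words) if n_words > 1 else 0.1
        if nests:
            scores[a] = base * (freq[a] - sum(freq[b] for b in nests) / len(nests))
        else:
            scores[a] = base * freq[a]
    return scores
```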

Evaluation

•  Two datasets: the GENIA dataset (Kim et al., 2003), a corpus of 1,999 MEDLINE abstracts for bio-text mining previously used by Zhang et al. (2008); and the ACL RD-TEC dataset (Zadeh and Handschuh, 2014), containing over 10,900 publications in the domain of computational linguistics

•  Three types of candidate extractors are tested: noun phrase (NP), N-gram, and PoS pattern

•  Overall recall, precision at Top K, and CPU time are measured
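The two quality metrics above are standard; a minimal sketch of how they are computed, assuming a ranked list of extracted terms and a gold-standard term set:

```python
def precision_at_k(ranked_terms, gold, k):
    """Fraction of the top-k ranked candidates found in the gold standard."""
    top = ranked_terms[:k]
    return sum(1 for t in top if t in gold) / len(top)

def overall_recall(extracted, gold):
    """Fraction of gold-standard terms recovered anywhere in the output."""
    return len(set(extracted) & set(gold)) / len(gold)
```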

Figure 5: Comparison of Top K precisions on ACL RD-TEC

Part of this research has been sponsored by the EU-funded project WeSenseIt under grant agreement number 308429, and by the SPEEAK-PC collaboration agreement 101947 of Innovate UK.

Terminology-driven Faceted Search for interactive cause analysis

ATE in combination with sentiment analysis

•  ATE used to improve sentiment analysis deployed by homeland security forces (both English and Italian)
•  Training corpus collection and annotation based on distant supervision
•  ATE for text normalization and standardization, and key term extraction (uni-/bi-grams) from the corpus
•  Key terms used as features to train sentiment classifiers (SVM, Naïve Bayes, logistic regression)
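The final step above, turning extracted key terms into classifier features, might look like the following sketch (an illustration of the idea, not the project's actual feature pipeline):

```python
def term_features(text, term_vocab):
    """Binary feature vector: 1 if an extracted key term occurs in the text.

    Vectors like these can be fed to any off-the-shelf classifier
    (SVM, Naïve Bayes, logistic regression) as in the use case above.
    """
    padded = f" {text.lower()} "
    return [1 if f" {term} " in padded else 0 for term in term_vocab]
```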

JATE2.0 for Translation

ATE is a very useful starting point for a human terminologist or translator. JATE2.0 can work with very large corpora efficiently, and it is easy to use and highly configurable for different domains and languages. With more than 10 algorithms, JATE2.0 can simply be run over a large corpus as input: important, domain-specific terms will be identified, extracted, normalised, ranked and exported with scores into an external file.
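The export step described above might be sketched like this (the CSV layout is a hypothetical example; JATE2.0's actual output format may differ):

```python
import csv

def export_terms(scores, path):
    """Write ranked terms and their scores to a CSV file, highest score first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["term", "score"])
        writer.writerows(ranked)
    return ranked
```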

JATE2.0 for knowledge engineering

JATE2.0 can be used as a concept extraction tool to support the creation of a domain ontology or terminology base directly from a text corpus. Users can take a domain-specific corpus as input and use JATE2.0 to generate normalised candidate terms/concepts as a starting point for further ontology engineering. A future version will support importing the output into Protégé, or working as a Protégé plugin.

To bring both academia and industry under a uniform development and benchmark framework that addresses:

•  Adaptability
•  Scalability
•  High configurability and extensibility

Solution: JATE2.0 integrates with the Apache Solr framework to benefit from its extensive, extensible and flexible text processing libraries; it can be used either as a separate module, or as a Solr plugin applied during document processing to enrich the indexed documents with candidate terms.

•  Expands JATE 1.0's collection of state-of-the-art algorithms, which are not available in any other tool;

•  Linguistic processors (candidate term extraction) are highly customizable and developed as Solr plugins, making JATE2.0 adaptable to many different domains and languages;

•  Two usage modes suit various usage scenarios and can be applied directly to digital archives (both indexed and not yet indexed) in industry;

Embedded mode: runs as a standalone application from the command line. This mode is recommended when users need a list of candidate terms extracted from a corpus to support subsequent knowledge engineering tasks.

Plugin mode: works as a Solr plugin. This mode is recommended when users need to index new documents, or enrich an existing index, with candidate terms, which can, e.g., support faceted search and query boosting (implemented as a custom request handler that triggers term extraction via a simple HTTP request)
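Triggering extraction in plugin mode amounts to a plain HTTP request to the custom request handler. A minimal sketch of building such a request; the base URL, core name, handler path and parameter names here are all illustrative placeholders, not JATE2.0's documented API:

```python
from urllib.parse import urlencode

# Assumed local Solr base URL and core name; adjust for your deployment.
SOLR_BASE = "http://localhost:8983/solr/jate"

def build_termextract_url(handler="/termextract", **params):
    """Build the URL for a custom term-extraction request handler.

    The handler path and parameter names are hypothetical; the real
    handler registered in solrconfig.xml may be named differently.
    """
    query = urlencode(params)
    return f"{SOLR_BASE}{handler}?{query}" if query else f"{SOLR_BASE}{handler}"
```

Fetching the resulting URL (e.g. with `urllib.request.urlopen`) would then run extraction over the indexed documents, assuming such a handler is registered.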

Introduction

Objective


1. Parses ingested documents to raw text content and performs character-level normalisation

2. 'Cleansed' text is then passed through the candidate extraction component (a Solr analyzer chain)

3. Candidate terms are loaded from the Solr index and processed by the subsequent filtering component, where different ATE algorithms can be configured

4. Candidate terms can be indexed or exported to support specific use cases (e.g., faceted query, knowledge base construction)
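The four steps above can be sketched end-to-end in a few lines (a conceptual outline only; in JATE2.0 extraction and scoring are Solr components, represented here as injected functions):

```python
import unicodedata

def normalise(raw):
    """Step 1: character-level normalisation of parsed document text."""
    text = unicodedata.normalize("NFKC", raw)
    return " ".join(text.split()).lower()

def run_pipeline(documents, extract, score, min_score=0.0):
    """Steps 2-4: extract candidates, score them, keep those above a threshold.

    `extract` stands in for the configurable candidate extraction component
    (a Solr analyzer chain in JATE2.0) and `score` for the chosen ATE
    algorithm; both are injected to keep the sketch self-contained.
    """
    candidates = {}
    for doc in documents:
        for term in extract(normalise(doc)):
            candidates[term] = candidates.get(term, 0) + 1
    scored = {t: score(t, f) for t, f in candidates.items()}
    return sorted(((t, s) for t, s in scored.items() if s >= min_score),
                  key=lambda ts: ts[1], reverse=True)
```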

Figure 4: Comparison of Top K precisions on GENIA

•  TATA Steel scenario: cause analysis via text analytics
•  To understand the types of potential factors and actions that lead to product failures
•  Users (domain experts) collect and select unstructured documentation (e.g., Lotus Notes) from various data sources
•  JATE2.0 applied to the documents to extract industrial terms for analyzing and linking domain-relevant concepts from textual data
•  Terms used to enable dynamic faceted search/navigation for concept-driven text analytics