Anatoly Starostin (ABBYY), "ABBYY InfoExtractor: technology...

Post on 21-Jun-2015

DESCRIPTION

The talk describes the part of the ABBYY Compreno technology that is used to develop domain-oriented systems for extracting information from text. It discusses how the core information extraction mechanism works, as well as the OntoDPS (Ontological Data Preparation System) tooling environment, which allows the mechanism to be tuned to specific domains. The core mechanism makes it possible to use the results of full semantic-syntactic text analysis for information extraction and to apply a production system of extraction rules to them. The rule system is assembled within OntoDPS. It is inseparably tied to the ontology of the domain for which the extraction system is being built. Special attention is paid to modularity and encapsulation. The talk demonstrates how the declarative nature of the extraction rules makes it possible to reuse them flexibly across information extraction systems. Automated testing of the resulting systems is also discussed. The emphasis is mostly on architectural and technological decisions. Specific ontological and linguistic questions are barely touched upon. To discuss details of that kind, a demonstration of the internals and operation of a concrete system for extracting information from news texts is planned for the demo session of the AINL 2014 conference.

TRANSCRIPT

ABBYY InfoExtractor: technology of producing domain oriented information extraction systems

Starostin A.S.

NLP-technologies

Rule-based - technologies based on hand-written language rules tailored to a particular task.

Statistics-based - technologies based on machine learning over large text corpora (labeled or parallel).

Hybrid - technologies that combine several approaches, for example rule-based + statistics-based.

Model-based - technologies based on universal (complete) language modeling.

ABBYY Compreno

Universal Semantic Hierarchy

• It is a tree

• Intermediate nodes represent semantic classes (concepts)

• Leaves represent lexical classes

• Concrete lexemes are linked to lexical classes

• All nodes are labeled with grammatical and semantic information (a set of grammemes and a set of semantemes)
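
The structure described in the bullets above can be sketched as a small tree data type. This is only a minimal illustration in Python, not ABBYY's implementation: the class name `Node`, its fields, and every class, grammeme, semanteme, and lexeme name in the example are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Node:
    """A node of a toy semantic hierarchy (all names are illustrative)."""
    name: str
    grammemes: set = field(default_factory=set)    # grammatical labels
    semantemes: set = field(default_factory=set)   # semantic labels
    children: list = field(default_factory=list)
    lexemes: list = field(default_factory=list)    # filled only on leaves
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, child):
        """Attach a child node and return it, so calls can be chained."""
        child.parent = self
        self.children.append(child)
        return child

    @property
    def is_lexical_class(self):
        """In this sketch, leaves are lexical classes."""
        return not self.children

    def ancestors(self):
        """Yield the semantic classes above this node, nearest first."""
        node = self.parent
        while node is not None:
            yield node
            node = node.parent


# Hypothetical fragment of a hierarchy.
root = Node("ENTITY")
org = root.add(Node("ORGANIZATION", semantemes={"Organization"}))
company = org.add(Node("COMPANY"))
leaf = company.add(
    Node("company:COMPANY", grammemes={"Noun"}, lexemes=["company", "firm"])
)

print([n.name for n in leaf.ancestors()])
```

Walking up from a lexical class through `ancestors()` gives the chain of increasingly general concepts that a lexeme belongs to.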

Syntactic-Semantic tree

Google sold Motorola to Lenovo for $2.91 billion.

OWL ontology

RDF graph
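
The slide image is not preserved in this transcript. As a rough illustration of what an RDF graph for the earlier example sentence ("Google sold Motorola to Lenovo for $2.91 billion.") could look like, here is a minimal sketch that stores subject-predicate-object triples as plain Python tuples. The namespace `http://example.org/ie#`, the class `Purchase`, and the properties `seller`, `buyer`, `object`, and `price` are assumptions made for this example, not the ontology used in the talk.

```python
# Hypothetical namespace; the talk's actual ontology URIs are not known.
EX = "http://example.org/ie#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def uri(local_name):
    """Build a full URI in the illustrative namespace."""
    return EX + local_name


# Each triple is (subject, predicate, object).
triples = {
    (uri("google"), RDF_TYPE, uri("Organization")),
    (uri("motorola"), RDF_TYPE, uri("Organization")),
    (uri("lenovo"), RDF_TYPE, uri("Organization")),
    (uri("deal1"), RDF_TYPE, uri("Purchase")),
    (uri("deal1"), uri("seller"), uri("google")),
    (uri("deal1"), uri("buyer"), uri("lenovo")),
    (uri("deal1"), uri("object"), uri("motorola")),
    (uri("deal1"), uri("price"), "2.91 billion USD"),  # plain literal
}


def objects(graph, subject, predicate):
    """Return all objects for a given subject and predicate, sorted."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


print(objects(triples, uri("deal1"), uri("buyer")))
```

The event itself (`deal1`) is a node of the graph, so its participants and the price attach to it as separate statements.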

IE development factory

Extraction algorithm

Parse subtree interpretation rules

Parse subtree interpretation rules

Identification rules

Types of statements

IE system production

Design

• Input: customer needs (informal), text examples (marked up or not)

• Output: OWL ontology where every object is well documented

Development

• Input: well-documented OWL ontology, marked-up text examples

• Output: production system of rules

Testing

• Nightly testing (marked-up corpora)

• Reclamations (pointed error examples)

All three activities take place within one framework, called OntoDPS.
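
Testing against a marked-up corpus can be pictured as comparing the facts a system extracts with the facts annotated by hand. The sketch below is one common way to do this, using precision and recall over sets of triples. The slides do not say which metrics OntoDPS reports, so the metric choice, the function name `evaluate`, and the sample facts are all assumptions.

```python
def evaluate(extracted, gold):
    """Compare extracted facts with gold (hand-annotated) facts.

    Returns precision, recall, the gold facts that were missed,
    and the extracted facts that are not in the gold markup.
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Convention for this sketch: an empty side counts as a perfect score.
    precision = true_positives / len(extracted) if extracted else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    missed = sorted(gold - extracted)
    spurious = sorted(extracted - gold)
    return precision, recall, missed, spurious


# Hypothetical facts for one document of a marked-up corpus.
gold_facts = {
    ("deal1", "buyer", "lenovo"),
    ("deal1", "seller", "google"),
}
system_facts = {
    ("deal1", "buyer", "lenovo"),
    ("deal1", "seller", "motorola"),  # wrong seller
}

precision, recall, missed, spurious = evaluate(system_facts, gold_facts)
print(precision, recall)
print("missed:", missed)
print("spurious:", spurious)
```

A nightly run could apply such a comparison to every document in the corpus and flag changes between runs, while the `missed` and `spurious` lists play the role of pointed error examples.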

IE system design

IE system design (marked up text example)

IE system development: libraries

IE system development: projects

IE system development: reuse and customization

• Adding new items to dictionaries

• Adding new instances to ontologies

• Reuse of libraries and rules

• Complex rule customization

IE system testing: nightly testing

IE system testing: nightly testing

Thank you! Questions?
