adam bartusiak and jörg lässig | semantic processing for the conversion of unstructured documents...

Enterprise Application Development Group University of Applied Sciences Zittau/Görlitz SEMANTiCS’16 - 13.09.2016 Adam Bartusiak M.Sc. University of Applied Sciences Zittau/Görlitz Semantic Processing for the Conversion of Unstructured Documents into Structured Information in the Enterprise Context The NXTM research project


Enterprise Application Development Group, University of Applied Sciences Zittau/Görlitz

The NXTM Project: Development of a technology for live analysis of datastreams with regard to semantics and cross-linked datastructures

Adam Bartusiak M.Sc., University of Applied Sciences Zittau/Görlitz

January 7, 2015

SEMANTiCS’16 - 13.09.2016

Adam Bartusiak M.Sc., University of Applied Sciences Zittau/Görlitz

Semantic Processing for the Conversion of Unstructured Documents into Structured Information in the Enterprise Context

The NXTM research project


Agenda

• Motivation

• The NXTM Project

• Data analysis

• Search Engine

• Representation Layer

• Use case



Motivation

• unstructured data overload (80-90% of digital data)

• unstructured data is intended mainly for human consumption

• it holds useful knowledge that can be utilized for:
  • trend analytics
  • decision support
  • problem solving
  • discovering new facts and relations

• it can improve knowledge management within the enterprise

• it helps SMEs gain a sustainable competitive advantage in the market


The NXTM Project

• cooperation project between HSZG and an IT company from Dresden

• lifetime: January 2015 - October 2016

Goal: Improving SMEs’ processes for extracting valuable business information from unstructured data (UD)

• extraction of structured data from unstructured data from multiple sources:
  • emails and text messages
  • MS Office and PDF documents
  • XML and HTML files

• dynamic recognition and representation of linked information in documents

• flexible and intuitive graphical user interface enabling easy access to the analyzed data


Data analysis

Diagram components: Data Input Interface; NXTM Data and Text Analysis Engine (Metadata Analysis, Text Extraction, Segmentation, Morphology, Semantic Analysis, Similarity Analysis); Data Persistence Layer; DB Mapper; Clustering Engine; Knowledge Integrator; Linked Open Data

1. import of documents as Java objects from the input pipeline

2. language identification, MIME type and metadata analysis

3. natural-language (NL) processing in chained analysis engines, annotating semantic information

4. similarity calculation and document clustering

5. storing extracted data in the DB, updating the search index

6. mapping annotated entities to LOD resources
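The idea of chained analysis engines can be pictured as a list of functions that each add annotations to a shared document object. The following is a minimal sketch in Python, not the project's Java implementation; the `Document` class, the three engine functions and the one-entry gazetteer are illustrative assumptions and do not come from NXTM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Hypothetical stand-in for a document object from the input pipeline."""
    doc_id: str
    raw_text: str
    annotations: Dict[str, object] = field(default_factory=dict)


# An analysis engine takes a document, adds annotations, and returns it.
Engine = Callable[[Document], Document]


def metadata_engine(doc: Document) -> Document:
    # Toy metadata analysis: only records the text length.
    doc.annotations["length"] = len(doc.raw_text)
    return doc


def segmentation_engine(doc: Document) -> Document:
    # Toy segmentation: whitespace tokenization with punctuation stripped.
    tokens = [t.strip(".,;:!?") for t in doc.raw_text.split()]
    doc.annotations["tokens"] = [t for t in tokens if t]
    return doc


def entity_engine(doc: Document) -> Document:
    # Toy semantic analysis: look tokens up in a tiny hand-made gazetteer.
    gazetteer = {"NY": "PLACE"}  # assumption, for illustration only
    doc.annotations["entities"] = [
        (tok, gazetteer[tok]) for tok in doc.annotations["tokens"] if tok in gazetteer
    ]
    return doc


def run_pipeline(doc: Document, engines: List[Engine]) -> Document:
    # Engines run in order; later engines may read earlier annotations.
    for engine in engines:
        doc = engine(doc)
    return doc


PIPELINE = [metadata_engine, segmentation_engine, entity_engine]
result = run_pipeline(Document("1", "Lorem ipsum NY dolor sit amet."), PIPELINE)
print(result.annotations["entities"])
```

Because the entity engine reads the tokens written by the segmentation engine, the order of the list matters; that dependency is the reason for chaining rather than running the engines independently.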

NXTM Item

• ID
• Type (DOC, ENT)
• Attribute []
• …

Attribute

• Predicate
• Value (NXTM_Item_ID; String)
• Provenance (NXTM_Item_ID)
• Confidence
• Access policy
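The item and attribute structure listed above could be modelled roughly as follows. This is a sketch under assumptions: the class and field names mirror the slide, but the `mentions` predicate, the 0.0 to 1.0 confidence range and the string encoding of the access policy are not given in the source and are chosen only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Union


class ItemType(Enum):
    DOC = "DOC"
    ENT = "ENT"


@dataclass(frozen=True)
class ItemRef:
    """Reference to another item by its ID (NXTM_Item_ID on the slide)."""
    item_id: int


@dataclass
class Attribute:
    predicate: str
    value: Union[ItemRef, str]            # link to another item, or a literal
    provenance: Optional[ItemRef] = None  # item the statement was derived from
    confidence: float = 1.0               # assumed range 0.0..1.0
    access_policy: str = "public"         # assumed encoding, not from the slide


@dataclass
class Item:
    item_id: int
    item_type: ItemType
    attributes: List[Attribute] = field(default_factory=list)

    def links(self) -> List[int]:
        """IDs of the items this item points to through its attributes."""
        return [a.value.item_id for a in self.attributes if isinstance(a.value, ItemRef)]


# Hypothetical sample data: a document that mentions an entity.
doc = Item(1, ItemType.DOC, [Attribute("mentions", ItemRef(301), confidence=0.9)])
ent = Item(301, ItemType.ENT, [Attribute("name", "NY", provenance=ItemRef(1))])
print(doc.links())
```

Allowing a value to be either an item reference or a string is what lets the same structure hold both plain metadata and cross-links between documents and entities.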


Search Engine

Diagram components: Search query…; NXTM Search Layer; Semantic Search Machine; Data Persistence Layer; Clustering Engine; Results…

Field      Value
ID         NXTM_Item_ID
Content    LuceneAnalyzer
Semantic   SIREnAnalyzer

• querying the DB directly to retrieve the analysed data is an inefficient way of searching for information

• a semantic search machine can search hierarchical data effectively

• the search engine is still a subject of research
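To illustrate why an index with a separate field for hierarchical data helps, the toy index below keeps one inverted index for plain text and one for nested annotations flattened into path terms. It is a sketch only and uses neither the Lucene nor the SIREn API; `ToyIndex`, `flatten` and the `path=value` term format are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Set, Tuple


def flatten(node: object, path: Tuple[str, ...] = ()) -> Iterator[str]:
    """Turn nested data into 'path=value' terms, e.g. 'entity/type=PLACE'."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten(child, path + (str(key),))
    elif isinstance(node, list):
        for child in node:
            yield from flatten(child, path)
    else:
        yield "/".join(path) + "=" + str(node)


class ToyIndex:
    def __init__(self) -> None:
        # field name -> term -> set of item IDs
        self.postings: Dict[str, Dict[str, Set[int]]] = {
            "content": defaultdict(set),
            "semantic": defaultdict(set),
        }

    def add(self, item_id: int, content: str, semantic: dict) -> None:
        # Full-text field: lowercased whitespace tokens.
        for token in content.lower().split():
            self.postings["content"][token].add(item_id)
        # Hierarchical field: one term per leaf, keyed by its path.
        for term in flatten(semantic):
            self.postings["semantic"][term].add(item_id)

    def search(self, field_name: str, term: str) -> List[int]:
        return sorted(self.postings[field_name].get(term, set()))


index = ToyIndex()
index.add(1, "lorem ipsum NY dolor", {"entity": [{"type": "PLACE", "name": "NY"}]})
index.add(2, "ipsum dolor sit amet", {"entity": [{"type": "PERSON", "name": "John Smith"}]})

print(index.search("content", "ipsum"))
print(index.search("semantic", "entity/type=PLACE"))
```

A lookup in either field is a dictionary access on precomputed postings, whereas answering the same question from the database would require scanning or joining the stored attributes at query time.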


Representation Layer

• search results are represented as an interactive graph with nodes and edges

• real-time browsing of the graph enables the user to discover other relevant sources of information and their dependencies

• d3js.org JavaScript library

Diagram components: NXTM Representation Layer; Standalone Frontend; Plugins & Apps

Example graph nodes (mock-up):

Document
• Title: XYZ
• Abstract: Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor…
• Updated: 03.01.2003

Entity
• Type: Person
• Name: John Smith
• Author of: XYZ
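A graph of nodes and edges is usually handed to a d3 force layout as two arrays, one of nodes and one of links whose `source` and `target` refer to node IDs. The sketch below shows how a backend might serialize search results into that shape; the function name, the `kind` key and the sample IDs are assumptions for illustration, and the actual NXTM payload format is not described in the slides.

```python
import json
from typing import Dict, List, Tuple


def to_node_link(items: Dict[int, str], edges: List[Tuple[int, int, str]]) -> dict:
    """Build a {nodes, links} structure of the kind d3 force layouts consume."""
    nodes = [{"id": item_id, "type": item_type} for item_id, item_type in sorted(items.items())]
    links = [
        {"source": src, "target": dst, "kind": kind}
        for src, dst, kind in edges
        if src in items and dst in items  # drop edges that point to unknown nodes
    ]
    return {"nodes": nodes, "links": links}


# Hypothetical sample data: two documents and one entity.
graph = to_node_link(
    {1: "DOC", 2: "DOC", 301: "ENT"},
    [(1, 2, "DOC-DOC"), (1, 301, "DOC-ENT"), (2, 301, "DOC-ENT")],
)
print(json.dumps(graph, indent=2))
```

On the JavaScript side, a link force configured with an ID accessor can resolve the numeric `source` and `target` values to the matching node objects.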


Use case

Document #1: Lorem ipsum dolor sit amet, consetetur NY elitr, sed diam nonumy eirmod tempor invidunt ut labore et NY dolore magna aliquyam erat, NY sed diam voluptua. At vero eos et accusam et justo duo dolores NY et ea rebum. Stet

Document #2: ipsum dolor sit amet. Lorem NY ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et

Entity

• ID #301
• type PLACE
• name NY (#1)
• name NY (#2)

Metadata

• createdIn NY

Diagram components: NXTM System; NXTM DB

NXTM Item

• ID #1
• Type DOC
• Attribute [] (Metadata)

NXTM Item

• ID #2
• Type DOC
• Attribute [] (Metadata)

NXTM Item

• ID #301
• Type ENT
• Attribute [] (Metadata)
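The use case, in which mentions of "NY" in two documents end up as a single entity item with one `name` attribute per source document, can be sketched as follows. The exact-token matching and the `link_mentions` helper are simplifying assumptions; the slides do not say how NXTM recognizes or merges mentions.

```python
from typing import Dict, List

# Shortened versions of the two placeholder documents from the use case.
documents: Dict[int, str] = {
    1: "Lorem ipsum dolor sit amet, consetetur NY elitr",
    2: "ipsum dolor sit amet. Lorem NY ipsum dolor sit amet",
}


def link_mentions(docs: Dict[int, str], surface: str, entity_id: int, entity_type: str) -> dict:
    """Collect every document that mentions `surface` under one entity item."""
    attributes: List[dict] = []
    for doc_id, text in sorted(docs.items()):
        tokens = [t.strip(".,") for t in text.split()]
        if surface in tokens:
            # One 'name' attribute per source document; provenance keeps the origin.
            attributes.append({"predicate": "name", "value": surface, "provenance": doc_id})
    return {"id": entity_id, "type": entity_type, "attributes": attributes}


entity = link_mentions(documents, "NY", entity_id=301, entity_type="PLACE")
for attr in entity["attributes"]:
    print(f"name {attr['value']} (#{attr['provenance']})")
```

Keeping the provenance on each attribute is what later allows the entity to be connected back to both documents in the result graph.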


Use case cont.

Query: New York

NXTM Results

ResultItem

• NXTM_ITEM_ID #1
• Score
• Attribute []

ResultItem

• NXTM_ITEM_ID #2
• Score
• Attribute []

ResultItem

• NXTM_ITEM_ID #301
• Score
• Attribute []

Result Triples

Source; Target; Distance
ResultItem#1; ResultItem#2; DOC-DOC
ResultItem#1; ResultItem#3; DOC-ENT
ResultItem#2; ResultItem#3; DOC-ENT

Result graph (diagram): ENT #301; DOC #1; DOC #2

Second graph example (diagram): AdamBartusiak …; Person; DOC#45; Metadata; Keywords

• DOC-DOC -> f(TF*IDF Similarity, Lucene score)

• DOC-ENT -> f(Confidence score, Lucene score)
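The slides leave the function f unspecified. As one possible reading, the sketch below computes a plain TF*IDF cosine similarity between documents and combines it with a search score through a weighted mean. The weighted mean, the `alpha` parameter, the sample corpus and the score values 0.8 and 0.9 are all assumptions; real Lucene scores are not limited to the range 0 to 1 and would need scaling first. How the combined value maps to the "Distance" column is likewise not stated in the source.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Plain TF*IDF with idf = log(N / df); terms found in every doc get weight 0."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n / df[term]) for term, count in Counter(doc).items()}
        for doc in docs
    ]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def combine(first: float, second: float, alpha: float = 0.5) -> float:
    # Assumed form of f: a weighted mean of two scores already scaled to [0, 1].
    return alpha * first + (1.0 - alpha) * second


# Made-up corpus for illustration.
corpus = [
    "new york trade report".split(),
    "new york weather report".split(),
    "quarterly budget plan".split(),
]
vectors = tfidf_vectors(corpus)
doc_doc = combine(cosine(vectors[0], vectors[1]), 0.8)  # 0.8: made-up search score
doc_ent = combine(0.9, 0.8)                             # 0.9: made-up confidence score
print(round(doc_doc, 3), round(doc_ent, 3))
```

The first two sample documents share three of their four terms and therefore receive a positive similarity, while the third shares none with them and scores zero.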


Questions

Partners/Cooperations

[email protected] | ead.hszg.de