adam bartusiak and jörg lässig | semantic processing for the conversion of unstructured documents...
Post on 14-Jan-2017
Enterprise Application Development Group
University of Applied Sciences Zittau/Görlitz
The NXTM Project: Development of a technology for live analysis of data streams with regard to semantics and cross-linked data structures
Adam Bartusiak M.Sc., University of Applied Sciences Zittau/Görlitz
January 7, 2015
SEMANTiCS’16 - 13.09.2016
Semantic Processing for the Conversion of Unstructured Documents into Structured Information in the Enterprise Context
The NXTM research project
Agenda
• Motivation
• The NXTM Project
• Data analysis
• Search Engine
• Representation Layer
• Use case
Motivation
• unstructured data overload (80-90% of digital data)
• unstructured data is mostly intended for human consumption only
• it holds useful knowledge that can be utilized for:
  • trend analytics
  • decision support
  • problem solving
  • discovering new facts and relations
• it can improve knowledge management within the enterprise
• it helps SMEs gain a sustainable competitive advantage on the market
The NXTM Project
• cooperation project between HSZG and an IT company from Dresden
• lifetime: January 2015 - October 2016
Goal: Improving SMEs’ processes for extracting valuable business information from unstructured data (UD)
• extraction of structured data from unstructured data from multiple sources:
  • emails and text messages
  • MS Office and PDF documents
  • XML and HTML files
• flexible and intuitive graphical user interface enabling easy access to the analyzed data
• dynamic recognition and representation of linked information in documents
Data analysis
Architecture diagram (labels only):
• Data Input Interface
• NXTM Data and Text Analysis Engine: Metadata Analysis, Text Extraction, Segmentation, Morphology, Semantic Analysis, Similarity Analysis
• Knowledge Integrator
• Linked Open Data
• Data Persistence Layer
• DB Mapper
• Clustering Engine
Processing steps:
1. import of documents as Java objects from the input pipeline
2. language identification, MIME type and metadata analysis
3. NL processing in chained analysis engines and annotation of semantic information
4. similarity calculation and document clustering
5. storing extracted data in the DB, updating the search index
6. mapping annotated entities to LOD resources
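The idea of chained analysis engines (step 3) can be illustrated with a minimal sketch. It is written in Python for brevity, although the slides describe a Java-based system. The `Document` class, the two toy engines and the one-entry gazetteer are assumptions made for this illustration and are not taken from the NXTM implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Hypothetical stand-in for a document handed over by the input pipeline."""
    doc_id: int
    text: str
    annotations: Dict[str, object] = field(default_factory=dict)


# An analysis engine takes a document, adds annotations, and returns it.
Engine = Callable[[Document], Document]


def segment(doc: Document) -> Document:
    # Segmentation: split the text into lowercase word tokens.
    doc.annotations["tokens"] = [t.strip(".,;:!?").lower() for t in doc.text.split()]
    return doc


def annotate_places(doc: Document) -> Document:
    # Toy "semantic analysis": mark tokens found in a tiny, made-up gazetteer.
    gazetteer = {"ny": "PLACE"}
    doc.annotations["entities"] = [
        (tok, gazetteer[tok]) for tok in doc.annotations["tokens"] if tok in gazetteer
    ]
    return doc


def run_pipeline(doc: Document, engines: List[Engine]) -> Document:
    # Chain the engines: each one sees the annotations left by the previous ones.
    for engine in engines:
        doc = engine(doc)
    return doc


doc = run_pipeline(Document(1, "Lorem ipsum NY dolor, sit amet NY."), [segment, annotate_places])
print(doc.annotations["entities"])
```

Because every engine has the same signature, further stages such as morphology or similarity analysis could be appended to the list without changing `run_pipeline`.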
NXTM Item
• ID
• Type (DOC, ENT)
• Attribute []
• …
Attribute
• Predicate
• Value (NXTM_Item_ID; String)
• Provenance (NXTM_Item_ID)
• Confidence
• Access policy
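The item and attribute structure listed above could be modelled roughly as follows. This is a sketch only: the Python field names, the default values, the confidence range and the `"public"` policy name are assumptions, and the fields elided with "…" on the slide are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Attribute:
    predicate: str
    value: Union[int, str]            # either another item's ID or a plain string
    provenance: Optional[int] = None  # ID of the item the statement was derived from
    confidence: float = 1.0           # assumed range 0.0-1.0
    access_policy: str = "public"     # placeholder policy name (assumption)


@dataclass
class NxtmItem:
    item_id: int
    item_type: str                    # "DOC" or "ENT"
    attributes: List[Attribute] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.item_type not in ("DOC", "ENT"):
            raise ValueError(f"unknown item type: {self.item_type}")


# Entity #301 from the use case; the confidence values are invented.
place = NxtmItem(301, "ENT", [
    Attribute("type", "PLACE"),
    Attribute("name", "NY", provenance=1, confidence=0.9),
    Attribute("name", "NY", provenance=2, confidence=0.8),
])
sources = sorted({a.provenance for a in place.attributes if a.provenance is not None})
print(sources)  # IDs of the documents in which the entity was observed
```

Keeping provenance on each attribute, instead of on the item, lets one entity collect statements from several documents while each statement stays traceable to its source.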
Search Engine
Architecture diagram (labels only): Search query…, NXTM Search Layer, Semantic Search Machine, Data Persistence Layer, Clustering Engine, Results…
Index fields:
Field      Value
ID         NXTM_Item_ID
Content    LuceneAnalyzer
Semantic   SIREnAnalyzer
• querying the DB directly to retrieve the analysed data is an inefficient way of searching for information
• a semantic search machine can search hierarchical data effectively
• the search engine is still a subject of research
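To make the split between a full-text field and a structure-aware field more concrete, here is a toy in-memory index with two fields. It does not use Lucene or SIREn, and its two "analyzers" are simple stand-ins written for this sketch. Attaching the `createdIn` attribute to item 2 is also an assumption made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def content_terms(text: str) -> List[str]:
    # Stand-in for a full-text analyzer: lowercase word tokens.
    return [t.strip(".,;:!?").lower() for t in text.split() if t.strip(".,;:!?")]


def semantic_terms(attributes: List[Tuple[str, str]]) -> List[str]:
    # Stand-in for a structure-aware analyzer: keep predicate and value together
    # so a query can ask for a value under a given predicate.
    return [f"{predicate}={value}".lower() for predicate, value in attributes]


class ToyIndex:
    def __init__(self) -> None:
        # field name -> term -> set of item IDs
        self.postings: Dict[str, Dict[str, Set[int]]] = {
            "content": defaultdict(set),
            "semantic": defaultdict(set),
        }

    def add(self, item_id: int, text: str, attributes: List[Tuple[str, str]]) -> None:
        for term in content_terms(text):
            self.postings["content"][term].add(item_id)
        for term in semantic_terms(attributes):
            self.postings["semantic"][term].add(item_id)

    def search(self, field_name: str, term: str) -> List[int]:
        return sorted(self.postings[field_name].get(term.lower(), set()))


index = ToyIndex()
index.add(1, "Lorem ipsum NY dolor", [])
index.add(2, "Lorem NY ipsum", [("createdIn", "NY")])  # attribute placement assumed
print(index.search("content", "ny"))
print(index.search("semantic", "createdIn=NY"))
```

The content field finds every item that mentions the term, while the semantic field only matches items where the term appears under the requested predicate.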
Representation Layer
• search results are represented as an interactive graph with nodes and edges
• real-time browsing of the graph enables the user to discover other relevant sources of information and their dependencies
• d3js.org JavaScript library
Architecture diagram (labels only): NXTM Representation Layer, Standalone Frontend, Plugins & Apps
Example graph nodes:
Document
  Title: XYZ
  Abstract: Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor…
  Updated: 03.01.2003
Entity
  Type: Person
  Name: John Smith
  Author of: XYZ
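A graph frontend of this kind typically consumes a node-link structure. The sketch below serializes the two example nodes into a JSON object with `nodes` and `links` arrays, a shape commonly fed to D3 force layouts. The node IDs, the key names and the filtering of dangling edges are assumptions made for this illustration, not the project's actual exchange format.

```python
import json
from typing import Dict, List


def to_node_link(nodes: List[Dict], edges: List[Dict]) -> str:
    # Drop edges whose endpoints are unknown so the frontend never sees dangling links.
    known = {n["id"] for n in nodes}
    links = [e for e in edges if e["source"] in known and e["target"] in known]
    return json.dumps({"nodes": nodes, "links": links}, ensure_ascii=False)


nodes = [
    {"id": "doc-xyz", "kind": "Document", "title": "XYZ", "updated": "03.01.2003"},
    {"id": "person-john-smith", "kind": "Entity", "type": "Person", "name": "John Smith"},
]
edges = [
    {"source": "person-john-smith", "target": "doc-xyz", "label": "Author of"},
    # This edge points to a node that is not part of the result set and is filtered out.
    {"source": "person-john-smith", "target": "doc-missing", "label": "Author of"},
]
payload = json.loads(to_node_link(nodes, edges))
print(len(payload["nodes"]), len(payload["links"]))
```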
Use case
Document #1
Lorem ipsum dolor sit amet, consetetur NY elitr, sed diam nonumy eirmod tempor invidunt ut labore et NY dolore magna aliquyam erat, NY sed diam voluptua. At vero eos et accusam et justo duo dolores NY et ea rebum. Stet
Document #2
ipsum dolor sit amet. Lorem NY ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et
Entity
• ID #301
• type PLACE
• name NY (#1)
• name NY (#2)
Metadata
• createdIn NY
NXTM System
NXTM Item
• ID #1
• Type DOC
• Attribute [] (Metadata)
NXTM Item
• ID #2
• Type DOC
• Attribute [] (Metadata)
NXTM Item
• ID #301
• Type ENT
• Attribute [] (Metadata)
NXTM DB
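The step from repeated "NY" mentions in two documents to a single entity item can be sketched as below. The mention tuples, the merge key (surface form plus type) and the ID counter starting at 301 are assumptions chosen to reproduce the example. A real system would need proper disambiguation in place of exact string matching.

```python
from typing import Dict, List, Tuple

# (document ID, surface form, entity type) as a toy recognizer might emit them.
mentions: List[Tuple[int, str, str]] = [
    (1, "NY", "PLACE"),
    (1, "NY", "PLACE"),
    (2, "NY", "PLACE"),
]


def merge_mentions(
    mentions: List[Tuple[int, str, str]], first_entity_id: int = 301
) -> Dict[Tuple[str, str], dict]:
    entities: Dict[Tuple[str, str], dict] = {}
    next_id = first_entity_id
    for doc_id, name, ent_type in mentions:
        key = (name, ent_type)
        if key not in entities:
            entities[key] = {"id": next_id, "type": ent_type, "names": []}
            next_id += 1
        # One name attribute per source document, keeping the provenance.
        if (name, doc_id) not in entities[key]["names"]:
            entities[key]["names"].append((name, doc_id))
    return entities


entity = merge_mentions(mentions)[("NY", "PLACE")]
print(entity)
```

Several mentions within the same document collapse into one name attribute, which matches the slide's entity carrying one "name NY" entry per source document.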
Use case cont.
Query: New York
NXTM Results
ResultItem
• NXTM_ITEM_ID #1
• Score
• Attribute []
ResultItem
• NXTM_ITEM_ID #2
• Score
• Attribute []
ResultItem
• NXTM_ITEM_ID #301
• Score
• Attribute []
Result Triples
Source; Target; Distance
ResultItem#1; ResultItem#2; DOC-DOC
ResultItem#1; ResultItem#3; DOC-ENT
ResultItem#2; ResultItem#3; DOC-ENT
Distance:
• DOC-DOC -> f(TF*IDF Similarity, Lucene score)
• DOC-ENT -> f(Confidence score, Lucene score)
Result graph (labels only): DOC #1, DOC #2, ENT #301; Adam Bartusiak …, Person, DOC#45, Metadata, Keywords
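The slide leaves the function f unspecified. As one possible reading of the DOC-DOC case, the sketch below computes a TF-IDF cosine similarity between token lists and blends it with a search score into a distance. The smoothed IDF formula, the linear blend, the weight `alpha` and the assumption that the search score is normalized to [0, 1] are all choices made for this illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: Dict[int, List[str]]) -> Dict[int, Dict[str, float]]:
    n = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        # Smoothed IDF so that terms occurring in every document keep a non-zero weight.
        vectors[doc_id] = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def doc_doc_distance(similarity: float, search_score: float, alpha: float = 0.5) -> float:
    # Hypothetical f: blend both signals (each assumed to be in [0, 1]) and invert,
    # so that similar, highly ranked documents end up close together.
    return 1.0 - (alpha * similarity + (1.0 - alpha) * search_score)


docs = {
    1: "lorem ny ipsum ny dolor".split(),
    2: "ipsum ny lorem amet".split(),
    3: "foo bar".split(),
}
vecs = tfidf_vectors(docs)
sim_12 = cosine(vecs[1], vecs[2])
sim_13 = cosine(vecs[1], vecs[3])
print(round(doc_doc_distance(sim_12, 0.8), 3), round(doc_doc_distance(sim_13, 0.8), 3))
```

Documents 1 and 2 share several terms and therefore end up closer together than documents 1 and 3, which share none. The DOC-ENT case could follow the same pattern with the attribute's confidence score in place of the similarity.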
Questions
Partners/Cooperations
a.bartusiak@hszg.de | ead.hszg.de