
Information processing

Michal Laclavík, Ladislav Hluchý

(Email research, information extraction, information retrieval, contextual recommendation)

Bratislava, 10th November 2011

Primary Research Team & Capabilities

Dept. of Parallel and Distributed Computing

Research and Development Areas:
  – Large-scale HPCN, Grid and MapReduce applications
  – Intelligent and Knowledge oriented Technologies

Experience from IST:
  – 3 projects in FP5: ANFAS, CrossGrid, Pellucid
  – 6 projects in FP6: EGEE II, K-Wf Grid, DEGREE (coordinator), EGEE, int.eu.grid, MEDIGRID
  – 4 projects in FP7: Commius, Admire, Secricom, EGEE III

Several National Projects (SPVV, VEGA, APVT)

IKT Group Focus:
  – Information Processing (Large Scale)
  – Graph Processing
  – Information Extraction and Retrieval
  – Semantic Web
  – Knowledge oriented Technologies
  – Parallel and Distributed Information Processing

Solutions:
  – SGDB: Simple Graph Database
  – gSemSearch: Graph based Semantic Search
  – Ontea: Pattern-based Semantic Annotation
  – ACoMA: KM tool in Email
  – EMBET: Recommendation System
  – Experts on MapReduce and IR (Nutch, Solr, Lucene)

Director & leader of PDC: Dr. Dipl. Ing. Ladislav Hluchý

URL: http://ikt.ui.sav.sk

Approach and Solutions

Large scale Text and Graph data processing

Core Technology
• Web crawling
  – Nutch + plugins
• Full text indexing and search
  – Lucene, Solr
• Information Extraction
  – Ontea, GATE
• All of the above at large scale
  – Hadoop, S4
• Graph processing and querying
  – Simple Graph Database (SGDB)
  – gSemSearch
  – Neo4j
  – Blueprints

The technologies developed by IISAS are Ontea, SGDB and gSemSearch (underlined on the original slide).

Ontea: Information Extraction Tool

• Regex patterns
• Gazetteers
• Results
  – Key-value pairs
  – Structured into trees, graphs
• Transformers, Configuration
• Automatic loading of extractors
• Visual Annotation Tool
• Integration with external tools
  – GATE, Stemmers, Hadoop …
• Multilingual tests
  – English, Slovak, Spanish, Italian


http://ontea.sf.net

Email Search Prototype

• Use of Social Network from email
• Includes extracted objects
• Full text of extracted objects
• Related objects discovered and ordered by spreading activation on the social network graph
• Faceted search, navigation


gSemSearch: Graph based Semantic Search

• Graph/Network of interacting (interconnected) entities
• Discovering relations in the graph (network) using the spread of activation algorithm
• Showing relations of a concrete type, e.g. telephone numbers related to a person
• Navigation over related entities
• Full-text search of the entities
• User interface for search
• User interaction with data (merging, deleting entities) with immediate impact on discovered relations
• Tested on the Enron Email Corpus
  – Email Social Network Search
  – http://ikt.ui.sav.sk/esns/
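
The spread of activation used for relation discovery can be sketched roughly as below. This is only an illustration and not the gSemSearch implementation: the toy graph, the `decay` and `threshold` parameters and the function name `spread_activation` are all assumptions made for the example.

```python
from collections import defaultdict

def spread_activation(graph, start, decay=0.5, threshold=0.01, max_steps=3):
    """Return (node, activation) pairs related to `start`, strongest first.

    graph: dict mapping a node to the list of its neighbours (hypothetical data).
    """
    activation = defaultdict(float)
    frontier = {start: 1.0}
    for _ in range(max_steps):
        next_frontier = defaultdict(float)
        for node, energy in frontier.items():
            neighbours = graph.get(node, [])
            if not neighbours:
                continue
            # Each node passes its decayed energy on, split evenly among neighbours.
            share = energy * decay / len(neighbours)
            if share < threshold:
                continue
            for neighbour in neighbours:
                next_frontier[neighbour] += share
        for node, energy in next_frontier.items():
            activation[node] += energy
        frontier = next_frontier
    activation.pop(start, None)  # the start node is not its own relation
    return sorted(activation.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy email graph: entities extracted from two emails (assumed data).
graph = {
    "person:Alice": ["email:1", "email:2"],
    "email:1": ["person:Alice", "phone:555-0100"],
    "email:2": ["person:Alice", "phone:555-0100", "org:Acme"],
    "phone:555-0100": ["email:1", "email:2"],
    "org:Acme": ["email:2"],
}
related = spread_activation(graph, "person:Alice")
# Relations of a concrete type, e.g. telephone numbers related to a person.
phones = [node for node, _ in related if node.startswith("phone:")]
```

In this toy graph the phone number appears in both emails, so it accumulates more activation than the organization, which appears in only one.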


SGDB: Simple Graph Database

• Storage for graphs
• Optimized for graph traversing and spread of activation
• Faster than Neo4j for graph traversing operations
• Supports Blueprints API
• https://simplegdb.svn.sourceforge.net/svnroot/simplegdb/Sgdb3
• Graph Database Benchmarks
  – Graph Traversal Benchmark for Graph Databases
  – http://ups.savba.sk/~marek/gbench.html
  – Blueprints API - possibility to test compliant Graph databases


Future Direction: Relations Discovery in Large Graph Data

• Motivation
  – Graph/Network data are everywhere: social networks, web, LinkedData, transactions, communication (email, phone).
  – Text can also be converted to a graph.
  – Interconnecting graph data and searching for relations is crucial.
• Approach
  – Forming semantic trees and graphs from text, web, communication, databases and LinkedData
  – User interaction with graph data in order to achieve integration and data cleansing
  – Users will do it if their effort has an immediate impact on search results


Ontea: Pattern based information extraction and semantic annotation

Text processing

Ontea: Information Extraction (Features)

• Regex patterns
• Gazetteers
• Patterns
• IE System configuration
• Automatic loading of extractors
• Visual Annotation Tool
• Integration with external tools
  – GATE, Stemmers, Hadoop …
• Multilingual tests
  – Spanish, Slovak, English, Italian


Information Extraction Model

[Figure: address and product patterns, with Extraction and Processing stages. Labels: Address patterns, 3 words macro, ZIP macro, Street number macro, Street name macro, City name macro, Country macro.]
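
One way to read the macro labels above is that an address pattern is assembled from smaller named regex macros. The sketch below illustrates that idea only; the macro bodies, the template syntax and the helper names `build_pattern` and `extract_address` are assumptions, since the actual Ontea patterns are not shown in the slides.

```python
import re

# Hypothetical macros; each one is a small regex fragment.
MACROS = {
    "STREET_NAME": r"[A-Z][a-z]+(?: [A-Z][a-z]+){0,2}",  # up to 3 capitalised words
    "STREET_NO": r"\d+[A-Za-z]?",
    "ZIP": r"\d{3} ?\d{2}",                               # e.g. 845 07
    "CITY": r"[A-Z][a-z]+",
    "COUNTRY": r"Slovakia|Spain|Italy",
}

def build_pattern(template, macros):
    """Expand {MACRO} placeholders into named capturing groups."""
    groups = {name: f"(?P<{name}>{body})" for name, body in macros.items()}
    return re.compile(template.format(**groups))

# The address pattern is written in terms of macros only (assumed layout).
ADDRESS = build_pattern(
    "{STREET_NAME} {STREET_NO}, {ZIP} {CITY}(?:, {COUNTRY})?", MACROS
)

def extract_address(text):
    """Return the matched address parts as key-value pairs, or {} if none."""
    match = ADDRESS.search(text)
    if match is None:
        return {}
    return {key: value for key, value in match.groupdict().items() if value is not None}

address = extract_address(
    "Please deliver to Main Street 12, 845 07 Bratislava, Slovakia."
)
```

Keeping the macros separate from the address template means a new country or postal code format only touches one macro, not every pattern that uses it.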


Segmentation

• Sentences
• Paragraphs
• Objects (Address, Product ..)


Information Extraction: Gazetteers configuration

Gazetteer
• Can extract information that cannot be properly extracted by regular expression patterns (such as given names, product names, etc.)
• The gazetteer extraction approach is combined with regular-expression-based extraction. For example, personal full names can be extracted with higher precision.
• The gazetteer is easy to update, because it is configured by simple text files.

Gazetteer lists
• Simple text files with keywords

Gazetteer configuration
• Simple text file with lines of the form <list file>:<IE result type>
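
The configuration format above can be illustrated with a small loader. This is a sketch under stated assumptions, not Ontea code: the file names, the keyword lists and the functions `load_gazetteer` and `extract` are hypothetical, and the files are held in memory so the example is self-contained.

```python
import re

# Hypothetical file contents standing in for the text files on disk.
FILES = {
    "gazetteer.cfg": "given_names.txt:person:Name\nproducts.txt:product:Name\n",
    "given_names.txt": "Michal\nLadislav\n",
    "products.txt": "Ontea\nAcoma\n",
}

def load_gazetteer(config_name, files):
    """Map every keyword to its IE result type, per '<list file>:<IE result type>'."""
    keywords = {}
    for line in files[config_name].splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only: the result type may itself contain one.
        list_file, result_type = line.split(":", 1)
        for keyword in files[list_file].splitlines():
            if keyword.strip():
                keywords[keyword.strip()] = result_type
    return keywords

def extract(text, keywords):
    """Return (result type, keyword) pairs in order of appearance in the text."""
    if not keywords:
        return []
    alternatives = "|".join(
        re.escape(k) for k in sorted(keywords, key=len, reverse=True)
    )
    pattern = re.compile(r"\b(?:" + alternatives + r")\b")
    return [(keywords[m.group(0)], m.group(0)) for m in pattern.finditer(text)]

KEYWORDS = load_gazetteer("gazetteer.cfg", FILES)
results = extract("Michal presented Ontea and Acoma.", KEYWORDS)
```

Updating the gazetteer then amounts to adding a line to a list file, with no change to the extraction code.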

Information extractor rules

Information Extraction: Rules configuration

IE System configuration
  – IE dynamically loads and runs its components (XMLRegexExtractor, Gazetteer, RuleTransformer) according to the settings in the IE rules file
  – IE components execute consecutively and operate on a set of information extraction results

[Figure: the information extractor rules file configures a chain of IE components (regex-based, gazetteer, result set transformer). Each IE component takes an IE result set and produces a modified IE result set.]
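
The consecutive execution of IE components over a shared result set can be sketched as follows. The classes only borrow the roles named on the slide; their constructors, the `process` method, the `REGISTRY` lookup and the in-memory `RULES` list (standing in for the IE rules file) are assumptions for illustration.

```python
import re

class RegexExtractor:
    """Adds a (type, value) result for every match of a regular expression."""
    def __init__(self, result_type, pattern):
        self.result_type = result_type
        self.pattern = re.compile(pattern)

    def process(self, text, results):
        found = [(self.result_type, m.group(0)) for m in self.pattern.finditer(text)]
        return results + found

class Gazetteer:
    """Adds a result for every listed keyword that occurs in the text."""
    def __init__(self, result_type, keywords):
        self.result_type = result_type
        self.keywords = keywords

    def process(self, text, results):
        return results + [(self.result_type, k) for k in self.keywords if k in text]

class RuleTransformer:
    """Modifies the result set; here it only renames result types."""
    def __init__(self, renames):
        self.renames = renames

    def process(self, text, results):
        return [(self.renames.get(t, t), v) for t, v in results]

# Component names as they might appear in a rules file (hypothetical).
REGISTRY = {
    "regex": RegexExtractor,
    "gazetteer": Gazetteer,
    "transformer": RuleTransformer,
}

RULES = [
    ("regex", {"result_type": "email", "pattern": r"[\w.]+@[\w.]+\w"}),
    ("gazetteer", {"result_type": "person:Name", "keywords": ["Michal", "Ladislav"]}),
    ("transformer", {"renames": {"email": "contact:Email"}}),
]

def run_pipeline(text, rules):
    """Instantiate the configured components and run them one after another."""
    components = [REGISTRY[name](**config) for name, config in rules]
    results = []
    for component in components:
        results = component.process(text, results)
    return results

results = run_pipeline("Michal wrote to info@example.org.", RULES)
```

Because every component has the same `process(text, results)` shape, the order and selection of components can be changed in the rules alone.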

Semantic Annotation

The concept
• InformationExtractor (IE) produces a set of extraction results
• SemanticAnnotator (SA) consumes the IE result set and builds trees convertible to Ontology instances or to objects according to an XML schema, e.g. Core Components
• SA first builds an intermediate tree of IE results on which it operates
• Upon its creation the tree is not compliant with the Core Components specification and needs to be transformed
• Therefore we have tree transformers, which transform the IE result tree into a compliant tree

Semantic Annotation

• Tree transformers
  – Input is a tree of IE results and output is the modified tree of IE results
  – Tree transformers execute consecutively and operate on a tree of information extraction results
  – Tree transformers, which delete, create, rename, move, switch and order nodes, are configured in the SA rules file

[Figure: tree transformer]
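
A chain of tree transformers might look like the sketch below. It covers only three of the operations listed (delete, rename, order); the node layout, the function names and the sample tree are assumptions and do not reflect the SA rules file format, which the slides do not show.

```python
def node(name, value=None, children=None):
    """Create a tree node as a plain dict (assumed representation)."""
    return {"name": name, "value": value, "children": children or []}

def rename(old, new):
    """Transformer that renames every node called `old` to `new`."""
    def transform(tree):
        if tree["name"] == old:
            tree["name"] = new
        for child in tree["children"]:
            transform(child)
        return tree
    return transform

def delete(name):
    """Transformer that removes every descendant node called `name`."""
    def transform(tree):
        tree["children"] = [
            transform(child) for child in tree["children"] if child["name"] != name
        ]
        return tree
    return transform

def order():
    """Transformer that sorts the children of every node by name."""
    def transform(tree):
        tree["children"] = sorted(
            (transform(child) for child in tree["children"]),
            key=lambda child: child["name"],
        )
        return tree
    return transform

def apply_all(tree, transformers):
    """Run the transformers consecutively; each sees the previous output."""
    for transformer in transformers:
        tree = transformer(tree)
    return tree

# Intermediate tree of IE results (hypothetical content).
tree = node("result", children=[
    node("zip", "845 07"),
    node("debug", "raw match"),
    node("city", "Bratislava"),
])
tree = apply_all(tree, [
    delete("debug"),
    rename("result", "Address"),
    rename("zip", "ZIP"),
    rename("city", "Settlement"),
    order(),
])
child_names = [child["name"] for child in tree["children"]]
```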

Social Networks

Social network reconstruction:
• Probabilistic inference using spreading activation
• Relies on the output of the information extractor (IE) in the form of complex objects

Preliminary results on a set of 50 Spanish emails (phone/name):
• Precision 60% (due to lower recall in IE)
• Precision 85% (achievable with better IE)
• Self-healing (with new incoming emails)

Social Networks

Results as XML or HTML (via XSL Transformations)

Future:
• DataSource for Search for Partner module
• Improve the recall of Information Extractor
• Exploit multi-pass algorithm and named entity recognition: things learned in the first pass will be used in the next, e.g. possible names with initials, etc.
• Build an enhanced statistical reasoning procedure on top of the present Social Network Extractor/Correlator

Email Research

Acoma


Acoma Architecture

• Connected to email protocols on desktop or server
• No need to change working practices
  – Emails are received and sent as before
• Received email is processed by Acoma and enriched with useful information
• Extensible with OSGi modules

[Figure: Acoma deployed on a server or desktop. Labels: Mail Server, POP3, IMAP, SMTP, Acoma, Mail Client, Browser.]

[Figure: Acoma components. Labels: Mail Server, Connector to Email Infrastructure, Information Processing and Extraction, System Connectors, Hint Recommendation, Module 1, Module 2, Module n, Modified, Mail Client, Browser.]

System Connectors

• Connection of Acoma to existing systems
  – Document Archives
  – Internet or Intranet Systems
  – Databases
• Access or import of data
• Key-value pair transformation

[Figure: Meta-Connector with Web Connector, SpreadSheet Connector and Database Connector. Labels: Internet, Key-value, Transformed Key-value.]
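
The slides do not say what the key-value pair transformation does in detail. One plausible reading, sketched below purely as an assumption, is that keys coming from a connector (for example spreadsheet column names) are mapped onto business-object keys such as `org:Name` or `contact:Phone`. The mapping `KEY_MAP`, the sample row and the function `transform_row` are hypothetical.

```python
# Hypothetical mapping from source column names to business-object keys.
KEY_MAP = {
    "Company": "org:Name",
    "Tel": "contact:Phone",
    "E-mail": "contact:Email",
}

def transform_row(row, key_map):
    """Rename mapped keys, trim values, and drop unmapped or empty entries."""
    return {
        key_map[key]: value.strip()
        for key, value in row.items()
        if key in key_map and value and value.strip()
    }

# A row as a spreadsheet connector might deliver it (assumed data).
row = {"Company": " Acme Ltd ", "Tel": "+421 2 5555 0100", "E-mail": "", "Notes": "n/a"}
transformed = transform_row(row, KEY_MAP)
```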


Acoma architecture: Message Post Processing

• Useful hints with links are included in enriched email

• Links lead to internal or external systems (Internet, Intranet)


Business objects in Emails

• A study of 6 organizations shows:
  – Objects can be identified by patterns and gazetteers
  – It is possible to define a set of common objects
• Objects identified:
  – Organization:
      • org:Name, org:RegNo, org:TaxNo
  – Person:
      • person:Name, person:Function
  – Contact:
      • contact:Phone, contact:Email, contact:Webpage
  – Address:
      • address:ZIP, address:Street, address:Settlement
  – Product:
      • product:Name, product:Module, product:Component, product:BOID
  – Document:
      • doc:Invoice, doc:Order, doc:Contract, doc:ChangeRequest
  – Inventory:
      • inventory:ResID, inventory:ResType
  – Other business object
      • ID: BOID

Social Networks and Graph Data

• Relations among objects
• Support for search


Email Search Prototype

• Use of Social Network from email
• Includes extracted objects
• Full text of extracted objects
• Related objects discovered and ordered by spreading activation on the social network graph
• Faceted search, navigation


Context based Recommendation, Knowledge Sharing

EMBET, Acoma

Objective: Recommend and provide information or knowledge to the user in context

EMBET: proactive information and knowledge provision

• Collaboration among users
• Knowledge sharing
• Active knowledge provision
• Reuse of knowledge: notes and other resources

http://ups.savba.sk/kwfgrid/uaa/

EMBET: Achievements

• Software with the following functionality
  – User Problem description
  – Displaying Knowledge
  – Adding Knowledge
  – Knowledge Reuse
  – Permanent Notes Storage
  – Voting on Notes

• EMBET architecture: Core, GUI

• Context detection

• Context Matching to display information & knowledge

• Plain text analysis using Advanced Semantic Annotation Algorithms – OnTeA

• Theory of different context matching algorithms



Acoma: Hint Recommendation

Information Retrieval and Information Extraction

lectures

IR Lectures

• Introduction to Information Retrieval
• Text Operations, Text Analysis, stemming
• Crawling, link processing
• IR Models, Indexing techniques
• IR Software libraries and systems
• Ranking by Graph Algorithms (PageRank, HITS, …) and Searching
• Information Extraction
• Regular Expressions
• Large Scale Data Processing on MapReduce Architecture
• Multimedia Information Retrieval
• Evaluation Techniques, Precision, Recall
• Google
• Semantics and IR, Semantic Web Standards

Lecture conditions

• Every student gets a project focused on
  – Crawling
  – Indexing
  – Ranking
  – Information Extraction
  – Large Scale Information Processing
• Students have to consult on their project 3 times during the semester
• Availability of data from day one
• Lectures are available at:
  – http://vi.ikt.ui.sav.sk/Témy_prednášok

[Figure: search engine architecture. Labels (translated from Slovak): Downloader, Link processing, Text operations, Indexer, Ranking, Searcher, Links, Document base, Document index, Query, List of documents, User, Internet.]
