Information processing Michal Laclavík, Ladislav Hluchý (Email research, information extraction, information retrieval, contextual recommendation)

Upload: adam-lyons

Post on 12-Jan-2016


Page 1: Information processing Michal Laclavík, Ladislav Hluchý (Email research, information extraction, information retrieval, contextual recommendation)

Information processing

Michal Laclavík, Ladislav Hluchý

(Email research, information extraction, information retrieval, contextual recommendation)

Page 2: Primary Research Team & Capabilities

Bratislava, 10th November 2011

Dept. of Parallel and Distributed Computing

Research and Development Areas:
– Large-scale HPCN, Grid and MapReduce applications
– Intelligent and Knowledge oriented Technologies

Experience from IST:
– 3 projects in FP5: ANFAS, CrossGrid, Pellucid
– 6 projects in FP6: EGEE II, K-Wf Grid, DEGREE (coordinator), EGEE, int.eu.grid, MEDIGRID
– 4 projects in FP7: Commius, Admire, Secricom, EGEE III

Several National Projects (SPVV, VEGA, APVT)

IKT Group Focus:
– Information Processing (Large Scale)
– Graph Processing
– Information Extraction and Retrieval
– Semantic Web
– Knowledge oriented Technologies
– Parallel and Distributed Information Processing

Solutions:
– SGDB: Simple Graph Database
– gSemSearch: Graph based Semantic Search
– Ontea: Pattern-based Semantic Annotation
– ACoMA: KM tool in Email
– EMBET: Recommendation System
– Experts on MapReduce and IR (Nutch, Solr, Lucene)

Director & leader of PDC: Dr. Dipl. Ing. Ladislav Hluchý

URL: http://ikt.ui.sav.sk

Page 3: Approach and Solutions

Page 4: Large scale Text and Graph data processing

Core Technology
• Web crawling
– Nutch + plugins
• Full text indexing and search
– Lucene, Solr
• Information Extraction
– Ontea, GATE
• All above large scale
– Hadoop, S4
• Graph processing and Querying
– Simple Graph Database (SGDB)
– gSemSearch
– Neo4j
– Blueprints

Underlined are the technologies developed by IISAS

Page 5: Ontea: Information Extraction Tool

• Regex patterns
• Gazetteers
• Results
– Key-value pairs
– Structured into trees, graphs
• Transformers, Configuration
• Automatic loading of extractors
• Visual Annotation Tool
• Integration with external tools
– GATE, Stemmers, Hadoop …
• Multilingual tests
– English, Slovak, Spanish, Italian

http://ontea.sf.net
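The idea of regex patterns producing key-value pairs can be sketched as follows. This is a minimal illustration in Python, not Ontea's actual implementation or API: the two regular expressions and the function name `extract` are assumptions made for the example, and only the key names `contact:Email` and `contact:Phone` are taken from the business objects listed later in the slides.

```python
import re

# Hypothetical patterns (not Ontea's): result type -> regex with one capturing group.
PATTERNS = {
    "contact:Email": re.compile(r"\b([\w.+-]+@[\w-]+(?:\.[\w-]+)+)"),
    "contact:Phone": re.compile(r"(\+?\d[\d /-]{7,}\d)"),
}


def extract(text):
    """Return a list of (key, value) pairs for every pattern match in text."""
    results = []
    for key, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            results.append((key, match.group(1)))
    return results


pairs = extract("Call +421 2 5941 1111 or write to info@example.org.")
for key, value in pairs:
    print(key, "=>", value)
```

Each match becomes one key-value pair, so later stages (gazetteers, transformers) can work on a flat result set without knowing which pattern produced it.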

Page 6: Email Search Prototype

• Use of Social Network from email
• Includes extracted objects
• Full text of extracted objects
• Related objects discovered and ordered by spreading activation on the social network graph
• Faceted search, navigation

Page 7: gSemSearch: Graph based Semantic Search

• Graph/Network of interacting (interconnected) entities
• Discovering relations in the graph (network) using the spreading activation algorithm
• Showing relations of a concrete type, e.g. telephone numbers related to a person
• Navigation over related entities
• Full-text search of the entities
• User interface for search
• User interaction with data (merging, deleting entities) with immediate impact on discovered relations
• Tested on the Enron Email Corpus
– Email Social Network Search
– http://ikt.ui.sav.sk/esns/
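Relation discovery by spreading activation can be sketched as follows. This is a simplified Python illustration, not the gSemSearch algorithm itself: the decay factor, the threshold, the step limit, the equal split of energy among neighbours and the toy graph are all assumptions made for the example.

```python
from collections import defaultdict


def spread_activation(graph, start, decay=0.5, threshold=0.01, max_steps=3):
    """Rank nodes by activation received from `start`.

    graph: dict mapping a node to the list of its neighbours.
    In each step a node passes `decay` times its energy, split equally
    among its neighbours; shares below `threshold` are dropped.
    """
    activation = defaultdict(float)
    frontier = {start: 1.0}
    for _ in range(max_steps):
        next_frontier = defaultdict(float)
        for node, energy in frontier.items():
            neighbours = graph.get(node, [])
            if not neighbours:
                continue
            share = energy * decay / len(neighbours)
            if share < threshold:
                continue
            for neighbour in neighbours:
                next_frontier[neighbour] += share
        for node, energy in next_frontier.items():
            activation[node] += energy
        frontier = next_frontier
    activation.pop(start, None)  # the start node is not a result
    return sorted(activation.items(), key=lambda kv: kv[1], reverse=True)


# Toy email graph (hypothetical data): persons and phone numbers linked via emails.
graph = {
    "person:Alice": ["email:1", "email:2"],
    "email:1": ["person:Alice", "phone:555-0100"],
    "email:2": ["person:Alice", "phone:555-0100", "person:Bob"],
    "phone:555-0100": ["email:1", "email:2"],
    "person:Bob": ["email:2"],
}
ranked = spread_activation(graph, "person:Alice")
phones = [node for node, _ in ranked if node.startswith("phone:")]
```

Filtering the ranked list by node type, as in the last line, corresponds to showing relations of a concrete type, such as telephone numbers related to a person. In this toy graph the phone number appears in both of Alice's emails, so it receives more activation than Bob, who appears in only one.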

Page 8: SGDB: Simple Graph Database

• Storage for graphs
• Optimized for graph traversing and spreading activation
• Faster than Neo4j for graph traversing operations
• Supports Blueprints API
• https://simplegdb.svn.sourceforge.net/svnroot/simplegdb/Sgdb3
• Graph Database Benchmarks
– Graph Traversal Benchmark for Graph Databases
– http://ups.savba.sk/~marek/gbench.html
– Blueprints API: possibility to test compliant graph databases

Page 9: Future Direction: Relations Discovery in Large Graph Data

• Motivation
– Graph/Network data are everywhere: social networks, web, LinkedData, transactions, communication (email, phone).
– Text can also be converted to a graph.
– Interconnecting graph data and searching for relations is crucial.
• Approach
– Forming semantic trees and graphs from text, web, communication, databases and LinkedData
– User interaction with graph data in order to achieve integration and data cleansing
– Users will do it if their effort has an immediate impact on search results

Page 10: Ontea: Pattern based information extraction and semantic annotation

Text processing

Page 11: Ontea: Information Extraction (Features)

• Regex patterns
• Visual Annotation Tool
• Integration with external tools
– GATE, Stemmers, Hadoop …
• Gazetteers
• IE System configuration
• Automatic loading of extractors
• Patterns
• Multilingual tests
– Spanish, Slovak, English, Italian

Page 12: Information Extraction Model

[Diagram labels: Address and product patterns; Extraction; Processing; 3 words macro; ZIP macro; Street number macro; Street name macro; City name macro; Country macro; Address patterns]
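One plausible reading of the diagram is that small reusable macros (ZIP, street number, street name, city name) are combined into a larger address pattern. The sketch below illustrates that reading in Python. It is an assumption, not the Ontea pattern syntax: the macro regexes, the `{NAME}` template notation and the `expand` helper are all hypothetical.

```python
import re

# Hypothetical macros: each is a small regex fragment for one address part.
MACROS = {
    "STREET_NAME": r"[A-Z][a-z]+(?: [a-z]+)?",
    "STREET_NUMBER": r"\d+[A-Za-z]?",
    "ZIP": r"\d{3} ?\d{2}",
    "CITY": r"[A-Z][a-z]+(?: [A-Z][a-z]+)*",
}


def expand(template, macros):
    """Replace each {NAME} placeholder with a named group wrapping the macro."""
    return re.sub(
        r"\{(\w+)\}",
        lambda m: "(?P<%s>%s)" % (m.group(1).lower(), macros[m.group(1)]),
        template,
    )


# The address pattern is written once in terms of the macros.
ADDRESS = re.compile(expand("{STREET_NAME} {STREET_NUMBER}, {ZIP} {CITY}", MACROS))

match = ADDRESS.search("Send it to Dubravska cesta 9, 845 07 Bratislava please.")
address = match.groupdict() if match else {}
```

Because each macro becomes a named group, a single match already yields the parts of the address as key-value pairs, and a macro such as ZIP can be changed for another country without touching the address pattern.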

Page 13: Segmentation

• Sentences
• Paragraphs
• Objects (Address, Product ..)

Page 14: Information Extraction: Gazetteers configuration

Gazetteer
• Can extract information which cannot be properly extracted by regular expression patterns (such as given names, product names, etc.)
• The gazetteer extraction approach is combined with regular-expression-based extraction. For example, personal full names can be extracted with higher precision.
• The gazetteer is easy to update, because it is configured by simple text files.

Gazetteer lists: simple text files with keywords

Gazetteer configuration: simple text file with <list file>:<IE result type>

[Figure label: Information extractor rules]
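A gazetteer driven by such text files could look roughly like the Python sketch below. Only the `<list file>:<IE result type>` line format comes from the slide. The file names, the keyword lists and the function names are hypothetical, and the files are kept in memory so that the example is self-contained.

```python
import re

# Hypothetical gazetteer list files: one keyword per line.
LIST_FILES = {
    "given_names.lst": "Michal\nLadislav\nMarek\n",
    "products.lst": "Ontea\nAcoma\n",
}

# Hypothetical configuration in the "<list file>:<IE result type>" form.
CONFIG = "given_names.lst:person:Name\nproducts.lst:product:Name\n"


def load_gazetteer(config_text, read_file):
    """Map every keyword to the IE result type of its list file."""
    keywords = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only: the result type may contain colons.
        list_file, result_type = line.split(":", 1)
        for word in read_file(list_file).splitlines():
            if word.strip():
                keywords[word.strip()] = result_type
    return keywords


def annotate(text, keywords):
    """Return (result type, keyword, offset) for each whole-word keyword hit."""
    results = []
    for word, result_type in keywords.items():
        for m in re.finditer(r"\b%s\b" % re.escape(word), text):
            results.append((result_type, m.group(0), m.start()))
    return sorted(results, key=lambda r: r[2])


keywords = load_gazetteer(CONFIG, LIST_FILES.__getitem__)
found = annotate("Michal presented Ontea and Acoma.", keywords)
```

In a real deployment `read_file` would read the list files from disk. Adding a new product name then means appending one line to a text file, which is the ease of update the slide refers to.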

Page 15: Information Extraction: Rules configuration

IE System configuration
– IE dynamically loads and runs its components (XMLRegexExtractor, Gazetteer, RuleTransformer) according to the settings in the IE rules file
– IE components execute consecutively and operate on a set of information extraction results

[Diagram labels: Information extractor rules file; IE result set; Modified IE result set; IE component; Regex based IE component; Gazetteer IE component; Result set transformer IE component]
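The consecutive execution of components over a shared result set can be sketched as a simple pipeline. The Python example below is an illustration under assumptions: the three functions are simplified stand-ins for the regex based, gazetteer and result set transformer components, and the `RULES` list stands in for the IE rules file, whose real format is not shown in the slides.

```python
import re


def regex_extractor(text, results):
    """Stand-in for a regex based component: adds ZIP codes to the result set."""
    for m in re.finditer(r"\b\d{3} ?\d{2}\b", text):
        results.append({"type": "address:ZIP", "value": m.group(0)})
    return results


def gazetteer(text, results):
    """Stand-in for a gazetteer component with a tiny hypothetical city list."""
    for city in ("Bratislava", "Kosice"):
        if city in text:
            results.append({"type": "address:Settlement", "value": city})
    return results


def normalise_zip(text, results):
    """Stand-in for a result set transformer: removes spaces from ZIP codes."""
    for result in results:
        if result["type"] == "address:ZIP":
            result["value"] = result["value"].replace(" ", "")
    return results


# Components available for loading, looked up by the name used in the rules.
REGISTRY = {
    "regex_extractor": regex_extractor,
    "gazetteer": gazetteer,
    "normalise_zip": normalise_zip,
}

# Hypothetical rules: the order of names is the order of execution.
RULES = ["regex_extractor", "gazetteer", "normalise_zip"]


def run_pipeline(text, rules, registry):
    """Run each configured component in turn on the shared result set."""
    results = []
    for name in rules:
        results = registry[name](text, results)
    return results


results = run_pipeline("Dubravska cesta 9, 845 07 Bratislava", RULES, REGISTRY)
```

Every component has the same signature, so a new extractor or transformer can be added by registering it and naming it in the rules, without changing the pipeline code.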

Page 16: Semantic Annotation

The concept
• InformationExtractor (IE) produces a set of extraction results
• SemanticAnnotator (SA) consumes the IE result set and builds trees convertible to Ontology instances or objects according to an XML schema, e.g. Core Components
• SA first builds an intermediate tree of IE results on which it operates
• Upon its creation the tree is not compliant with the Core Components specification and needs to be transformed
• Therefore we have tree transformers, which transform the IE result tree into a tree

Page 17: Semantic Annotation

• Tree transformers
– Input is a tree of IE results and output is the modified tree of IE results
– Tree transformers execute consecutively and operate on a tree of information extraction results
– Tree transformers, which delete, create, rename, move, switch and order nodes, are configured in the SA rules file

[Figure label: Tree transformer]
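A few of the listed tree transformer operations (rename, delete, order) can be sketched as follows. This Python example is illustrative only: the `Node` class, the rule tuples and the node names are assumptions, and the real SA rules file format is not shown in the slides.

```python
class Node:
    """A node of the intermediate IE result tree (hypothetical structure)."""

    def __init__(self, name, value=None, children=None):
        self.name = name
        self.value = value
        self.children = children or []


def rename(tree, old, new):
    """Rename every node called `old` to `new`."""
    if tree.name == old:
        tree.name = new
    for child in tree.children:
        rename(child, old, new)
    return tree


def delete(tree, name):
    """Remove every descendant node called `name`, with its subtree."""
    tree.children = [delete(c, name) for c in tree.children if c.name != name]
    return tree


def order(tree):
    """Sort children by name (case-insensitive) at every level."""
    tree.children.sort(key=lambda c: c.name.lower())
    for child in tree.children:
        order(child)
    return tree


OPERATIONS = {"rename": rename, "delete": delete, "order": order}

# Hypothetical rules: executed consecutively, each on the output of the previous one.
RULES = [("rename", "zip", "PostalCode"), ("delete", "noise"), ("order",)]


def transform(tree, rules):
    for operation, *args in rules:
        tree = OPERATIONS[operation](tree, *args)
    return tree


tree = Node("address", children=[
    Node("zip", "84507"),
    Node("noise", "x"),
    Node("city", "Bratislava"),
])
result = transform(tree, RULES)
child_names = [child.name for child in result.children]
```

Keeping the operations generic and the rules in data mirrors the slide's point that the transformers are configured rather than hard-coded for one target schema.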

Page 18: Social Networks

Social network reconstruction:
• probabilistic inference using spreading activation
• relies on the output of the information extractor (IE) in the form of complex objects

Preliminary results on a set of 50 Spanish emails (phone/name):
• Precision 60% (due to lower recall in IE)
• Precision 85% (achievable with better IE)
• self-healing (with new incoming emails)

Page 19: Social Networks

Results as XML or HTML (via XSL Transformations)

Future:
• DataSource for Search for Partner module
• Improve the recall of Information Extractor
• Exploit multi-pass algorithm and named entity recognition: things learned in the first pass will be used in the next, e.g. possible names with initials, etc.
• Build an enhanced statistical reasoning procedure on top of the present Social Network Extractor/Correlator

Page 20: Email Research

Acoma

Page 21: Acoma Architecture

• Connected to email protocols on desktop or server
• No need to change working practices
– Emails are received and sent as before
• Received email is processed by Acoma and enriched with useful information
• Extensible with OSGi modules

[Diagram labels: Server; Desktop; Mail Server; POP3; IMAP; SMTP; Acoma; Mail Client; Browser; Information Processing and Extraction; Modified; Connector to Email Infrastructure; System Connectors; Hint Recommendation; Module 1; Module 2; Module n]

Page 22: System Connectors

• Connection of Acoma to existing systems
– Document Archives
– Internet or Intranet Systems
– Databases
• Access or import of data
• Key-value pair transformation

[Diagram labels: Meta-Connector; Web Connector; SpreadSheet Connector; Database Connector; Internet; Key-value; Transformed Key-value]

Page 23: Acoma architecture: Message Post Processing

• Useful hints with links are included in enriched email
• Links lead to internal or external systems (Internet, Intranet)

Page 24: Business objects in Emails

• A study of 6 organizations shows:
– Objects can be identified by patterns and gazetteers
– It is possible to define a set of common objects
• Objects identified:
– Organization: org:Name, org:RegNo, org:TaxNo
– Person: person:Name, person:Function
– Contact: contact:Phone, contact:Email, contact:Webpage
– Address: address:ZIP, address:Street, address:Settlement
– Product: product:Name, product:Module, product:Component, product:BOID
– Document: doc:Invoice, doc:Order, doc:Contract, doc:ChangeRequest
– Inventory: inventory:ResID, inventory:ResType
– Other business object, ID: BOID

Page 25: Social Networks and Graph Data

• Relations among objects
• Support for search

Page 26: Email Search Prototype

• Use of Social Network from email
• Includes extracted objects
• Full text of extracted objects
• Related objects discovered and ordered by spreading activation on the social network graph
• Faceted search, navigation

Page 27: Context based Recommendation, Knowledge Sharing

EMBET, Acoma

Page 28: EMBET: proactive information and knowledge provision

Objective: Recommend and provide information or knowledge to the user in context

• Collaboration among users
• Knowledge sharing
• Active knowledge provision
• Reuse of knowledge: notes and other resources

http://ups.savba.sk/kwfgrid/uaa/

Page 29: EMBET: Achievements

• Software with the following functionality
– User Problem description
– Displaying Knowledge
– Adding Knowledge
– Knowledge Reuse
– Permanent Notes Storage
– Voting on Notes
• EMBET architecture: Core, GUI
• Context detection
• Context Matching to display information & knowledge
• Plain text analysis using Advanced Semantic Annotation Algorithms (OnTeA)
• Theory of different context matching algorithms

Page 30: Acoma: Hint Recommendation

Page 31: Information Retrieval and Information Extraction

Lectures

Page 32: IR Lectures

• Introduction to Information Retrieval
• Text Operations, Text Analysis, stemming
• Crawling, link processing
• IR Models, Indexing techniques
• IR Software libraries and systems
• Ranking by Graph Algorithms (PageRank, HITS, …) and Searching
• Information Extraction
• Regular Expressions
• Large Scale Data Processing on MapReduce Architecture
• Multimedia Information Retrieval
• Evaluation Techniques, Precision, Recall
• Google
• Semantics and IR, Semantic Web Standards

Page 33: Lecture conditions

• Every student gets a project focused on
– Crawling
– Indexing
– Ranking
– Information Extraction
– Large Scale Information Processing
• Students have to consult on the project 3 times during the semester
• Availability of data from day one
• Lectures are available at:
– http://vi.ikt.ui.sav.sk/Témy_prednášok

[Diagram labels, translated from Slovak: Link processing; Indexer; Ranking; Search engine; Document base; Links; Document index; Downloader; Text operations; Query; User; List of documents; Internet]