Post on 16-Jan-2016
Email Processing and Recommendation
Michal Laclavík, Ladislav Hluchý, Martin Šeleng
(Email research, information extraction, information retrieval, contextual recommendation)
Abstract
In this presentation we give an overview of our research on text processing and recommendation. We focus on information and knowledge hidden in email communication in an organizational or enterprise context.
We exploit simple information extraction techniques based on patterns and gazetteers to deliver a semantic or semi-formal understanding of text (email) content and context. The context is used for recommendation. We have developed proof-of-concept prototypes of email-based recommendation and search. They work with key-value pairs (named entities) extracted from text (emails) and with hierarchical trees built from the recognized entities. In addition, we exploit social networks hidden in email archives.
Vienna, 14th October 2010, IRF-TUWIEN Doctoral Seminar
Primary Research Team & Capabilities
Dept. of Parallel and Distributed Computing
Research and Development Areas:
– Large-scale HPCN and Grid applications
– Intelligent and Knowledge oriented Technologies
Experience from European IST projects:
– 3 projects in FP5: ANFAS, CrosGRID, Pellucid
– 6 projects in FP6: EGEE II, K-Wf Grid, DEGREE (coordinator), EGEE, int.eu.grid, MEDIGRID
– 4 projects in FP7: Commius, Admire, EGEE III, Secricom
Several National Projects (SPVV, VEGA, APVT)
IKT Group Focus:
– Information Processing
– Semantic Web
– Knowledge oriented Technologies
– Parallel and Distributed Information Processing
Solutions:
– Ontea: Pattern-based Semantic Annotation
– ACoMA: KM tool in Email
– EMBET: Recommendation System
Director & leader of PDC: Dr. Dipl. Ing. Ladislav Hluchý
URL: http://ikt.ui.sav.sk
Ontea: Pattern based information extraction and semantic annotation
Text processing
Ontea: Information Extraction (Features)
• Regex patterns
• Visual Annotation Tool
• Integration with external tools
– GATE, Stemmers, Hadoop …
• Gazetteers
• IE System configuration
• Automatic loading of extractors
• Patterns
• Multilingual tests
– Spanish, Slovak, English, Italian
Information Extraction Model
Address and product patterns
[Figure: extraction and processing of address patterns, which are composed from macros: 3 words macro, ZIP macro, street number macro, street name macro, city name macro, country macro]
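The idea of composing address patterns from smaller macros can be sketched in Python as follows. This is only an illustration, not the Ontea implementation: the macro names follow the figure labels, but the regular expressions, the `expand` helper and the sample address layout (street, number, ZIP, city) are assumptions.

```python
import re

# Hypothetical macros: names follow the slide labels, bodies are assumptions.
MACROS = {
    "STREET_NAME": r"[A-Z][a-z]+(?: [A-Z][a-z]+){0,2}",  # up to 3 capitalised words
    "STREET_NUMBER": r"\d+[a-zA-Z]?",
    "ZIP": r"\d{4,5}",
    "CITY": r"[A-Z][a-z]+(?: [A-Z][a-z]+)?",
}


def expand(template):
    """Replace each {MACRO} placeholder with a named regex group."""
    def repl(match):
        name = match.group(1)
        return f"(?P<{name}>{MACROS[name]})"
    return re.sub(r"\{(\w+)\}", repl, template)


# One address pattern built from the macros (assumed layout).
ADDRESS = re.compile(expand(r"{STREET_NAME} {STREET_NUMBER}, {ZIP} {CITY}"))


def extract_address(text):
    """Return the first address found as key-value pairs, or None."""
    m = ADDRESS.search(text)
    if m is None:
        return None
    return {
        "address:Street": f"{m.group('STREET_NAME')} {m.group('STREET_NUMBER')}",
        "address:ZIP": m.group("ZIP"),
        "address:Settlement": m.group("CITY"),
    }
```

Keeping the macros in one table means a new address layout only needs a new template string, while the building blocks stay shared.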
Segmentation
• Sentences
• Paragraphs
• Objects (Address, Product ..)
Information Extraction: Gazetteers configuration
Gazetteer
– Can extract information that cannot be properly extracted by regular expression patterns (such as given names, product names, etc.)
– The gazetteer approach is combined with extraction based on regular expressions. For example, full personal names can be extracted with higher precision.
– The gazetteer is easy to update because it is configured by simple text files.
Gazetteer lists: simple text files with keywords
Gazetteer configuration: simple text file with <list file>:<IE result type> entries
Information extractor rules
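A minimal sketch of how such a configuration could be read, assuming one `<list file>:<IE result type>` entry per line and one keyword per line in each list file. The function names, the case-insensitive matching and the rule that list files are resolved relative to the configuration file are assumptions for illustration, not the actual Ontea behaviour.

```python
import re
from pathlib import Path


def load_gazetteer(config_path):
    """Read '<list file>:<IE result type>' lines; return {keyword: result type}."""
    config_path = Path(config_path)
    keywords = {}
    for line in config_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only, so result types such as
        # 'person:Name' keep their own colon.
        list_file, result_type = line.split(":", 1)
        # Assumption: list files live next to the configuration file.
        list_path = config_path.parent / list_file.strip()
        for keyword in list_path.read_text(encoding="utf-8").splitlines():
            keyword = keyword.strip()
            if keyword:
                keywords[keyword.lower()] = result_type.strip()
    return keywords


def annotate(text, keywords):
    """Return (result type, token) pairs for tokens found in the gazetteer."""
    results = []
    for token in re.findall(r"\w+", text):
        result_type = keywords.get(token.lower())
        if result_type:
            results.append((result_type, token))
    return results
```

Adding a new name or product then only requires appending a line to the corresponding list file.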
Information Extraction: Rules configuration
IE System configuration
– IE dynamically loads and runs its components (XMLRegexExtractor, Gazetteer, RuleTransformer) according to the settings in the IE rules file
– IE components execute consecutively and operate on a set of information extraction results
[Figure: the information extractor rules file configures a chain of IE components (regex based IE component, gazetteer IE component, result set transformer IE component); each component receives the IE result set and passes on a modified IE result set]
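The consecutive execution of components over a shared result set might look like the sketch below. The component functions, the `COMPONENTS` registry and the plain Python list standing in for the IE rules file are all hypothetical; the real system loads XMLRegexExtractor, Gazetteer and RuleTransformer from its own configuration.

```python
import re


def regex_extractor(text, results):
    """Regex based component (assumed pattern: e-mail addresses)."""
    for m in re.finditer(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text):
        results.append(("contact:Email", m.group()))
    return results


def gazetteer(text, results):
    """Gazetteer component with a tiny hard-coded keyword list."""
    names = {"Michal", "Martin"}
    for token in re.findall(r"\w+", text):
        if token in names:
            results.append(("person:Name", token))
    return results


def deduplicate(text, results):
    """Result set transformer: drop repeats, keep first-seen order."""
    return list(dict.fromkeys(results))


# Hypothetical registry that maps names used in the rules to components.
COMPONENTS = {
    "regex": regex_extractor,
    "gazetteer": gazetteer,
    "dedup": deduplicate,
}


def run_pipeline(rules, text):
    """Run the configured components one after another on one result set."""
    results = []
    for name in rules:
        results = COMPONENTS[name](text, results)
    return results
```

Because every component has the same signature, changing the order or adding a component is a change to the rules list only.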
Semantic Annotation
The concept
• InformationExtractor - IE produces a set of extraction results
• SemanticAnnotator - SA consumes the IE result set and builds trees convertible to Ontology instances or to objects according to an XML schema, e.g. Core Components
• SA first builds an intermediate tree of IE results on which it operates
• Upon its creation the tree is not compliant with the Core Components specification and needs to be transformed
• Therefore we have tree transformers, which transform the IE result tree into compliant trees
Semantic Annotation
• Tree transformers
– Input is a tree of IE results and output is the modified tree of IE results
– Tree transformers execute consecutively and operate on a tree of information extraction results
– Tree transformers, which delete, create, rename, move, switch and order nodes, are configured in the SA rules file
[Figure: tree transformer]
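To make the tree transformer idea concrete, here is a small sketch with three of the operations named above (rename, move, delete). The `Node` class, the function signatures and the list of tuples standing in for the SA rules file are assumptions; the actual rules file format is not shown in the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    value: str = ""
    children: list = field(default_factory=list)


def walk(node):
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def rename(root, old, new):
    for node in walk(root):
        if node.name == old:
            node.name = new
    return root


def delete(root, name):
    for node in walk(root):
        node.children = [c for c in node.children if c.name != name]
    return root


def move(root, name, new_parent):
    """Detach every node called `name` and attach it under `new_parent`."""
    target = next(n for n in walk(root) if n.name == new_parent)
    moved = []
    for node in list(walk(root)):
        if node is target:
            continue  # nodes already under the target stay where they are
        keep = []
        for child in node.children:
            (moved if child.name == name else keep).append(child)
        node.children = keep
    target.children.extend(moved)
    return root


TRANSFORMERS = {"rename": rename, "delete": delete, "move": move}


def apply_rules(root, rules):
    """Apply transformers consecutively; each gets the tree left by the last."""
    for op, *args in rules:
        root = TRANSFORMERS[op](root, *args)
    return root
```

A rule list such as `[("rename", "zip", "address:ZIP"), ("move", "address:ZIP", "address")]` would first relabel a node and then place it under an `address` node. The sketch does not guard against moving a node under one of its own descendants.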
Social Networks
Social network reconstruction:
– probabilistic inference using spreading activation
– relies on the output of the information extractor (IE) in the form of complex objects
Preliminary results on a set of 50 Spanish emails (phone/name):
– Precision 60% (due to lower recall in IE)
– Precision 85% (achievable with better IE)
– self-healing (with new incoming emails)
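Spreading activation over a graph of extracted objects can be sketched as below. This is a generic textbook-style variant, not the authors' inference procedure: the equal split of energy among neighbours, the `decay` factor and the fixed number of `steps` are assumptions, and the node labels in any example graph are placeholders.

```python
from collections import defaultdict


def spread_activation(graph, start, decay=0.5, steps=2):
    """Rank nodes by the activation they receive from `start`.

    graph: dict mapping a node to the list of its neighbours.
    In each step every active node passes `decay` times its energy
    to its neighbours, split equally among them.
    """
    activation = defaultdict(float)
    activation[start] = 1.0
    frontier = {start: 1.0}
    for _ in range(steps):
        next_frontier = defaultdict(float)
        for node, energy in frontier.items():
            neighbours = graph.get(node, [])
            if not neighbours:
                continue
            share = energy * decay / len(neighbours)
            for neighbour in neighbours:
                next_frontier[neighbour] += share
        for node, energy in next_frontier.items():
            activation[node] += energy
        frontier = next_frontier
    # Highest activation first; ties are broken alphabetically.
    ranked = [(n, a) for n, a in activation.items() if n != start]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))
```

For a person node, the top-ranked phone node would be the candidate phone number for that person; new emails add edges and so can change the ranking over time.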
Social Networks
Results as XML or HTML (via XSL Transformations)
Future:
– DataSource for Search for Partner module
– Improve the recall of Information Extractor
– Exploit multi-pass algorithm and named entity recognition: things learned in the first pass will be used in the next, e.g. possible names with initials, etc.
– Build an enhanced statistical reasoning procedure on top of the present Social Network Extractor/Correlator
Email Research
Acoma
Acoma Architecture
• Connected to email protocols on desktop or server
• No need to change working practices
– Emails are received and sent as before
• Received email is processed by Acoma and enriched with useful information
• Extensible with OSGi modules
[Figure: Acoma architecture. On the desktop or on the server, Acoma sits between the mail client or browser and the mail server (POP3, IMAP, SMTP). Acoma components: Connector to Email Infrastructure, Information Processing and Extraction, System Connectors, Hint Recommendation, Module 1, Module 2, ..., Module n]
System Connectors
• Connection of Acoma to existing systems
– Document Archives
– Internet or Intranet Systems
– Databases
• Access or import of data
• Key-value pair transformation
[Figure: Meta-Connector with Web Connector (Internet), SpreadSheet Connector and Database Connector; a key-value pair is turned into a transformed key-value pair]
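One way to read the key-value pair transformation is that a connector looks up an extracted pair in an existing system and returns a new pair. The sketch below follows that reading; the class interfaces, the CSV stand-in for a spreadsheet and the sample data are assumptions, since the slides do not specify how the connectors are implemented.

```python
import csv
import io


class SpreadSheetConnector:
    """Hypothetical connector backed by CSV text standing in for a spreadsheet."""

    def __init__(self, csv_text, key_column, value_column, in_key, out_key):
        self.in_key = in_key
        self.out_key = out_key
        reader = csv.DictReader(io.StringIO(csv_text))
        self.table = {row[key_column]: row[value_column] for row in reader}

    def transform(self, key, value):
        """Map (in_key, value) to (out_key, looked-up value), or return None."""
        if key == self.in_key and value in self.table:
            return (self.out_key, self.table[value])
        return None


class MetaConnector:
    """Asks every registered connector and collects the transformed pairs."""

    def __init__(self, connectors):
        self.connectors = list(connectors)

    def transform(self, key, value):
        results = []
        for connector in self.connectors:
            pair = connector.transform(key, value)
            if pair is not None:
                results.append(pair)
        return results
```

A web or database connector would expose the same `transform` method, so the meta-connector does not need to know where the data comes from.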
Acoma architecture: Message Post Processing
• Useful hints with links are included in enriched email
• Links lead to internal or external systems (Internet, Intranet)
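A minimal sketch of such post processing, using Python's standard `email` package. The `enrich` function, the "Hints" separator and the plain-text-only handling are assumptions for illustration; Acoma's actual message format is not described in the slides.

```python
from email.message import EmailMessage


def enrich(message, hints):
    """Return a copy of a plain-text message with hint links appended.

    hints: list of (label, url) pairs, e.g. produced by a recommender.
    """
    body = message.get_content()
    hint_lines = [f"- {label}: {url}" for label, url in hints]
    enriched = EmailMessage()
    # Copy only the basic headers; a real system would keep all of them.
    for header in ("From", "To", "Subject"):
        if message[header] is not None:
            enriched[header] = message[header]
    enriched.set_content(
        body.rstrip("\n") + "\n\n--- Hints ---\n" + "\n".join(hint_lines) + "\n"
    )
    return enriched
```

The user keeps reading mail in the usual client; the hints simply arrive as extra lines with links at the end of the message.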
Business objects in Emails
• A study of 6 organizations shows:
– Objects can be identified by patterns and gazetteers
– It is possible to define a set of common objects
• Objects identified:
– Organization: org:Name, org:RegNo, org:TaxNo
– Person: person:Name, person:Function
– Contact: contact:Phone, contact:Email, contact:Webpage
– Address: address:ZIP, address:Street, address:Settlement
– Product: product:Name, product:Module, product:Component, product:BOID
– Document: doc:Invoice, doc:Order, doc:Contract, doc:ChangeRequest
– Inventory: inventory:ResID, inventory:ResType
– Other business object: ID: BOID
Social Networks and Graph Data
• Relations among objects
• Support for search
Email Search Prototype
• Use of Social Network from email
• Includes extracted objects
• Full text of extracted objects
• Related objects discovered and ordered by spreading activation on the social network graph
• Faceted search, navigation
Context based Recommendation, Knowledge Sharing
EMBET, Acoma
EMBET: proactive information and knowledge provision
Objective: recommend and provide information or knowledge to the user in context
• Collaboration among users
• Knowledge sharing
• Active knowledge provision
• Reuse of knowledge: notes and other resources
http://ups.savba.sk/kwfgrid/uaa/
EMBET: Achievements
• Software with the following functionality
– User Problem description
– Displaying Knowledge
– Adding Knowledge
– Knowledge Reuse
– Permanent Notes Storage
– Voting on Notes
• EMBET architecture: Core, GUI
• Context detection
• Context Matching to display information & knowledge
• Plain text analysis using Advanced Semantic Annotation Algorithms – OnTeA
• Theory of different context matching algorithms
Acoma: Hint Recommendation
Information Retrieval and Information Extraction
lectures
IR Lectures
• Introduction to Information Retrieval
• Text Operations, Text Analysis, stemming
• Crawling, link processing
• IR Models, Indexing techniques
• IR Software libraries and systems
• Ranking by Graph Algorithms (PageRank, HITS, …) and Searching
• Information Extraction
• Regular Expressions
• Large Scale Data Processing on MapReduce Architecture
• Multimedia Information Retrieval
• Evaluation Techniques, Precision, Recall
• Google
• Semantics and IR, Semantic Web Standards
Lecture conditions
• Every student gets a project focused on
– Crawling
– Indexing
– Ranking
– Information Extraction
– Large Scale Information Processing
• Students have to consult on the project 3 times during the semester
• Availability of data from day one
• Lectures are available at:
– http://vi.ikt.ui.sav.sk/Témy_prednášok
[Figure: search engine architecture. Labels (translated from Slovak): Link processing, Indexer, Ranking, Search engine, Document base, Links, Document index, Downloader, Text operations, Query, User, List of documents, Internet]