
Enron Emails as Graph Data Corpus for Large-scale Graph Querying Experimentation

Michal Laclavík, Martin Šeleng, Marek Ciglan, Ladislav Hluchý

Motivation and Approach

Motivation
• To exploit information and knowledge included in email communication

Approach
• Social Network Extraction
• Extraction of entities such as People, Organizations, Locations, Contact data
• Forming semantic trees and graphs
• User interaction with graph data

Bratislava, 26 October 2011 GCCP 2011 2

Email Social Networks

• Email Social Networks are less explored
– Several scientific publications: Apache mailing list, Enron, …
– Commercial: Xobni (contacts and attachments)
• Benefit
– Web Social Network Sites: owned by third parties
– Email SN: owned by organization, individual or community
– Additional level of interaction and context is present in emails
• Information and Knowledge
– People, locations, contacts, products, services, attachments or links
– Interactions
– Time
– Discovering relations can bring significant benefits
– Spread of Activation: simple way to discover relations
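
A minimal sketch of spread of activation on a small email graph may help make the last point concrete. The slides do not specify the algorithm's details, so the decay factor, the number of steps, the even split among neighbours, and all node names below are assumptions chosen for illustration only.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def spread_activation(graph, seeds, decay=0.5, steps=2):
    """Propagate activation from seed nodes for a fixed number of steps.

    In each step every node in the frontier passes decay * activation,
    split evenly among its neighbours; received activation accumulates.
    """
    activation = defaultdict(float, seeds)
    frontier = dict(seeds)
    for _ in range(steps):
        next_frontier = defaultdict(float)
        for node, value in frontier.items():
            neighbours = graph.get(node, ())
            if not neighbours:
                continue
            share = decay * value / len(neighbours)
            for neighbour in neighbours:
                next_frontier[neighbour] += share
        for node, value in next_frontier.items():
            activation[node] += value
        frontier = next_frontier
    return dict(activation)


# Hypothetical toy data: two people, two messages, two extracted objects.
edges = [
    ("alice@example.com", "msg1"),
    ("bob@example.com", "msg1"),
    ("msg1", "ACME Corp"),
    ("bob@example.com", "msg2"),
    ("msg2", "555-0100"),
]
graph = build_graph(edges)
scores = spread_activation(graph, {"alice@example.com": 1.0})

# Rank everything except the seed by accumulated activation.
related = sorted(
    (n for n in scores if n != "alice@example.com"),
    key=lambda n: -scores[n],
)
```

With two steps, only nodes within two hops of the seed receive any activation, so the ranking stays local to the seed's neighbourhood.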


Ontea: Information Extraction Tool

• Regex patterns
• Gazetteers
• Results
– Key-value pairs
– Structured into trees, graphs
• Transformers, Configuration
• Automatic loading of extractors
• Visual Annotation Tool
• Integration with external tools
– GATE, Stemmers, Hadoop …
• Multilingual tests
– English, Slovak, Spanish, Italian


http://ontea.sf.net


Business objects in Emails

• Study on 6 organizations shows:
– Objects can be identified by patterns and gazetteers
– It is possible to define a set of common objects
• Objects identified:
– Organization:
  • org:Name, org:RegNo, org:TaxNo
– Person:
  • person:Name, person:Function
– Contact:
  • contact:Phone, contact:Email, contact:Webpage
– Address:
  • address:ZIP, address:Street, address:Settlement
– Product:
  • product:Name, product:Module, product:Component, product:BOID
– Document:
  • doc:Invoice, doc:Order, doc:Contract, doc:ChangeRequest
– Inventory:
  • inventory:ResID, inventory:ResType
– Other business object
  • ID: BOID
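
As an illustration of identifying objects by patterns and gazetteers, here is a small sketch that emits key-value pairs using the key names from the list above. It is not Ontea's code or configuration: the regular expressions, the gazetteer entries, and the sample text are assumptions made for this example.

```python
import re

# Hypothetical patterns; the slides do not give the actual extraction rules.
PATTERNS = {
    "contact:Email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "contact:Phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

# Hypothetical gazetteer: a fixed list of known names per object type.
GAZETTEER = {
    "org:Name": ["ENRON", "UBS Warburg Energy, LLC"],
}


def extract(text):
    """Return (key, value) pairs found by regex patterns and gazetteer lookup."""
    results = []
    for key, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            results.append((key, match.group(0)))
    for key, names in GAZETTEER.items():
        for name in names:
            if name in text:
                results.append((key, name))
    return results


sample = "Michael D. Grigsby, UBS Warburg Energy, LLC, 713-853-7031, mike@example.com"
pairs = extract(sample)
```

Patterns suit objects with a regular surface form (phone numbers, email addresses), while a gazetteer suits closed sets of names such as known organizations.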


Email Social Graph/Network


• Use of Social Network from email
• Includes extracted objects
• Full text of extracted objects
• Related objects discovered and ordered by spread of activation on social network graph
• Faceted search, navigation
• http://ikt.ui.sav.sk/esns/

Email Search Prototype


gSemSearch: Graph based Semantic Search

Email Example

1 Vertex: Doc=>/home/misos/enron/test/6.eml

1 Vertex: Quote=>/6.eml0:1:0

2 Edge: (Doc=>/home/misos/enron/test/6.eml)=>(Quote=>/6.eml0:1:0)

1 Vertex: Paragraph=>/6.eml0:1:0

2 Edge: (Quote=>/6.eml0:1:0)=>(Paragraph=>/6.eml0:1:0)

1 Vertex: Sentence=>/6.eml0:1:0

2 Edge: (Paragraph=>/6.eml0:1:0)=>(Sentence=>/6.eml0:1:0)

1 Vertex: DateTime=>Fri, 8 Mar 2002 06:46:07 -0800 (PST)

2 Edge: (Sentence=>/6.eml0:1:0)=>(DateTime=>Fri, 8 Mar 2002 06:46:07 -0800 (PST))

1 Vertex: Email=>[email protected]

2 Edge: (Sentence=>/6.eml0:1:0)=>(Email=>[email protected])

1 Vertex: Email=>[email protected]

2 Edge: (Sentence=>/6.eml0:1:0)=>(Email=>[email protected])

1 Vertex: Person:Name=>Grigsby, Mike

2 Edge: (Sentence=>/6.eml0:1:0)=>(Person:Name=>Grigsby, Mike)

1 Vertex: Company=>ENRON

2 Edge: (Sentence=>/6.eml0:1:0)=>(Company=>ENRON)

1 Vertex: Person:Name=>Badeer, Robert

2 Edge: (Sentence=>/6.eml0:1:0)=>(Person:Name=>Badeer, Robert)

1 Vertex: Company=>ENRON

2 Edge: (Sentence=>/6.eml0:1:0)=>(Company=>ENRON)

1 Vertex: Person:GivenName=>Robert

2 Edge: (Sentence=>/6.eml0:1:0)=>(Person:GivenName=>Robert)

1 Vertex: Person:Name=>Badeer, Robert

2 Edge: (Sentence=>/6.eml0:1:0)=>(Person:Name=>Badeer, Robert)

1 Vertex: Paragraph=>/6.eml659:19:0

2 Edge: (Quote=>/6.eml0:1:0)=>(Paragraph=>/6.eml659:19:0)

1 Vertex: Sentence=>/6.eml659:19:0

2 Edge: (Paragraph=>/6.eml659:19:0)=>(Sentence=>/6.eml659:19:0)

1 Vertex: Person:Name=>Michael D. Grigsby

2 Edge: (Sentence=>/6.eml659:19:0)=>(Person:Name=>Michael D. Grigsby)

1 Vertex: Company=>UBS Warburg Energy, LLC

2 Edge: (Sentence=>/6.eml659:19:0)=>(Company=>UBS Warburg Energy, LLC)

1 Vertex: TelephoneNumber=>713-853-7031

2 Edge: (Sentence=>/6.eml659:19:0)=>(TelephoneNumber=>713-853-7031)

1 Vertex: TelephoneNumber=>713-408-6256

2 Edge: (Sentence=>/6.eml659:19:0)=>(TelephoneNumber=>713-408-6256)
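
The listing above can be loaded into an in-memory graph with a few lines of parsing. The sketch below assumes each line has the form "<marker> Vertex: Type=>value" or "<marker> Edge: (Type=>value)=>(Type=>value)"; the slides do not explain the leading 1/2 marker, so it is simply skipped. The function names are hypothetical.

```python
from collections import defaultdict


def parse_node(text):
    """Split 'Type=>value' into a (type, value) tuple."""
    node_type, _, value = text.partition("=>")
    return node_type, value


def parse_dump(lines):
    """Parse Vertex/Edge lines into a vertex set and an adjacency map."""
    vertices = set()
    edges = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        _, _, rest = line.partition(" ")       # drop the leading marker
        kind, _, body = rest.partition(": ")
        if kind == "Vertex":
            vertices.add(parse_node(body))
        elif kind == "Edge":
            # Split on the first ')=>(' so values containing parentheses,
            # such as '(PST)', stay intact.
            left, sep, right = body.partition(")=>(")
            if not sep:
                continue
            source = parse_node(left[1:])      # strip opening '('
            target = parse_node(right[:-1])    # strip closing ')'
            edges[source].add(target)
    return vertices, edges


dump = """\
1 Vertex: Doc=>/home/misos/enron/test/6.eml
1 Vertex: Quote=>/6.eml0:1:0
2 Edge: (Doc=>/home/misos/enron/test/6.eml)=>(Quote=>/6.eml0:1:0)
1 Vertex: DateTime=>Fri, 8 Mar 2002 06:46:07 -0800 (PST)
2 Edge: (Sentence=>/6.eml0:1:0)=>(DateTime=>Fri, 8 Mar 2002 06:46:07 -0800 (PST))
1 Vertex: Company=>ENRON
2 Edge: (Sentence=>/6.eml0:1:0)=>(Company=>ENRON)
"""

vertices, edges = parse_dump(dump.splitlines())
```

Storing vertices in a set also collapses repeated entries, such as the two Company=>ENRON lines in the listing, into a single node.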


Enron Graph Corpus Statistics


Description              Size/Count
Corpus Size              2.5 GB
Compressed Corpus Size   217 MB
Messages                 517,377
Nodes                    8,269,278
Edges                    20,383,709
Address                  4,997
CityName                 1,550
Company                  52,286
DateTime                 228,175
Email                    162,754
MoneyAmount              28,992
Paragraph                2,631,292
Person                   167,613
Quote                    533,007
Sentence                 3,800,504
TelephoneNumber          26,013
WebAddress               105,610
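
Two simple ratios derived from the table give a rough feel for the graph's density. The snippet below only divides figures taken from the table; the variable names are illustrative, and the ratios are not claims made in the slides.

```python
# Figures copied from the statistics table.
messages = 517_377
nodes = 8_269_278
edges = 20_383_709

# Average number of edges per node and of nodes per message.
edges_per_node = edges / nodes
nodes_per_message = nodes / messages

print(f"edges per node:    {edges_per_node:.2f}")
print(f"nodes per message: {nodes_per_message:.1f}")
```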

Conclusions and Future Directions

Future Direction: Relations Discovery in Large Graph Data

• Motivation
– Graph/Network data are everywhere: social networks, web, LinkedData, transactions, communication (email, phone).
– Also text can be converted to graph.
– Interconnecting graph data and searching for relations is crucial.
• Approach
– Forming semantic trees and graphs from text, web, communication, databases and LinkedData
– User interaction with graph data in order to achieve integration and data cleansing
– Users will do it if their effort has an immediate impact on search results


SGDB: Simple Graph Database

• Storage for graphs
• Optimized for graph traversing and spread of activation
• Faster than Neo4j for graph traversing operations
• Supports Blueprints API
• https://simplegdb.svn.sourceforge.net/svnroot/simplegdb/Sgdb3
• Graph Database Benchmarks
– Graph Traversal Benchmark for Graph Databases
– http://ups.savba.sk/~marek/gbench.html
– Blueprints API: possibility to test compliant graph databases
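
To show the kind of operation a graph traversal benchmark exercises, here is a sketch of a k-hop breadth-first traversal with a simple timing wrapper. It runs on an in-memory Python dictionary and is not the benchmark referenced above, nor does it use SGDB, Neo4j, or the Blueprints API; the function names and the toy ring graph are assumptions.

```python
import time
from collections import deque


def k_hop_neighbourhood(graph, start, k):
    """Return the set of nodes reachable from start in at most k hops (BFS)."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return seen


def time_traversal(graph, start, k, repeats=100):
    """Return (nodes reached, mean seconds per traversal) over several runs."""
    t0 = time.perf_counter()
    for _ in range(repeats):
        result = k_hop_neighbourhood(graph, start, k)
    elapsed = time.perf_counter() - t0
    return len(result), elapsed / repeats


# Toy graph: a ring of 1000 nodes, each linked to its two neighbours.
ring = {i: [(i - 1) % 1000, (i + 1) % 1000] for i in range(1000)}
reached, seconds = time_traversal(ring, 0, 3)
```

Reporting the number of nodes reached alongside the timing makes it possible to check that two stores under comparison actually traversed the same neighbourhood.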


Conclusion

• Email Archives
– Valuable source of knowledge
– Hidden Social Networks owned by Enterprise or Individual
– Information Extraction and Social Network Analysis can help
• Challenges
– Graph based Querying
– New data and approach for information search
– Relation search
• Applications
– Recommendation and Search in Emails
– Population of Databases (Cold start problem)
– Possibility to extend social network graph with transaction data, processed document repositories and other business data
– Business Intelligence and Knowledge Management
