link analysis: current state of the art

Link Analysis: Current State of the Art Ronen Feldman Computer Science Department Bar-Ilan University, ISRAEL [email protected]


Post on 23-Feb-2016


Page 1: Link Analysis: Current State of the Art

Link Analysis: Current State of the Art

Ronen Feldman, Computer Science Department

Bar-Ilan University, ISRAEL

[email protected]

Page 2: Link Analysis: Current State of the Art

Introduction to Text Mining

Page 3: Link Analysis: Current State of the Art

Find Documents matching the Query

Display Information relevant to the Query

Extract Information from within the documents

Actual information buried inside documents

Long lists of documents

Aggregate over entire collection

Page 4: Link Analysis: Current State of the Art

Let Text Mining Do the Legwork for You

[Diagram labels: Find Material; Read; Understand; Consolidate; Absorb / Act; Text Mining]

Page 5: Link Analysis: Current State of the Art

What Is Unique in Text Mining?

• Feature extraction.
• Very large number of features that represent each of the documents.
• The need for background knowledge.
• Even patterns supported by a small number of documents may be significant.
• Huge number of patterns, hence the need for visualization and interactive exploration.

Page 6: Link Analysis: Current State of the Art

Document Types
• Structured documents
  – Output from CGI
• Semi-structured documents
  – Seminar announcements
  – Job listings
  – Ads
• Free format documents
  – News
  – Scientific papers

Page 7: Link Analysis: Current State of the Art

Text Representations

• Character trigrams
• Words
• Linguistic phrases
• Non-consecutive phrases
• Frames
• Scripts
• Role annotation
• Parse trees
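The two simplest representations in this list, character trigrams and words, can be sketched in a few lines. This is only an illustration: the function names and the tokenization rule (lowercase runs of letters, digits, and apostrophes) are assumptions, not something the slides specify.

```python
import re

def char_trigrams(text):
    """Return all overlapping 3-character substrings of the lowercased text."""
    s = text.lower()
    return [s[i:i + 3] for i in range(len(s) - 2)]

def words(text):
    """Return lowercase word tokens (runs of letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())

sample = "Text mining"
print(char_trigrams(sample))  # 9 trigrams, starting with 'tex', 'ext', 'xt '
print(words(sample))          # ['text', 'mining']
```

Trigrams need no language-specific tokenizer, while the richer representations further down the list (phrases, frames, parse trees) require increasingly heavy linguistic processing.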

Page 8: Link Analysis: Current State of the Art

The 100,000 foot Picture

[Architecture diagram. Components shown: Unstructured Content (web sites/HTML, news feeds, internal documents, other "raw" data); ClearTags Suite (Intelligent Auto-Tagging) with semantic, statistical, and structural tagging; Rich XML/API; Business Intelligence Suites; External Systems Integration (corporate databases, file systems, workflow systems).]

Page 9: Link Analysis: Current State of the Art

Intelligent Auto-Tagging

(c) 2001, Chicago Tribune. Visit the Chicago Tribune on the Internet at http://www.chicago.tribune.com/ Distributed by Knight Ridder/Tribune Information Services. By Stephen J. Hedges and Cam Simpson

<Facility>Finsbury Park Mosque</Facility>

<PersonPositionOrganization>
  <OFFLEN OFFSET="3576" LENGTH="33" />
  <Person>Abu Hamza al-Masri</Person>
  <Position>chief cleric</Position>
  <Organization>Finsbury Park Mosque</Organization>
</PersonPositionOrganization>

<Country>England</Country>

<PersonArrest>
  <OFFLEN OFFSET="3814" LENGTH="61" />
  <Person>Abu Hamza al-Masri</Person>
  <Location>London</Location>
  <Date>1999</Date>
  <Reason>his alleged involvement in a Yemen bomb plot</Reason>
</PersonArrest>

<Country>England</Country>

<Country>France </Country>

<Country>United States</Country>

<Country>Belgium</Country>

<Person>Abu Hamza al-Masri</Person>

<City>London</City>

…….

The Finsbury Park Mosque is the center of radical Muslim activism in England. Through its doors have passed at least three of the men now held on suspicion of terrorist activity in France, England and Belgium, as well as one Algerian man in prison in the United States.

The mosque's chief cleric, Abu Hamza al-Masri lost two hands fighting the Soviet Union in Afghanistan and he advocates the elimination of Western influence from Muslim countries. He was arrested in London in 1999 for his alleged involvement in a Yemen bomb plot, but was set free after Yemen failed to produce enough evidence to have him extradited.

……
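Tagged output of this kind is straightforward to load into records for later link analysis. The sketch below uses Python's standard `xml.etree.ElementTree`; the wrapping `<Tags>` root element, the function name `extract_records`, and the record layout are assumptions made for the example and are not part of the tagging product's API.

```python
import xml.etree.ElementTree as ET

# Fragment taken from the tagged output above, wrapped in a hypothetical
# <Tags> root so that it parses as a single XML document.
TAGGED = """
<Tags>
  <PersonArrest>
    <OFFLEN OFFSET="3814" LENGTH="61" />
    <Person>Abu Hamza al-Masri</Person>
    <Location>London</Location>
    <Date>1999</Date>
    <Reason>his alleged involvement in a Yemen bomb plot</Reason>
  </PersonArrest>
  <Country>England</Country>
</Tags>
"""

def extract_records(xml_text):
    """Turn each top-level tag into a dict with type, fields, and text span."""
    records = []
    for node in ET.fromstring(xml_text.strip()):
        span = node.find("OFFLEN")
        # Child elements other than OFFLEN are the slots of a fact or event.
        fields = {c.tag: (c.text or "").strip() for c in node if c.tag != "OFFLEN"}
        if not fields:  # a plain entity such as <Country> has only text
            fields = {"value": (node.text or "").strip()}
        records.append({
            "type": node.tag,
            "fields": fields,
            "offset": int(span.get("OFFSET")) if span is not None else None,
            "length": int(span.get("LENGTH")) if span is not None else None,
        })
    return records

for record in extract_records(TAGGED):
    print(record["type"], record["fields"])
```

Each event record keeps its character offset and length, so an extracted fact can be traced back to the sentence it came from.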

Page 10: Link Analysis: Current State of the Art

Intelligence Article

Page 11: Link Analysis: Current State of the Art

Google’s Article

Page 12: Link Analysis: Current State of the Art

Merger

Page 13: Link Analysis: Current State of the Art

Leveraging Content Investment

Any type of content • Unstructured textual content (current focus)• Structured data; audio; video (future)

From any source • WWW; file systems; news feeds; etc. • Single source or combined sources

In any format • Documents; PDFs; E-mails; articles; etc • “Raw” or categorized• Formal; informal; combination

Page 14: Link Analysis: Current State of the Art

Information Extraction

Page 15: Link Analysis: Current State of the Art

Relevant IE Definitions
• Entity: an object of interest such as a person or organization.
• Attribute: a property of an entity such as its name, alias, descriptor, or type.
• Fact: a relationship held between two or more entities such as Position of a Person in a Company.
• Event: an activity involving several entities such as a terrorist act, airline crash, management change, new product introduction.

Page 16: Link Analysis: Current State of the Art

IE Accuracy by Information Type

Information Type

Accuracy

Entities 90-98%

Attributes 80%

Facts 60-70%

Events 50-60%

Page 17: Link Analysis: Current State of the Art

MUC Conferences

Conference Year Topic

MUC 1 1987 Naval Operations

MUC 2 1989 Naval Operations

MUC 3 1991 Terrorist Activity

MUC 4 1992 Terrorist Activity

MUC 5 1993 Joint Ventures and Microelectronics

MUC 6 1995 Management Changes

MUC 7 1997 Space Vehicles and Missile Launches

Page 18: Link Analysis: Current State of the Art

Applications of Information Extraction

• Routing of information.
• Infrastructure for IR and for categorization (higher level features).
• Event based summarization.
• Automatic creation of databases and knowledge bases.

Page 19: Link Analysis: Current State of the Art

Where would IE be useful?

• Semi-structured text.
• Generic documents like news articles.
• Most of the information in the document is centered around a set of easily identifiable entities.

Page 20: Link Analysis: Current State of the Art

Approaches for Building IE Systems

• Knowledge Engineering Approach
  – Rules are crafted by linguists in cooperation with domain experts.
  – Most of the work is done by inspecting a set of relevant documents.
  – Can take a lot of time to fine tune the rule set.
  – Best results were achieved with KB based IE systems.
  – Skilled/gifted developers are needed.
  – A strong development environment is a MUST!

Page 21: Link Analysis: Current State of the Art

Approaches for Building IE Systems

• Automatically Trainable Systems
  – The techniques are based on pure statistics and almost no linguistic knowledge.
  – They are language independent.
  – The main input is an annotated corpus.
  – Need a relatively small effort when building the rules; however, creating the annotated corpus is extremely laborious.
  – A huge number of training examples is needed in order to achieve reasonable accuracy.
  – Hybrid approaches can utilize the user input in the development loop.

Page 22: Link Analysis: Current State of the Art

Components of IE System

Tokenization

Morphological and Lexical Analysis

Syntactic Analysis

Domain Analysis

Zoning

Part of Speech Tagging

Sense Disambiguation

Deep Parsing

Shallow Parsing

Anaphora Resolution

Integration

Must

Advisable

Nice to have

Can pass

Page 23: Link Analysis: Current State of the Art

Why is IE Difficult?
• Different languages
  – Morphology is very easy in English, much harder in German and Hebrew.
  – Identifying word and sentence boundaries is fairly easy in European languages, much harder in Chinese and Japanese.
  – Some languages use orthography (like English), while others (like Hebrew, Arabic, etc.) do not have it.
• Different types of style
  – Scientific papers
  – Newspapers
  – Memos
  – Emails
  – Speech transcripts
• Type of document
  – Tables
  – Graphics
  – Small messages vs. books

Page 24: Link Analysis: Current State of the Art

Link Analysis on Large Textual Networks

Social Network Analysis

Page 25: Link Analysis: Current State of the Art

The Kevin Bacon Game
• The game works as follows: given any actor, find a path between the actor and Kevin Bacon that has fewer than 6 edges.
• For instance, Kevin Costner links to Kevin Bacon by using one direct link: both were in JFK.
• Julia Louis-Dreyfus of TV's Seinfeld, however, needs two links to make a path: Julia Louis-Dreyfus was in Christmas Vacation (1989) with Keith MacKechnie. Keith MacKechnie was in We Married Margo (2000) with Kevin Bacon.
• You can play the game by using the following URL: http://www.cs.virginia.edu/oracle/.
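Finding such a path is a breadth-first search over the co-star graph. The sketch below is a minimal illustration that uses only the three films and four actors named in the example above; the function names are assumptions, and a real system would load full cast lists from a movie database.

```python
from collections import deque

# Cast lists restricted to the actors named in the example above.
FILMS = {
    "JFK": ["Kevin Costner", "Kevin Bacon"],
    "Christmas Vacation (1989)": ["Julia Louis-Dreyfus", "Keith MacKechnie"],
    "We Married Margo (2000)": ["Keith MacKechnie", "Kevin Bacon"],
}

def build_costar_graph(films):
    """Map each actor to {co-star: film they share}."""
    graph = {}
    for film, cast in films.items():
        for a in cast:
            for b in cast:
                if a != b:
                    graph.setdefault(a, {})[b] = film
    return graph

def bacon_path(graph, start, target="Kevin Bacon"):
    """Breadth-first search; returns a shortest list of (actor, film, co-star) links."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        actor = queue.popleft()
        if actor == target:
            break
        for costar in graph.get(actor, {}):
            if costar not in parent:
                parent[costar] = actor
                queue.append(costar)
    if target not in parent:
        return None  # no connection at all
    path, node = [], target
    while parent[node] is not None:  # walk back from the target to the start
        prev = parent[node]
        path.append((prev, graph[prev][node], node))
        node = prev
    return path[::-1]

graph = build_costar_graph(FILMS)
for link in bacon_path(graph, "Julia Louis-Dreyfus"):
    print(link)
```

Because breadth-first search explores the graph level by level, the first path that reaches Kevin Bacon is guaranteed to have the fewest links.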

Page 26: Link Analysis: Current State of the Art

The Erdös Number
• A similar idea is also used in the mathematical community and is called the Erdös number of a researcher.
• Paul Erdös (1913–1996) wrote hundreds of mathematical research papers in many different areas, many in collaboration with others.
• There is a link between any two mathematicians if they co-authored a paper.
• Paul Erdös is the root of the mathematical research network and his Erdös number is 0.
• Erdös's co-authors have Erdös number 1.
• People other than Erdös who have written a joint paper with someone with Erdös number 1 but not with Erdös have Erdös number 2, and so on.
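The definition amounts to the breadth-first depth of each author from the root of the co-authorship graph. The sketch below computes all Erdös numbers at once; the toy network and the placeholder author names A to E are hypothetical and serve only to show the levels.

```python
from collections import deque

def erdos_numbers(coauthors, root="Erdös"):
    """Map every author reachable from root to their distance from root."""
    numbers = {root: 0}
    queue = deque([root])
    while queue:
        author = queue.popleft()
        for other in coauthors.get(author, ()):
            if other not in numbers:  # first visit gives the shortest distance
                numbers[other] = numbers[author] + 1
                queue.append(other)
    return numbers

# Hypothetical toy network; A..E are placeholder names, not real authors.
COAUTHORS = {
    "Erdös": ["A", "B"],
    "A": ["Erdös", "C"],
    "B": ["Erdös", "C"],
    "C": ["A", "B", "D"],
    "D": ["C"],
    "E": [],  # never co-authored with anyone in the network
}

print(erdos_numbers(COAUTHORS))
```

Authors with no chain of co-authorships to the root, such as E here, simply do not appear in the result; by convention their Erdös number is undefined or infinite.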

Page 27: Link Analysis: Current State of the Art

Running Example

Page 28: Link Analysis: Current State of the Art

Hijackers by Flight

Flight 77 (Pentagon): Khalid Al-Midhar, Majed Moqed, Nawaq Alhamzi, Salem Alhamzi, Hani Hanjour

Flight 11 (WTC 1): Satam Al Suqami, Waleed M. Alshehri, Wail Alshehri, Mohamed Atta, Abdulaziz Alomari

Flight 175 (WTC 2): Marwan Al-Shehhi, Fayez Ahmed, Ahmed Alghamdi, Hamza Alghamdi, Mohald Alshehri

Flight 93 (PA): Saeed Alghamdi, Ahmed Alhaznawi, Ahmed Alnami, Ziad Jarrahi

Page 29: Link Analysis: Current State of the Art

Automatic layout of networks

Pretty Graph Drawing

Page 30: Link Analysis: Current State of the Art

Motivation I

• In order to display large networks on the screen we need to use automatic layout algorithms. These algorithms display the graphs in an aesthetic way without any user intervention.

• The most commonly used aesthetic criteria are to expose symmetries and to make the drawing as compact as possible or, alternatively, to fill the space available for the drawing.

Page 31: Link Analysis: Current State of the Art

Motivation II

• Many of the “higher-level” aesthetic criteria are implicit consequences of:
  – minimized number of edge crossings
  – evenly distributed edge length
  – evenly distributed vertex positions on the graph area
  – sufficiently large vertex-edge distances
  – sufficiently large angular resolution between edges.

Page 32: Link Analysis: Current State of the Art

Disadvantages of the Spring based methods

• They are computationally expensive and hence minimizing the energy function when dealing with large graphs is computationally prohibitive.

• Since all methods rely on heuristics, there is no guarantee that the “best” layout will be found.

• The methods behave as black boxes and hence it is almost impossible to integrate additional constraints on the layout (such as fixing the positions of certain vertices, or specifying the relative ordering of the vertices).

• Even when the graphs are planar it is quite possible that we will get edge crossings.

• The methods try to optimize just the placement of vertices and edges while ignoring the exact shape of the vertices or the fact that the vertices may have labels.

Page 33: Link Analysis: Current State of the Art

Kamada and Kawai’s (KK) Method

Page 34: Link Analysis: Current State of the Art

Fruchterman Reingold (FR) Method

Page 35: Link Analysis: Current State of the Art

Classic Graph Operations

Page 36: Link Analysis: Current State of the Art

Finding the shortest Path (from Atta)

Page 37: Link Analysis: Current State of the Art

A better Visualization

Page 38: Link Analysis: Current State of the Art

Centrality

Page 39: Link Analysis: Current State of the Art

Degree

• If the graph is undirected, then the degree of a vertex v ∈ V is the number of other vertices that are directly connected to it.
  – degree(v) = |{(v1, v2) ∈ E | v1 = v or v2 = v}|
• If the graph is directed, then we can talk about in-degree or out-degree. An edge (v1, v2) ∈ E in the directed graph leads from vertex v1 to v2.
  – In-degree(v) = |{(v1, v) ∈ E}|
  – Out-degree(v) = |{(v, v2) ∈ E}|
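These set-size definitions translate directly into counting over an edge list. The sketch below is a minimal illustration on a made-up four-vertex graph; the function names and the sample edges are assumptions for the example.

```python
def degree(edges, v):
    """Undirected degree: number of edges that have v as an endpoint."""
    return sum(1 for a, b in edges if v in (a, b))

def in_degree(edges, v):
    """Directed in-degree: number of edges (v1, v) that lead into v."""
    return sum(1 for _, b in edges if b == v)

def out_degree(edges, v):
    """Directed out-degree: number of edges (v, v2) that lead out of v."""
    return sum(1 for a, _ in edges if a == v)

# Hypothetical sample graphs.
UNDIRECTED = [("a", "b"), ("a", "c"), ("b", "d")]
DIRECTED = [("a", "b"), ("a", "c"), ("c", "a"), ("b", "d")]

print(degree(UNDIRECTED, "a"))                           # 2
print(in_degree(DIRECTED, "a"), out_degree(DIRECTED, "a"))  # 1 2
```

In the undirected case each edge is stored once, so an edge list must not contain both (a, b) and (b, a) for the same link.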

Page 40: Link Analysis: Current State of the Art

Degree of the Hijackers

Name Degree
Mohamed Atta 11
Abdulaziz Alomari 11
Ziad Jarrahi 9
Fayez Ahmed 8
Waleed M. Alshehri 7
Wail Alshehri 7
Satam Al Suqami 7
Salem Alhamzi 7
Marwan Al-Shehhi 7
Majed Moqed 7
Khalid Al-Midhar 6
Hani Hanjour 6
Nawaq Alhamzi 5
Ahmed Alghamdi 5
Saeed Alghamdi 3
Mohald Alshehri 3
Hamza Alghamdi 3
Ahmed Alnami 1
Ahmed Alhaznawi 1

Page 41: Link Analysis: Current State of the Art

Closeness Centrality - Motivation

• Degree centrality measures might be criticized because they only take into account the direct connections that an entity has, rather than indirect connections to all other entities.

• One entity might be directly connected to a large number of entities that might be pretty isolated from the network. Such an entity is central only in a local neighborhood of the network.

Page 42: Link Analysis: Current State of the Art

Closeness Centrality
• This measure is based on the calculation of the geodesic distance between the entity and all other entities in the network.
• We can either use directed or undirected geodesic distances between the entities.
• The sum of these geodesic distances for each entity is the "farness" of the entity from all other entities.
• We can convert this into a measure of closeness centrality by taking the reciprocal.
• In addition, we can normalize the closeness measure by dividing it by the closeness measure of the most central entity.

Page 43: Link Analysis: Current State of the Art

Closeness : Formally

• Let d(v1, v2) be the minimal distance between v1 and v2, i.e., the minimal number of edges that we need to traverse on the way from v1 to v2.

$C_i = \frac{|V| - 1}{\sum_{j \neq i} d(v_i, v_j)}$
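As a worked illustration, closeness can be computed as (|V| − 1) divided by the sum of geodesic distances from a vertex to all others, with the distances obtained by breadth-first search. The sketch below assumes an undirected, connected graph with at least two vertices stored as an adjacency dictionary; the function names and the four-vertex path graph are made up for the example.

```python
from collections import deque

def bfs_distances(adj, source):
    """Geodesic (shortest-path) distance, in edges, from source to every vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def closeness(adj, v):
    """(|V| - 1) divided by the farness of v; assumes a connected graph."""
    farness = sum(bfs_distances(adj, v).values())
    return (len(adj) - 1) / farness

# Hypothetical path graph a - b - c - d.
PATH = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}

print(closeness(PATH, "a"))  # 3 / (1 + 2 + 3) = 0.5
print(closeness(PATH, "b"))  # 3 / (1 + 1 + 2) = 0.75
```

The inner vertices of the path score higher than its end points, which matches the intuition that they are closer to everyone else.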

Page 44: Link Analysis: Current State of the Art

Closeness of the Hijackers

Name Closeness

Abdulaziz Alomari 0.6

Ahmed Alghamdi 0.5454545

Ziad Jarrahi 0.5294118

Fayez Ahmed 0.5294118

Mohamed Atta 0.5142857

Majed Moqed 0.5142857

Salem Alhamzi 0.5142857

Hani Hanjour 0.5

Marwan Al Shehhi 0.4615385

Satam Al Suqami 0.4615385

Waleed M. Alshehri 0.4615385

Wail Alshehri 0.4615385

Hamza Alghamdi 0.45

Khalid Al Midhar 0.4390244

Mohald Alshehri 0.4390244

Nawaq Alhamzi 0.3673469

Saeed Alghamdi 0.3396226

Ahmed Alnami 0.2571429

Ahmed Alhaznawi 0.2571429

Page 45: Link Analysis: Current State of the Art

Betweenness Centrality

• Betweenness centrality measures the effectiveness with which a vertex connects the various parts of the network.

• The main idea behind betweenness centrality is that entities that are mediators have more power. Entities that are on many geodesic paths between other pairs of entities are more powerful since they control the flow of information between the pairs.

Page 46: Link Analysis: Current State of the Art

Betweenness - Formally

• gjk = the number of geodesic paths that connect vj with vk
• gjk(vi) = the number of geodesic paths that connect vj with vk and pass via vi

$B_i = \sum_{j < k} \frac{g_{jk}(v_i)}{g_{jk}}$

• Highest possible betweenness: $\frac{(|V| - 1)(|V| - 2)}{2}$

$NB_i = \frac{2 B_i}{(|V| - 1)(|V| - 2)}$
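The definition can be turned into code by counting geodesic paths: a vertex vi lies on a geodesic between vj and vk exactly when d(vj, vi) + d(vi, vk) = d(vj, vk), and the number of such geodesics is the product of the path counts of the two halves. The sketch below is a straightforward cubic-time illustration for small, undirected, connected graphs with at least three vertices, not an optimized algorithm; the function names and sample graphs are assumptions.

```python
from collections import deque
from itertools import combinations

def shortest_path_counts(adj, source):
    """BFS from source: dist[v] = geodesic length, sigma[v] = number of geodesics."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:  # u is a predecessor of w on a geodesic
                sigma[w] += sigma[u]
    return dist, sigma

def betweenness(adj):
    """Return {v: (B_v, NB_v)}: raw and normalized betweenness of each vertex."""
    info = {v: shortest_path_counts(adj, v) for v in adj}
    n = len(adj)
    max_b = (n - 1) * (n - 2) / 2  # highest possible betweenness
    result = {}
    for i in adj:
        dist_i, sigma_i = info[i]
        b = 0.0
        for j, k in combinations([v for v in adj if v != i], 2):
            dist_j, sigma_j = info[j]
            # vi is on a geodesic from vj to vk iff the distances add up.
            if dist_j[i] + dist_i[k] == dist_j[k]:
                b += sigma_j[i] * sigma_i[k] / sigma_j[k]
        result[i] = (b, b / max_b)
    return result

# Hypothetical sample graphs: a star with center "c" and a 4-cycle.
STAR = {"c": ["x", "y", "z"], "x": ["c"], "y": ["c"], "z": ["c"]}
CYCLE = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}

print(betweenness(STAR)["c"])   # (3.0, 1.0): the center mediates every pair
print(betweenness(CYCLE)["b"])  # raw betweenness 0.5
```

In the 4-cycle each vertex sits on one of the two geodesics between its two neighbors, which is why the raw score is 0.5 rather than a whole number.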

Page 47: Link Analysis: Current State of the Art

Betweenness of the Hijackers

Name Betweenness (Bi)
Hamza Alghamdi 0.3059446
Saeed Alghamdi 0.2156863
Ahmed Alghamdi 0.210084
Abdulaziz Alomari 0.1848669
Mohald Alshehri 0.1350763
Mohamed Atta 0.1224783
Ziad Jarrahi 0.0807656
Fayez Ahmed 0.0686275
Majed Moqed 0.0483901
Salem Alhamzi 0.0483901
Hani Hanjour 0.0317955
Khalid Al-Midhar 0.0184832
Nawaq Alhamzi 0
Marwan Al-Shehhi 0
Satam Al Suqami 0
Waleed M. Alshehri 0
Wail Alshehri 0
Ahmed Alnami 0
Ahmed Alhaznawi 0

Page 48: Link Analysis: Current State of the Art

Eigenvector Centrality

• The main idea behind eigenvector centrality is that entities receiving many communications from other well connected entities will be better and more valuable sources of information, and hence be considered central. The eigenvector centrality scores correspond to the values of the principal eigenvector of the adjacency matrix M.

• Formally, the vector v satisfies the equation $\lambda v = M v$, where λ is the corresponding eigenvalue and M is the adjacency matrix.
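One common way to approximate the principal eigenvector is power iteration: multiply a starting vector by the matrix repeatedly and renormalize. The sketch below is an illustration under stated assumptions, not the method used to produce the slides' numbers: it assumes a symmetric 0/1 adjacency matrix of a connected graph, iterates with M + I instead of M (same eigenvectors, but it avoids the oscillation plain power iteration shows on bipartite graphs), and scales the result to unit Euclidean length.

```python
import math

def eigenvector_centrality(M, iterations=1000, tol=1e-12):
    """Approximate the principal eigenvector of M by shifted power iteration."""
    n = len(M)
    v = [1.0] * n
    for _ in range(iterations):
        # One step of multiplication by (M + I).
        w = [v[i] + sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
        if max(abs(a - b) for a, b in zip(v, w)) < tol:
            return w
        v = w
    return v

# Hypothetical star graph: vertex 0 is linked to vertices 1, 2, and 3.
STAR_M = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]

print([round(x, 3) for x in eigenvector_centrality(STAR_M)])
```

For this star the exact answer is 1/√2 for the center and 1/√6 for each leaf, so the center scores about 1.73 times as high as a leaf even though its degree is three times larger.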

Page 49: Link Analysis: Current State of the Art

Eigenvector Centralities of the Hijackers

Name E1

Mohamed Atta 0.518

Marwan Al-Shehhi 0.489

Abdulaziz Alomari 0.296

Ziad Jarrahi 0.246

Fayez Ahmed 0.246

Satam Al Suqami 0.241

Waleed M. Alshehri 0.241

Wail Alshehri 0.241

Salem Alhamzi 0.179

Majed Moqed 0.165

Hani Hanjour 0.151

Khalid Al-Midhar 0.114

Ahmed Alghamdi 0.085

Nawaq Alhamzi 0.064

Mohald Alshehri 0.054

Hamza Alghamdi 0.015

Saeed Alghamdi 0.002

Ahmed Alnami 0

Ahmed Alhaznawi 0

Page 50: Link Analysis: Current State of the Art

Power Centrality
• Given an adjacency matrix M, the power centrality of vertex i (denoted ci) is given by

$c_i = \sum_{j \neq i} M_{ij} (\alpha + \beta c_j)$

• α is used to normalize the score; the normalization parameter is automatically selected so that the sum of squares of the vertices' centralities is equal to the number of vertices in the network.
• β is an attenuation factor that controls the effect that the power centralities of the neighboring vertices should have on the power centrality of the vertex.
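Reading the definition as c_i = Σ_j M_ij (α + β c_j), the scores solve the linear system (I − βM) c = α M 1, so they can be computed by solving with α = 1 and then rescaling until the sum of squares equals the number of vertices. The sketch below follows that reading; the helper names, the small Gaussian-elimination solver, and the three-vertex path graph are assumptions for illustration, and the attenuation values are chosen small enough for this toy graph rather than taken from the slides.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system; choose a different beta")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def power_centrality(M, beta):
    """Solve (I - beta*M) c = M*1, then rescale so that sum(c_i^2) == n."""
    n = len(M)
    A = [[(1.0 if i == j else 0.0) - beta * M[i][j] for j in range(n)]
         for i in range(n)]
    raw = solve(A, [float(sum(row)) for row in M])  # alpha = 1 at this stage
    scale = math.sqrt(n / sum(x * x for x in raw))  # plays the role of alpha
    return [scale * x for x in raw]

# Hypothetical path graph 0 - 1 - 2.
PATH_M = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]

print([round(x, 3) for x in power_centrality(PATH_M, 0.5)])
print([round(x, 3) for x in power_centrality(PATH_M, -0.5)])
```

On this path a positive attenuation factor spreads power toward the end vertices, while a negative one concentrates it entirely in the middle vertex, whose two neighbors have no other connections.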

Page 51: Link Analysis: Current State of the Art

Power - Motivation
• In a similar way to the eigenvector centrality, the power centrality of each vertex is determined by the centrality of the vertices it is connected to.
• By specifying positive or negative values for β, the user can control whether being connected to powerful vertices should have a positive or a negative effect on a vertex's score.
• The rationale for specifying a positive β is that if you are connected to powerful colleagues it makes you more powerful.
• On the other hand, the rationale for a negative β is that powerful colleagues have many connections and hence are not controlled by you, while isolated colleagues have no other sources of information and hence are pretty much controlled by you.

Page 52: Link Analysis: Current State of the Art

Power of the Hijackers

Name Power (β = 0.99) Power (β = -0.99)

Mohamed Atta 2.254 2.214

Marwan Al-Shehhi 2.121 0.969

Abdulaziz Alomari 1.296 1.494

Ziad Jarrahi 1.07 1.087

Fayez Ahmed 1.07 1.087

Satam Al Suqami 1.047 0.861

Waleed M. Alshehri 1.047 0.861

Wail Alshehri 1.047 0.861

Salem Alhamzi 0.795 1.153

Majed Moqed 0.73 1.029

Hani Hanjour 0.673 1.334

Khalid Al-Midhar 0.503 0.596

Ahmed Alghamdi 0.38 0.672

Nawaq Alhamzi 0.288 0.574

Mohald Alshehri 0.236 0.467

Hamza Alghamdi 0.07 0.566

Saeed Alghamdi 0.012 0.656

Ahmed Alnami 0.003 0.183

Ahmed Alhaznawi 0.003 0.183

Page 53: Link Analysis: Current State of the Art

Network Centralization
• In addition to the individual vertex centralization measures, we can assign a number between 0 and 1 that will signal the level of centralization of the whole network.
• The network centralization measures will be computed based on the centralization values of its vertices, and hence for each type of individual centralization measure we will have an associated network centralization measure.
• A network that is structured like a circle will have a network centralization value of 0 (since all vertices have the same centralization value), while a network that is structured like a star will have a network centralization value of 1.
• We will now provide some of the formulas for the different network centralization measures.

Page 54: Link Analysis: Current State of the Art

Degree

$Degree^*(V) = \max_{v \in V} Degree(v)$

$NET_{Degree} = \frac{\sum_{v \in V} \left[ Degree^*(V) - Degree(v) \right]}{(n - 1)(n - 2)}$

where n = |V| is the number of vertices.

For the Hijackers network NetDegree= 0.31
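As a quick check, degree centralization sums the gap between the largest degree and each vertex's degree and divides by (n − 1)(n − 2). The sketch below applies this to a star, a circle, and the degrees listed in the "Degree of the Hijackers" table; the function name is an assumption.

```python
def degree_centralization(degrees):
    """Sum of (max degree - degree) over all vertices, divided by (n-1)(n-2)."""
    n = len(degrees)
    top = max(degrees)
    return sum(top - d for d in degrees) / ((n - 1) * (n - 2))

star = [4, 1, 1, 1, 1]    # one center linked to four leaves
circle = [2, 2, 2, 2, 2]  # every vertex has exactly two neighbors

# Degrees copied from the "Degree of the Hijackers" table.
hijackers = [11, 11, 9, 8, 7, 7, 7, 7, 7, 7, 6, 6, 5, 5, 3, 3, 3, 1, 1]

print(degree_centralization(star))                 # 1.0
print(degree_centralization(circle))               # 0.0
print(round(degree_centralization(hijackers), 2))  # 0.31
```

With 19 vertices and a maximum degree of 11, the numerator is 19 × 11 − 114 = 95 and the denominator is 18 × 17 = 306, which reproduces the reported value of 0.31.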

Page 55: Link Analysis: Current State of the Art

Betweenness

$NB^*(V) = \max_{v \in V} NB(v)$

$NET_{Bet} = \frac{\sum_{v \in V} \left[ NB^*(V) - NB(v) \right]}{n - 1}$

where n = |V| is the number of vertices.

For the Hijackers network NetBet= 0.24
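Betweenness centralization sums the gap between the largest normalized betweenness and each vertex's normalized betweenness and divides by n − 1. The sketch below applies this to the values in the "Betweenness of the Hijackers" table, treating them as normalized scores; that treatment is an assumption, supported by the fact that it reproduces the reported figure. The function name is also an assumption.

```python
def betweenness_centralization(nb):
    """Sum of (max NB - NB) over all vertices, divided by (n - 1)."""
    n = len(nb)
    top = max(nb)
    return sum(top - x for x in nb) / (n - 1)

# Values copied from the "Betweenness of the Hijackers" table (12 non-zero, 7 zero).
hijackers_nb = [
    0.3059446, 0.2156863, 0.210084, 0.1848669, 0.1350763, 0.1224783,
    0.0807656, 0.0686275, 0.0483901, 0.0483901, 0.0317955, 0.0184832,
] + [0.0] * 7

star_nb = [1.0, 0.0, 0.0, 0.0, 0.0]  # center of a star mediates every pair

print(betweenness_centralization(star_nb))                # 1.0
print(round(betweenness_centralization(hijackers_nb), 2))  # 0.24
```

The star again reaches the maximum of 1, and the hijackers network comes out at about 0.241, in line with the value stated above.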

Page 56: Link Analysis: Current State of the Art

Summary Diagram