ECE 417 Guest Lecture Ranking in Heterogeneous Network Min-Hsuan Tsai Apr 23, 2013


Page 1: ECE 417 Guest Lecture Ranking in Heterogeneous Network · PageRank as a Random Surfer • PageRank of a page is the probability of arriving at that page after a large number of random

ECE 417 Guest Lecture

Ranking in Heterogeneous Network

Min-Hsuan Tsai

Apr 23, 2013


Homogeneous (Info) Network

• Information network
  – Node: an entity
  – Edge: a relationship between entities

• Homogeneous info network
  – Nodes are of the same kind
    • Social network (friendships, co-authorships, …)
    • Webpage network
    • Citation network
    • Taxonomy

[Figure: example homogeneous network with nodes A, B, C, D, E]


Heterogeneous (Info) Network

• Multiple types of nodes
  – Social media network
  – Bibliographic network
  – Medical network
  – …

[Figure: example heterogeneous network with nodes A, B, C, D, E]


Ranking in Homogeneous Network

• PageRank for webpage network:
  – PageRank of pages tries to capture page “popularity”
  – Intuitions (impact factor index for journal papers):
    • Links are like citations in literature
    • A page that is cited often can be expected to be more useful in general
  – PageRank is essentially “citation voting”, but improves over simple voting
    • Consider “soft voting” (each page spreads its vote out evenly to its citations)
    • Consider “indirect citation” (being cited by a highly cited paper counts a lot more)
    • Smoothing of citations (every page is assumed to have a non-zero citation count)

Slide courtesy of Prof. ChengXiang Zhai


PageRank as a Random Surfer

• PageRank of a page is the probability of arriving at that page after a large number of random clicks
• At any page,
  1. with probability α, randomly pick a link to follow
  2. with probability (1−α), randomly jump to another page
  α is called the damping factor

• Given
  – p_t(d_i) = probability of visiting page d_i at time t
  – M_ij = probability of going from d_i to d_j (transition matrix)

$$M = \begin{bmatrix} 0 & 0 & 1/2 & 1/2 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1/2 & 1/2 & 0 & 0 \end{bmatrix}, \qquad \sum_{j=1}^{N} M_{ij} = 1$$

• The probability of visiting page d_j at time t+1 is

$$p_{t+1}(d_j) = \alpha \sum_{i=1}^{N} M_{ij}\, p_t(d_i) + (1-\alpha) \sum_{i=1}^{N} \frac{1}{N}\, p_t(d_i)$$

  – First term: reach d_j via following a link
  – Second term: reach d_j via random jumping

[Figure: example web graph with pages d1, d2, d3, d4]


PageRank as a Random Surfer

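• Because the p_t(d_i) sum to 1, the update simplifies to

$$p_{t+1}(d_j) = \alpha \sum_{i=1}^{N} M_{ij}\, p_t(d_i) + (1-\alpha)\frac{1}{N}$$

  or, in matrix form,

$$p_{t+1} = \alpha M^{T} p_t + (1-\alpha)\frac{1}{N}\, e \qquad (e\text{: a vector of 1's})$$

• With α = 0.8:

$$\begin{bmatrix} p_{t+1}(d_1) \\ p_{t+1}(d_2) \\ p_{t+1}(d_3) \\ p_{t+1}(d_4) \end{bmatrix} = 0.8 \begin{bmatrix} 0 & 0 & 0.5 & 0.5 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0.5 & 0.5 & 0 & 0 \end{bmatrix}^{T} \begin{bmatrix} p_t(d_1) \\ p_t(d_2) \\ p_t(d_3) \\ p_t(d_4) \end{bmatrix} + \frac{0.2}{4} \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}$$

• Starting with ANY initial p_0, p_t will converge to

$$p = \begin{bmatrix} 0.3465 & 0.2763 & 0.1886 & 0.1886 \end{bmatrix}^{T}$$

  – p: probability of visiting each page after a long time (PageRank score)

[Figure: example web graph with pages d1, d2, d3, d4]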
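The convergence claim can be checked numerically. The following is a minimal sketch in plain Python, assuming the four-page transition matrix of the example and a uniform starting vector; the function names `pagerank_step` and `pagerank` and the stopping tolerance are illustrative choices, not part of the lecture.

```python
# Transition matrix M of the four-page example (each row sums to 1).
M = [
    [0.0, 0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]


def pagerank_step(p, M, alpha):
    """One random-surfer update: follow a link with prob. alpha, else jump uniformly."""
    n = len(p)
    total = sum(p)  # equals 1 for a probability vector
    return [
        alpha * sum(M[i][j] * p[i] for i in range(n)) + (1 - alpha) * total / n
        for j in range(n)
    ]


def pagerank(M, alpha=0.8, tol=1e-12, max_iter=1000):
    """Iterate the update from a uniform start until the vector stops changing."""
    n = len(M)
    p = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = pagerank_step(p, M, alpha)
        if max(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


scores = pagerank(M)
print([round(s, 4) for s in scores])  # [0.3465, 0.2763, 0.1886, 0.1886]
```

Replacing the uniform start with any other probability vector should give the same scores, which is the point of the slide.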


PageRank as a Markov Chain

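$$p_{t+1}(d_j) = \sum_{i=1}^{N} \left[ \alpha M_{ij} + (1-\alpha)\frac{1}{N} \right] p_t(d_i)$$

$$p_{t+1} = p_t \left( \alpha M + (1-\alpha) E \right) = p_t A \qquad (p_t \text{ written as a row vector})$$

• E is a stochastic matrix whose entries are all 1/N
• A is still a stochastic matrix
• PageRank is a finite Markov chain (time-homogeneous Markov chain w/ finite state space)
  – The states are the pages
  – p_t is the probability distribution that pages are visited at time t
  – The probability of transition follows the Markov property
    • A_ij = Pr(X_{t+1} = d_j | X_t = d_i)

[Figure: example web graph with pages d1, d2, d3, d4]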


PageRank as a Markov Chain

• A finite Markov chain is irreducible if
  – There is a path from every node to every other node (strongly connected).

[Figure: two example graphs, one irreducible and one not irreducible]
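Irreducibility can be tested mechanically by checking that every state reaches every other state. The sketch below assumes the chain is given as a transition matrix (list of rows) and uses breadth-first search; the two example matrices `cycle` and `sink` are hypothetical and are not the graphs drawn on the slide.

```python
from collections import deque


def reachable(adj, start):
    """Return the set of states reachable from `start` (including itself)."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def is_irreducible(P):
    """True if every state can reach every other state through positive-probability edges."""
    n = len(P)
    adj = [[j for j in range(n) if P[i][j] > 0] for i in range(n)]
    return all(len(reachable(adj, s)) == n for s in range(n))


# Hypothetical examples: a 3-cycle, and a chain whose second state is absorbing.
cycle = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
sink = [[0.5, 0.5], [0.0, 1.0]]

print(is_irreducible(cycle))  # True
print(is_irreducible(sink))   # False
```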


PageRank as a Markov Chain

• A state in a finite Markov chain is aperiodic if
  – The greatest common divisor of all cycle lengths is 1
  – Periodicity: k = gcd{n : Pr(X_n = i | X_0 = i) > 0}
  – A finite Markov chain is aperiodic if its states are all aperiodic

[Figure: two chains on states 1 and 2; the first has periodicity 2, the second is aperiodic]

• Periodicity is 2
  – X_0 = 1: 12121212121212…, k = gcd{2, 4, 6, 8, …} = 2
  – X_0 = 2: 21212121212121…, k = gcd{2, 4, 6, 8, …} = 2
• Aperiodic
  – X_0 = 2: 122222222222…, k = gcd{2, 3, 4, 5, …} = 1
  – X_0 = 1: 121122112221, k = gcd{3, 4, 5, 6, …} = 1
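The periodicity definition above can be evaluated directly for small chains. This is a sketch under stated assumptions: the chain is a transition matrix given as a list of rows, states are 0-indexed, and the step cutoff `max_steps` is a hypothetical bound that is ample for two-state examples but is not a proven general limit.

```python
from math import gcd


def period(P, state, max_steps=None):
    """gcd of all step counts n <= max_steps with Pr(X_n = state | X_0 = state) > 0."""
    n = len(P)
    if max_steps is None:
        max_steps = 2 * n * n  # assumed cutoff, fine for the small examples here
    # reach[j] is True if state j can be occupied after the current number of steps.
    reach = [j == state for j in range(n)]
    g = 0
    for step in range(1, max_steps + 1):
        reach = [any(reach[i] and P[i][j] > 0 for i in range(n)) for j in range(n)]
        if reach[state]:
            g = gcd(g, step)
    return g


# Two states that always swap: every return takes an even number of steps.
swap = [[0.0, 1.0], [1.0, 0.0]]
# Same chain, but the second state may also stay put (a self-loop).
lazy = [[0.0, 1.0], [0.5, 0.5]]

print(period(swap, 0), period(swap, 1))  # 2 2
print(period(lazy, 0), period(lazy, 1))  # 1 1
```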


PageRank as a Markov Chain

• Stationary distribution
  – When the distribution does not change anymore

$$p_{t+1} = p_t A = p_t$$

  – The stationary distribution p captures the average probability of visiting each page (PageRank score)
  – If a finite Markov chain is irreducible and aperiodic, then the largest eigenvalue of the transition matrix will be equal to 1 and all the other eigenvalues will be strictly less than 1 in magnitude (Perron-Frobenius theorem)
  – Meaning, there exists a unique stationary distribution
    • The stationary distribution is the left eigenvector of the transition matrix A corresponding to eigenvalue 1


PageRank as a Markov Chain

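• E guarantees the Markov chain is irreducible and aperiodic
  – A unique stationary distribution exists!

$$p_{t+1} = p_t A, \qquad A = \alpha M + (1-\alpha) E$$

• For the four-page example (d1, d2, d3, d4) with α = 0.8:

$$A = 0.8 M + (1 - 0.8) E = 0.8 \begin{bmatrix} 0 & 0 & 1/2 & 1/2 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1/2 & 1/2 & 0 & 0 \end{bmatrix} + 0.2 \begin{bmatrix} 1/4 & 1/4 & 1/4 & 1/4 \\ 1/4 & 1/4 & 1/4 & 1/4 \\ 1/4 & 1/4 & 1/4 & 1/4 \\ 1/4 & 1/4 & 1/4 & 1/4 \end{bmatrix} = \begin{bmatrix} 0.05 & 0.05 & 0.45 & 0.45 \\ 0.85 & 0.05 & 0.05 & 0.05 \\ 0.05 & 0.85 & 0.05 & 0.05 \\ 0.45 & 0.45 & 0.05 & 0.05 \end{bmatrix}$$

• Left eigenvector of A corresponding to eigenvalue 1:

$$p = \begin{bmatrix} 0.3465 & 0.2763 & 0.1886 & 0.1886 \end{bmatrix}$$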
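As a quick check of the numbers on this slide, the sketch below builds A = αM + (1−α)E for the four-page example and verifies that the reported scores are (to four decimals) unchanged by one more step p → pA. The helper name `left_multiply` is an illustrative choice.

```python
alpha = 0.8
M = [
    [0.0, 0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]
n = len(M)

# E: every entry is 1/N, so every row of E is the uniform jump distribution.
E = [[1.0 / n] * n for _ in range(n)]

# A mixes link-following (M) and random jumping (E); all entries become positive.
A = [[alpha * M[i][j] + (1 - alpha) * E[i][j] for j in range(n)] for i in range(n)]


def left_multiply(p, A):
    """Row vector times matrix: (pA)_j = sum_i p_i * A_ij."""
    return [sum(p[i] * A[i][j] for i in range(len(p))) for j in range(len(A[0]))]


p = [0.3465, 0.2763, 0.1886, 0.1886]  # scores reported on the slide
pA = left_multiply(p, A)

print([[round(x, 2) for x in row] for row in A])
print([round(x, 4) for x in pA])
```

Because every entry of A is positive, every page can reach every other page in one step, which is why adding E makes the chain both irreducible and aperiodic.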


PageRank by hand

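• Two ways to calculate the PageRank scores
  – Let α = 1 for now, so that A = M

$$A = M = \begin{bmatrix} 0 & 1 \\ 0.5 & 0.5 \end{bmatrix}$$

  – Make sure it is irreducible and aperiodic
  – Solve directly (finding the left eigenvector corresponding to eigenvalue 1)

$$\begin{bmatrix} p_1 & p_2 \end{bmatrix} = \begin{bmatrix} p_1 & p_2 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 0.5 & 0.5 \end{bmatrix}, \quad p_1 + p_2 = 1 \;\Rightarrow\; p = \begin{bmatrix} 1/3 & 2/3 \end{bmatrix}$$

  – Power method

$$A^{2} = \begin{bmatrix} 0.5 & 0.5 \\ 0.25 & 0.75 \end{bmatrix},\; A^{4} = \begin{bmatrix} 0.375 & 0.625 \\ 0.3125 & 0.6875 \end{bmatrix},\; A^{8} = \begin{bmatrix} 0.3359 & 0.6641 \\ 0.3320 & 0.6680 \end{bmatrix},\; A^{16} = \begin{bmatrix} 0.3333 & 0.6667 \\ 0.3333 & 0.6667 \end{bmatrix}$$

$$p = \begin{bmatrix} 0.3333 & 0.6667 \end{bmatrix}$$

[Figure: two-page graph with nodes 1 and 2]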
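Both hand calculations can be reproduced in a few lines. This sketch assumes the two-page matrix A = [[0, 1], [0.5, 0.5]] from the slide; `matmul` is a small illustrative helper, and the closed form used for the direct solution is the standard stationary distribution of a two-state chain.

```python
def matmul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


A = [[0.0, 1.0], [0.5, 0.5]]

# Power method by repeated squaring: A^2, A^4, A^8, A^16.
power = A
for _ in range(4):
    power = matmul(power, power)

# Direct solution for a two-state chain [[1-a, a], [b, 1-b]]:
# solving p = pA with p1 + p2 = 1 gives p = [b/(a+b), a/(a+b)].
a, b = A[0][1], A[1][0]
direct = [b / (a + b), a / (a + b)]

print([[round(x, 4) for x in row] for row in power])  # [[0.3333, 0.6667], [0.3333, 0.6667]]
print([round(x, 4) for x in direct])                  # [0.3333, 0.6667]
```

Every row of a high power of A approaches the stationary distribution, so the two methods agree.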


Extensions of PageRank

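• M can be any stochastic matrix
  – Transition probabilities need not be uniformly distributed over the outgoing links
  – Can be based on content similarity (e.g., VisualRank)

$$M_{ij} = \frac{1}{Z_i}\, \mathrm{sim}(I_i, I_j)$$

    • Z_i is the normalization factor to make M row-stochastic

$$M = \begin{bmatrix} 0 & 0 & 0.2 & 0.8 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0.3 & 0.7 & 0 & 0 \end{bmatrix}$$

[Figure: graph with pages d1, d2, d3, d4 and edge weights 0.2, 0.8, 1, 0.3, 0.7]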
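The normalization by Z_i amounts to dividing each row of a similarity matrix by its row sum. The sketch below assumes Z_i is the sum of the similarities in row i; the raw similarity values in `sim` are hypothetical numbers chosen so that the result matches the weighted matrix on the slide.

```python
def row_normalize(S):
    """Turn non-negative similarities into a row-stochastic matrix: M_ij = S_ij / Z_i."""
    M = []
    for row in S:
        z = sum(row)  # Z_i, the normalization factor for row i
        if z == 0:
            raise ValueError("a row with no similarity mass cannot be normalized")
        M.append([s / z for s in row])
    return M


# Hypothetical raw similarities between four items (not taken from the lecture).
sim = [
    [0, 0, 1, 4],
    [2, 0, 0, 0],
    [0, 5, 0, 0],
    [3, 7, 0, 0],
]

M = row_normalize(sim)
print(M)
```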


Extensions of PageRank

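• E can be any stochastic matrix as well
  – The jump distribution need not be uniform (e.g., topic-sensitive PageRank)

$$E = e \begin{bmatrix} 0.5 & 0 & 0.5 & 0 \end{bmatrix} = \begin{bmatrix} 0.5 & 0 & 0.5 & 0 \\ 0.5 & 0 & 0.5 & 0 \\ 0.5 & 0 & 0.5 & 0 \\ 0.5 & 0 & 0.5 & 0 \end{bmatrix}$$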
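With a non-uniform E whose rows all equal a jump distribution v, the random jump lands on page j with probability v_j instead of 1/N. The sketch below assumes the four-page matrix M from the earlier example, α = 0.8, and the jump vector v = (0.5, 0, 0.5, 0); the function name `personalized_pagerank` is an illustrative choice.

```python
M = [
    [0.0, 0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]


def personalized_pagerank(M, v, alpha=0.8, iters=300):
    """Iterate p_j <- alpha * sum_i M_ij p_i + (1 - alpha) * v_j from the start p = v."""
    n = len(M)
    p = list(v)
    for _ in range(iters):
        p = [
            alpha * sum(M[i][j] * p[i] for i in range(n)) + (1 - alpha) * v[j]
            for j in range(n)
        ]
    return p


v = [0.5, 0.0, 0.5, 0.0]  # jumps land only on pages d1 and d3
biased = personalized_pagerank(M, v)
print([round(x, 4) for x in biased])
```

In the uniform case d3 and d4 receive the same score; here d3 is a jump target and d4 is not, so d3 should score higher.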


Heterogeneous (Info) Network

• An instance
  – Social media network

|                     | Image Domain | Text Domain | Social Domain |
|---------------------|--------------|-------------|---------------|
| Content             | Photos or videos shared by the actors | Tags or comments attached to the images | Users share, tag, comment the images; user groups whose members favor images |
| Homogeneous edges   | Content-based visual similarity to the other image node | Semantic similarities between each text content | Members attending the group |
| Heterogeneous edges | Images may be described, tagged, commented (I-T links) by users (I-A and T-A links), or favored by user groups (I-A links) (applies across all three domains) | | |


Ranking in Heterogeneous Network

• Problems with directly applying a PageRank-like algorithm to the heterogeneous network
  – Edges are of different types of measurements
    • How to deal with an image node that has an edge to another image node as well as an edge to a user node with a favor link?
  – Cross-domain edges are usually sparse
    • Random surfer would easily get trapped in one domain
    • Meaning slow convergence!

[Figure: Image 1 and Image 2 joined by a "similar" edge; User 1 joined to an image by a "like" edge]


Our approach

• Decomposed Heterogeneous Network
  – Then we can employ homogeneous network analysis with a PageRank-like algorithm


Decomposition of heterogeneous network

• A heterogeneous network can be decomposed into homogeneous sub-networks with heterogeneous links

[Figure: three homogeneous sub-networks connected by heterogeneous links]


Our approach

• Decomposed Heterogeneous Network
  – Then we can employ homogeneous network analysis with a PageRank-like algorithm

• Augmented Similarity Function (ASF)
  – Content-based similarity + link-based similarity
  – Exploit heterogeneous linkage for knowledge propagation
  – Extend the idea that two objects are similar if they are linked to similar objects
  – May also consider the relevance importance of each object to the query


Decomposition of heterogeneous network

• Heterogeneous links for knowledge propagation

[Figure: three homogeneous sub-networks passing hints to one another through heterogeneous links]


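Augmented similarity function

• Content-based similarity
  – (Homogeneous) similarity metric based on the content of the two objects
• Link-based similarity
  – Based on the linkages of the two objects to the same object
  – Based on the linkages of the two objects to the different objects
  – Based on the weights of the linked objects (s2, s3)

[Figure: objects a and b with contents C_a and C_b; cross-domain linkages l_ac and l_bc to a shared object c, and l_ad and l_be to objects d and e with similarity S_de; relevance importance r_c, r_d, r_e on the linked objects]

$$S_{ab} = f(s_1, s_2, s_3)$$

$$s_1 = g_1(C_a, C_b)$$

$$s_2 = g_2(l_{ac}, l_{bc}) \quad\text{or, with relevance importance,}\quad s_2 = g_2(l_{ac}, l_{bc}, r_c)$$

$$s_3 = g_3(S_{de}, l_{ad}, l_{be}) \quad\text{or, with relevance importance,}\quad s_3 = g_3(S_{de}, l_{ad}, l_{be}, r_d, r_e)$$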


Algorithms - SocialRank

• For each homogeneous domain
  – Augmented similarities obtained from heterogeneous domains
  – Content-based linkage with query-sensitive random surfer model

$$p_{t+1} = \alpha\, \tilde{S}_h D^{-1} p_t + (1-\alpha)\, z$$

    • \tilde{S}_h: augmented similarity matrix (D: normalization matrix)

• Iterative improvement on augmented similarities across different homogeneous sub-networks


Query-biased vector

• To make the random walk query-sensitive, we use a biased vector z

$$p_{t+1} = \alpha\, \tilde{S}_h D^{-1} p_t + (1-\alpha)\, z$$

  – The biased vector influences the final probability distribution by letting the random walk restart from the biased vector with probability (1−α)

• For a query with keywords, the biased vector is the semantic similarity between the query and the content of the text nodes

$$z_{v_j} = g\big(\mathrm{sim}(q, v_j)\big), \qquad q \in Q$$
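The query-biased walk can be sketched as a random walk with restart. Everything concrete below is an assumption made for illustration: the 3×3 similarity matrix `S`, the restart vector `z`, the choice of D as the diagonal matrix of column sums (so that S D⁻¹ is column-stochastic), and the function names. It is not the lecture's implementation.

```python
def column_normalize(S):
    """Compute S D^{-1}, assuming D holds the column sums of S on its diagonal."""
    n = len(S)
    col_sums = [sum(S[i][j] for i in range(n)) for j in range(n)]
    return [[S[i][j] / col_sums[j] for j in range(n)] for i in range(n)]


def random_walk_with_restart(S, z, alpha=0.8, iters=200):
    """Iterate p <- alpha * (S D^{-1}) p + (1 - alpha) * z, starting from p = z."""
    W = column_normalize(S)
    n = len(z)
    total = sum(z)
    z = [value / total for value in z]  # make the restart vector a distribution
    p = list(z)
    for _ in range(iters):
        p = [
            alpha * sum(W[i][j] * p[j] for j in range(n)) + (1 - alpha) * z[i]
            for i in range(n)
        ]
    return p


# Hypothetical similarities among three nodes, and a query that matches only node 0.
S = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.2],
    [0.0, 0.2, 1.0],
]
z = [1.0, 0.0, 0.0]

ranking = random_walk_with_restart(S, z)
print([round(x, 4) for x in ranking])
```

The node matching the query scores highest, and the scores of the other nodes fall off with their similarity distance from it.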


Experimental results – Dataset collection

• None of the existing image datasets contains social linkage structure

• We crawled the Flickr site to construct the Flickrgroup dataset
  – The groups in Flickr are communities of people who share an interest in a target subject
  – The group members would favor photos which are closely related to the target subject


Experimental results – Dataset collection

• Statistics of the Flickrgroup dataset
  – 140 user groups
  – 118,000 images favored by those 140 user groups
    • Encoded with 106 visual words
    • Similarity measured by
      – COT: number of co-occurrence terms
      – TF: term frequency of co-occurrence terms
      – TF-IDF: term frequency of co-occurrence terms penalized by document frequency
  – 150,000 unique tags associated with these images
    • Top 5000 most frequent ones selected as the codebook tags


Evaluation Measures for Ranking

• Precision
  – The fraction of retrieved documents that are relevant to the query

• Recall
  – The fraction of the documents that are relevant to the query that are successfully retrieved


Evaluation Measures for Ranking

• Precision-recall curve
  – For each number of retrieved documents (N), we obtain a (precision, recall) pair
  – By controlling N to obtain recall varying from 0 to 1, we can connect the (precision, recall) pairs to obtain a curve

• AP
  – Average of precision with recall values from 0 to 1
  – Area under the precision-recall curve

• AP@K
  – Average of precision with a cut-off N from 1 to K
  – Often used when the number of relevant documents is too large

• mAP(@K)
  – Average of AP(@K) over a number of queries
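The measures above can be computed from a ranked list of 0/1 relevance labels. The sketch below follows the definitions as worded on these slides, in particular AP@K as the plain average of precision over cut-offs N = 1..K (other texts average precision only at the ranks of relevant documents). The function names and the example ranking are illustrative assumptions.

```python
def precision_at(relevance, n):
    """Fraction of the top-n retrieved items that are relevant."""
    return sum(relevance[:n]) / n


def recall_at(relevance, n, total_relevant):
    """Fraction of all relevant items that appear in the top n."""
    return sum(relevance[:n]) / total_relevant


def ap_at_k(relevance, k):
    """Average of precision over cut-offs N = 1..K, as defined on the slide."""
    return sum(precision_at(relevance, n) for n in range(1, k + 1)) / k


def map_at_k(rankings, k):
    """Mean of AP@K over several queries, one relevance list per query."""
    return sum(ap_at_k(r, k) for r in rankings) / len(rankings)


# Hypothetical ranking: 1 = relevant, 0 = not relevant; 4 relevant items exist in total.
relevance = [1, 0, 1, 1, 0]

print(precision_at(relevance, 2))         # 0.5
print(recall_at(relevance, 5, 4))         # 0.75
print(round(ap_at_k(relevance, 3), 4))    # 0.7222
print(map_at_k([[1, 1], [0, 0]], 2))      # 0.5
```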


Flickr Groups

[Figure: a Flickr group, showing its favored image pool and its members]


Flickr Group – Favored images


Experimental results – Image Ranking (I)

• An 11.5% improvement (on AP@100) over the state-of-the-art ranking methods


Experimental results – Image Ranking (II)

• Consistent improvements at various levels of recall and numbers of retrieved images


Experimental results – Image Ranking (III)

[Figure: top-ranked images for the queries Cat, Bird, and Car. Upper row: VisualRank; bottom row: SocialRank]