PageRank without hyperlinks: Structural re-ranking using links induced by language models
Oren Kurland and Lillian Lee
Cornell
SIGIR 2005
Objective
IR re-ranking on non-hypertext documents using PageRank
Use language-model-based weights in the PageRank matrix
Method Outline
Initial retrieval using KL-Divergence model (use Lemur)
Generate PageRank matrix from top k retrieved documents according to the paper’s model
Do the PageRank iterations
Re-rank the documents
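A minimal end-to-end sketch of this pipeline in Python, assuming the weight matrix and scores defined on the following slides (the function names, NumPy usage, and iteration count are illustrative assumptions, not the paper's code):

import numpy as np

def rerank(docs, retrieval_scores, W, iters=50):
    # W[o, g]: smoothed weight of the induced edge o -> g over the
    # top-k retrieved documents; normalize rows so W is stochastic.
    W = W / W.sum(axis=1, keepdims=True)
    cen = np.full(len(docs), 1.0 / len(docs))  # uniform starting centrality
    for _ in range(iters):                     # PageRank-style power iteration
        cen = cen @ W                          # influx of weighted centrality
    combined = cen * np.asarray(retrieval_scores)  # Cen(d;G) * p_d(q)
    return [docs[i] for i in np.argsort(-combined)]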
Concept 1: Generation Probability
The probability of a word w occurring in a document or document collection x according to the maximum likelihood model is defined below, where tf(w ∈ x) is the term frequency of w in x.
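The standard relative-frequency estimate (given here as an assumed reconstruction):

p_x^{MLE}(w) = \frac{tf(w \in x)}{\sum_{w'} tf(w' \in x)}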
Concept 1: Generation Probability (Cont.)
Using the Dirichlet-smoothed model, we get
p_c^{MLE}(w) is the MLE probability of w in the entire document collection c
μ controls the influence of p_c^{MLE}(w)
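The standard Dirichlet-smoothed estimate, consistent with these definitions (an assumed reconstruction):

p_x^{[\mu]}(w) = \frac{tf(w \in x) + \mu \cdot p_c^{MLE}(w)}{\sum_{w'} tf(w' \in x) + \mu}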
Two ways of defining the probability of a document x generating a sequence of words w1w2…wn are
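(the following pair is an assumed reconstruction, not necessarily the paper's exact forms: the direct sequence likelihood and its length-normalized, geometric-mean variant)

p_x(w_1 w_2 \ldots w_n) = \prod_{j=1}^{n} p_x^{[\mu]}(w_j)

\tilde{p}_x(w_1 w_2 \ldots w_n) = \left( \prod_{j=1}^{n} p_x^{[\mu]}(w_j) \right)^{1/n}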
Concept 1: Generation Probability (Cont.)
KL-Divergence combines the previous two functions into
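(an assumed reconstruction, consistent with the definitions above)

p_d^{KL}(s) = \exp\left( -D\left( p_s^{MLE}(\cdot) \,\|\, p_d^{[\mu]}(\cdot) \right) \right)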
That is the generation probability function used in this paper: the probability p_d(s) of document d generating word sequence s.
Concept 2: Top Generators
The top generators of a document s are the documents d with the highest generation probability p_d(s)
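A minimal Python sketch of this selection; gen_prob(d, s), standing for p_d(s), and the cutoff alpha are illustrative assumptions:

def top_generators(doc, candidates, gen_prob, alpha=10):
    # Rank candidate documents d by their probability p_d(doc) of
    # generating doc; keep the alpha highest-scoring ones.
    ranked = sorted(candidates, key=lambda d: gen_prob(d, doc), reverse=True)
    return ranked[:alpha]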
Graph Generation
We can construct a graph from a collection of documents
Two ways of defining the edges and edge weights are
Graph Generation (Cont.)
o→g denotes an edge from document o to document g.
The first definition assigns a uniform weight of 1 to every edge pointing from a document to its top generators.
The second definition uses the generation probability as the edge weight.
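A sketch of both edge definitions, reusing the hypothetical top_generators and gen_prob from above (the argument order of gen_prob, generator first, is an assumption):

def build_graph(docs, gen_prob, alpha=10, uniform=True):
    # One edge o -> g for each document g among o's top generators.
    # uniform=True: weight 1 for every such edge (first definition);
    # uniform=False: the generation probability as weight (second definition).
    edges = {}
    for o in docs:
        others = [d for d in docs if d is not o]
        for g in top_generators(o, others, gen_prob, alpha):
            edges[(o, g)] = 1.0 if uniform else gen_prob(g, o)
    return edges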
Weight-Smoothing
We can smooth the edge weights to give non-zero weights for all edges
D_init is the set of documents we wish to re-rank; λ controls the influence of the two components
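A PageRank-style convex combination consistent with this description (the exact form is an assumed reconstruction):

wt^{[\lambda]}(o \to g) = \lambda \cdot \frac{1}{|D_{init}|} + (1 - \lambda) \cdot wt(o \to g)

Every ordered pair of documents in D_init then gets a non-zero edge weight.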
Concept 3: Graph Centrality
Now that we have a graph, how do we define the centrality (importance) of each node (document)?
Influx version:
The centrality of a node is simply the sum of the weights of the edges pointing to it
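In symbols, over the graph G built above:

Cen(d; G) = \sum_{o \to d} wt(o \to d)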
Concept 3: Graph Centrality (Cont.)
Recursive Influx Version:
Centrality is recursively defined
This is the PageRank version
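A reconstruction consistent with the recursive definition:

Cen(d; G) = \sum_{o \to d} wt(o \to d) \cdot Cen(o; G)

In practice this is computed by power iteration, as in the pipeline sketch under Method Outline.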
Concept 3: Graph Centrality (Cont.)
We get a total of 4 models if we consider uniform/non-uniform weights and non-recursive/recursive influxes.
Recall that uniform weights mean edge weights with values 0 or 1.
                     Recursive   Non-recursive
Uniform weight       R-U-In      U-In
Non-uniform weight   R-W-In      W-In
Combining Centrality with Initial Relevance Score
Centrality scores are computed on the set of initially retrieved documents.
The initially retrieved documents also have relevance scores assigned by the KL-divergence retrieval model.
We can combine the two scores:
Cen(d;G) is the centrality score; p_d(q) is the retrieval score.
The combination is just a simple product of the two scores.
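In symbols: score(d) = Cen(d; G) \cdot p_d(q)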
Final combinations of models
Now we have 8 models:
U-In
W-In
U-In+LM (centrality * retrieval score)
W-In+LM
R-U-In
R-W-In
R-U-In+LM
R-W-In+LM
Experiment 1: Model Comparison
4 TREC corpora.
Re-rank the top 50 retrieved documents.
Upper-bound performance: move all relevant documents among the top 50 to the front.
Initial ranking: optimize the parameter for best precision at 1000.
Optimal baseline: performance of the best parameter setting.
Experiment 1 Results
Highlighted values indicate the best performances
The R-W-In+LM model has the best performance on average
Experiment 2: Cosine Similarity
Top generators and edge weights are computed using the language model p_d(s).
Replace p_d(s) with the tf*idf cosine similarity between two documents.
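A standard tf*idf cosine sketch in Python (tokenized input and a precomputed idf table are assumptions; this is not the paper's implementation):

import math
from collections import Counter

def cosine_tfidf(doc_a, doc_b, idf):
    # doc_a, doc_b: lists of tokens; idf: term -> inverse document frequency.
    va = {w: tf * idf.get(w, 0.0) for w, tf in Counter(doc_a).items()}
    vb = {w: tf * idf.get(w, 0.0) for w, tf in Counter(doc_b).items()}
    dot = sum(va[w] * vb.get(w, 0.0) for w in va)
    norm_a = math.sqrt(sum(x * x for x in va.values()))
    norm_b = math.sqrt(sum(x * x for x in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0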
Experiment 2: Results
One marker in the results table means the language model is better than cosine similarity by at least 5%; the other means cosine similarity is better than the language model by 5%.
Language model is better overall
Experiment 3: Centrality Alternatives
The best re-ranking model so far is R-W-In+LM: score(d) = Cen(d;G) · p_d(q).
What if we replace Cen(d;G) with other scores?
Experiment 3: Results
Again, R-W-In+LM wins
Conclusion
PageRank can be applied to documents without explicit links by inducing links with language models; the recursive, weighted, LM-combined model (R-W-In+LM) performs best overall.