PageRank without hyperlinks: Structural re-ranking using links induced by language models
Oren Kurland and Lillian Lee
Cornell
SIGIR 2005
Objective
IR re-ranking on non-hypertext documents using PageRank
Use language-model-based weights in the PageRank matrix
Method Outline
Initial retrieval using KL-Divergence model (use Lemur)
Generate PageRank matrix from top k retrieved documents according to the paper’s model
Do the PageRank iterations
Re-rank the documents
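A minimal end-to-end sketch of this pipeline in Python, assuming the weight matrix and scores defined on the following slides (the function names, NumPy usage, and iteration count are illustrative assumptions, not the paper's code):

import numpy as np

def rerank(docs, retrieval_scores, W, iters=50):
    # W[o, g]: smoothed weight of the induced edge o -> g over the
    # top-k retrieved documents; normalize rows so W is stochastic.
    W = W / W.sum(axis=1, keepdims=True)
    cen = np.full(len(docs), 1.0 / len(docs))  # uniform starting centrality
    for _ in range(iters):                     # PageRank-style power iteration
        cen = cen @ W                          # influx of weighted centrality
    combined = cen * np.asarray(retrieval_scores)  # Cen(d;G) * p_d(q)
    return [docs[i] for i in np.argsort(-combined)]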
Concept 1: Generation Probability
The probability of a word w occurring in a document or document collection x according to the maximum likelihood model is defined below, where tf(w ∈ x) is the term frequency of w in x.
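The standard relative-frequency estimate (given here as an assumed reconstruction):

p_x^{MLE}(w) = \frac{tf(w \in x)}{\sum_{w'} tf(w' \in x)}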
Concept 1: Generation Probability (Cont.)
Using the Dirichlet-smoothed model, we get
p_c^{MLE}(w) is the MLE probability of w in the entire document collection c
μ controls the influence of p_c^{MLE}(w)
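The standard Dirichlet-smoothed estimate, consistent with these definitions (an assumed reconstruction):

p_x^{[\mu]}(w) = \frac{tf(w \in x) + \mu \cdot p_c^{MLE}(w)}{\sum_{w'} tf(w' \in x) + \mu}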
Two ways of defining the probability of a document x generating a sequence of words w1w2…wn are
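(the following pair is an assumed reconstruction, not necessarily the paper's exact forms: the direct sequence likelihood and its length-normalized, geometric-mean variant)

p_x(w_1 w_2 \ldots w_n) = \prod_{j=1}^{n} p_x^{[\mu]}(w_j)

\tilde{p}_x(w_1 w_2 \ldots w_n) = \left( \prod_{j=1}^{n} p_x^{[\mu]}(w_j) \right)^{1/n}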
Concept 1: Generation Probability (Cont.)
KL-Divergence combines the previous two functions into
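(an assumed reconstruction, consistent with the definitions above)

p_d^{KL}(s) = \exp\left( -D\left( p_s^{MLE}(\cdot) \,\|\, p_d^{[\mu]}(\cdot) \right) \right)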
That is the generation probability function used in this paper: the probability p_d(s) of document d generating word sequence s.
Concept 2: Top Generators
The top generators of a document s are the documents d with the highest generation probability p_d(s)
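A minimal Python sketch of this selection; gen_prob(d, s), standing for p_d(s), and the cutoff alpha are illustrative assumptions:

def top_generators(doc, candidates, gen_prob, alpha=10):
    # Rank candidate documents d by their probability p_d(doc) of
    # generating doc; keep the alpha highest-scoring ones.
    ranked = sorted(candidates, key=lambda d: gen_prob(d, doc), reverse=True)
    return ranked[:alpha]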
Graph Generation
We can construct a graph from a collection of documents
Two ways of defining the edges and edge weights are
Graph Generation (Cont.)
o→g denotes an edge from document o to document g.
The first definition assigns a uniform weight of 1 to every edge pointing from a document to its top generators.
The second definition uses the generation probability as the edge weight.
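A sketch of both edge definitions, reusing the hypothetical top_generators and gen_prob from above (the argument order of gen_prob, generator first, is an assumption):

def build_graph(docs, gen_prob, alpha=10, uniform=True):
    # One edge o -> g for each document g among o's top generators.
    # uniform=True: weight 1 for every such edge (first definition);
    # uniform=False: the generation probability as weight (second definition).
    edges = {}
    for o in docs:
        others = [d for d in docs if d is not o]
        for g in top_generators(o, others, gen_prob, alpha):
            edges[(o, g)] = 1.0 if uniform else gen_prob(g, o)
    return edges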
Weight-Smoothing
We can smooth the edge weights to give non-zero weights for all edges
D_init is the set of documents we wish to re-rank; λ controls the influence of the two components
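A PageRank-style convex combination consistent with this description (the exact form is an assumed reconstruction):

wt^{[\lambda]}(o \to g) = \lambda \cdot \frac{1}{|D_{init}|} + (1 - \lambda) \cdot wt(o \to g)

Every ordered pair of documents in D_init then gets a non-zero edge weight.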
Concept 3: Graph Centrality
Now that we have a graph, how do we define the centrality (importance) of each node (document)?
Influx version:
The centrality of a node is simply the sum of the weights of the edges pointing to it
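In symbols, over the graph G built above:

Cen(d; G) = \sum_{o \to d} wt(o \to d)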
Concept 3: Graph Centrality (Cont.)
Recursive Influx Version:
Centrality is recursively defined
This is the PageRank version
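A reconstruction consistent with the recursive definition:

Cen(d; G) = \sum_{o \to d} wt(o \to d) \cdot Cen(o; G)

In practice this is computed by power iteration, as in the pipeline sketch under Method Outline.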
Concept 3: Graph Centrality (Cont.)
We get a total of 4 models if we consider uniform/non-uniform weights and non-recursive/recursive influxes.
Recall that uniform weights mean edge weights with values 0 or 1.
                     Recursive   Non-recursive
Uniform weight       R-U-In      U-In
Non-uniform weight   R-W-In      W-In
Combining Centrality with Initial Relevance Score
Centrality scores are computed on the set of initially retrieved documents.
The initially retrieved documents also have relevance scores assigned by the KL-divergence retrieval model.
We can combine the two scores:
Cen(d;G) is the centrality score; p_d(q) is the retrieval score.
The combination is just a simple product of the two scores.
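In symbols: score(d) = Cen(d; G) \cdot p_d(q)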
Final combinations of models
Now we have 8 models:
U-In
W-In
U-In+LM (centrality * retrieval score)
W-In+LM
R-U-In
R-W-In
R-U-In+LM
R-W-In+LM
Experiment 1: Model Comparison
4 TREC corpora.
Re-rank the top 50 retrieved documents.
Upper-bound performance: move all relevant documents among the top 50 to the front.
Initial ranking: optimize the parameter for best precision at 1000.
Optimal baseline: performance of the best parameter setting.
Experiment 1 Results
Highlighted values indicate the best performances
The R-W-In+LM model has the best performance on average
Experiment 2: Cosine Similarity
Top generators and edge weights are computed using the language model p_d(s).
Replace p_d(s) with the tf*idf cosine similarity between two documents.
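A standard tf*idf cosine sketch in Python (tokenized input and a precomputed idf table are assumptions; this is not the paper's implementation):

import math
from collections import Counter

def cosine_tfidf(doc_a, doc_b, idf):
    # doc_a, doc_b: lists of tokens; idf: term -> inverse document frequency.
    va = {w: tf * idf.get(w, 0.0) for w, tf in Counter(doc_a).items()}
    vb = {w: tf * idf.get(w, 0.0) for w, tf in Counter(doc_b).items()}
    dot = sum(va[w] * vb.get(w, 0.0) for w in va)
    norm_a = math.sqrt(sum(x * x for x in va.values()))
    norm_b = math.sqrt(sum(x * x for x in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0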
Experiment 2: Results
One marker in the results table means the language model is better than cosine similarity by at least 5%; the other means cosine similarity is better than the language model by 5%.
Language model is better overall
Experiment 3: Centrality Alternatives
The best re-ranking model so far is R-W-In+LM: score(d) = Cen(d;G) · p_d(q).
What if we replace Cen(d;G) with other scores?
Experiment 3: Results
Again, R-W-In+LM wins
Conclusion
PageRank can be applied to documents without explicit links by inducing links with language models; the recursive, weighted, LM-combined model (R-W-In+LM) performs best overall.