the efficacy of collusions in web ranking and the countermeasurements
DESCRIPTION
The Efficacy of Collusions in Web Ranking and the Countermeasurements. Hui Zhang University of Southern California. Outline. Problem Statement. PageRank algorithm : a brief introduction. Study of PageRank’s robustness to collusion. - PowerPoint PPT PresentationTRANSCRIPT
The Efficacy of Collusions in Web Ranking
and the Countermeasurements
Hui ZhangUniversity of Southern California
04/22/23 USC CS599 2
• Problem Statement.
• PageRank algorithm : a brief introduction.
• Study of PageRank’s robustness to collusion.
• Adaptive-resetting: make PageRank robust to collusion.
• Conclusions.
Outline
04/22/23 USC CS599 3
Search Engine Optimization (SEO)
04/22/23 USC CS599 4
Web spam [Gyongyin et al. 2004]
• Web spamming refers to actions intended to mislead search engines and give some pages higher ranking than they deserve.
• A spammer will play with two factors which decide the rank score of a page in a query:
Relevance – textual similarity between the query and a page.
Importance – the global popularity of a page, which is query-independent.
04/22/23 USC CS599 5
Collusion in Web ranking
• A manipulation of the hyperlink structure by a group of users with the intention of improving the rating one or more users in the group.
04/22/23 USC CS599 6
PageRank [Brin1998] • An eigenvector-based rating scheme to
rank hypertext documents on the WWW.
• An iterative algorithm to calculate the importance of a web page based on the importance of its parent pages.
• Can be applied to other systems than WWW.
04/22/23 USC CS599 7
PageRank: random walk modelnode
referential linkThe walker
1/21/3
X
Y Z
• As time goes on, the expected percentage of steps the walker is at each node v converges to the PageRank weight PR(v).
With prob. (1-), I will continue the walk to a random successor node.
: resetting probability
With prob. , I will restart the walk at a random node.
: resetting probability
04/22/23 USC CS599 8
PageRank: is it collusion-proof?• Can a node easily boost its rank by
manipulating its out-going links with others’?
I’m not colluding!
I’m not colluding!
I’m not colluding!
I’m not colluding!
04/22/23 USC CS599 9
• In the system of node group G, for a subgroup G’,
the amplification factor Amp(G’) =
)'()'(
GWGW
in
G
':
)()'(Gii
G iPRGW
))'(1(|||'|)1(
)()()'(
',',:),(
GWGG
ioutiPRGW G
jiGjGijiin
WG(G’) =PR(i)+PR(j)
Win(G’) =
real group weight
“actual” group weight
Amp(G): a metric on group collusion
PR(x)3
PR(x)3 (1-) PR(y)
2PR(y)4PR(y)
4
+ (1-)
+ 2N
(1-W(G’))xy
GG’ i j
: resetting probability
04/22/23 USC CS599 10
Answer for (1+1 = ?) in PageRank
• In the original PageRank system,
where is the resetting probability.
2)'(,' GAmpGG
1)'(,1
|||'|, GAmp
GGwhenlySpecifical
04/22/23 USC CS599 11
Two experimental topologies• W, a Web link topology
Contains the link structure of upwards of 80 million URLs.
Source: the Stanford WebBase.
• B, a weblog blogrolling topologyContains the blogrolling structure of upwards of 72,000 blogs.
Source: www.blogstreet.com, the XML-RPC webblog service.
04/22/23 USC CS599 12
• Model a small number of web pages simultaneously colluding.
• Methodology:
•100 colluding groups of 200 nodes;
•Each colluding group has the circle topology consisting of two nodes with adjacent ranks;
•Arbitrarily chose node pairs originally ranked around 1000th, 2000th, …, 100000th.
= 0.15.
Experiment 1: Collusion200
04/22/23 USC CS599 13
Experiment result of Collusion200 (I)
Figure 1: W - Amplification factors of the 100 colluding groups in Collusion200.
04/22/23 USC CS599 14
Experiment result of Collusion200 (III)
Figure 2: W – new PR rank after Collusion200.
Old rank: 1005th
New rank: 67th
Old rank: 10001th
New rank: 450th
Old rank: 100009th
New rank: 5038th
04/22/23 USC CS599 15
There is a long flat portion…
Figure 3: The PR weight distribution of 4 topologies.
04/22/23 USC CS599 16
• Identifying colluding groups is unlikely to be computationally tractable.•The densest k-subgraph problem[Feige et al. 1997].
•The classical CLIQUE problem.
•The problem of finding hiding large cliques in random graphs[Juels 1998].
Next step: how to detect collusions?
04/22/23 USC CS599 17
• Theorem on Hardness.
Max G’G Amp(G’) is a NP-Hard problem.
Hardness on Amp
04/22/23 USC CS599 18
• The revisit intervals of the random walk on a colluding node will likely to have a large variance compared to its expectation.
A counterexample: a star+dangling circle topology
0
12
N N+1
N-1N-2
Figure E:
How about using finer statistics of the random walk
04/22/23 USC CS599 19
An observation on collusion behaviors• To increase their PR weight, i.e., the
stationary weight in the random walk, the colluding nodes will stall the random walk.
• When the resetting probability increases, the colluding nodes must suffer a significant drop in PR weight.
• Therefore, we expect the PR weight of colluding nodes to be highly correlated with 1/ (the average walk length), while that of non-colluding nodes is relatively insensitive to the change in .
GG’
04/22/23 USC CS599 20
An intuitive examplenode
referential link
04/22/23 USC CS599 21
An intuitive examplenode
referential link
A colluding group
04/22/23 USC CS599 22
An intuitive examplenode
referential link
A colluding group
• A colluding node x: PR(x) = , and co-co(PR(x), 1/ ) 1. (co-co: correlation coefficient)
• A non-colluding node y: PR(x) = , and co-co(PR(y), 1/ ) 0.
NKNK1
)(1
NKNK1
)(
x
y
N: the system size; K: the colluding group size; K << N.
04/22/23 USC CS599 23
• Part I – collusion detection:Given the topology, calculate the PR vector under different values.
{} = {0.0375, 0.05, 0.075, 0.15, 0.3, 0.45, 0.6}, default = 0.15.
Calculate the correlation coefficient between the curve of each node x's PR weight and the curve of 1/ . Label it as co-co(x).
• Part II – personalization:Calculate each node x's out-link personalized- = F(default, co-co(x)).
Exponential function FExp= .
Linear function FLinear= default+(0.5-default)*co-co(x)
The final PR weight vector is calculated with these personalized resetting values.
))(0.1( xcocodefault
Adaptive-resetting scheme
04/22/23 USC CS599 24
Experiment result of Collusion200 (IV)
Figure 5: W - Amplification factors of the 100 colluding groups in Collusion200.
04/22/23 USC CS599 25
Experiment result of Collusion200 (VI)
Figure 6: W – new PR rank after Collusion200.
04/22/23 USC CS599 26
• Model various colluding subgraphs.• Methodology:
3 colluding groups:node
referential link
G1: 10-node ring G2: 10-node star topology
G3: 2-node ring
Experiment 2: Collusion22
04/22/23 USC CS599 27
Experiment result of Collusion22 (I)
Figure 7: Amplification factors of the 3 colluding groups in Collusion22.
04/22/23 USC CS599 28
Experiment result of Collusion22 (II)
Figure 8: W – new PR weight after Collusion22.
04/22/23 USC CS599 29
New top-25 URL list in W Dropped outDropping New
04/22/23 USC CS599 30
Conclusions• Simple collusions lead to effective Web ranking
improvement.
• A simple scheme based on PageRank algorithm effectively counteracts Web ranking collusions.
04/22/23 USC CS599 31
Backup slides
04/22/23 USC CS599 32
• A means of describing social trust networks.
• The basic concept is a democratic meritocracy.
• A rating system is used to evaluate individual members, and those results are then collated to produce a consensus about the merit of any given member.
• Examples: Livejournal, Friendster, eBay, Advogato
Reputation systems [Okita2003]
04/22/23 USC CS599 33
• Assume N pages.• Assign all pages the initial value 1/N• Let Nu be the out-degree of Page u, Rank(v)
the importance of Page v, Bv the set of pages pointing to v. • Basic algorithmv Rank(v) =
vBuuNuRank /)(
• Enhanced algorithm against rank sinksv Rank(v) =
vBu
u NNuRank //)()1(
: damping factor
PageRank algorithm [Brin1998]
04/22/23 USC CS599 34
Figure 4: the co-co PDF distribution in W and B: the [0, 0.1] range actually corresponds to [-1, 0.1] range.
Co-co distribution in real-world graphs
04/22/23 USC CS599 35
Figure A: W – new PR weight after Collusion200.
Experiment result of Collusion200 (II)
04/22/23 USC CS599 36
Figure B: B – new PR rank after Collusion200
Experiment result of Collusion200 (VII)
04/22/23 USC CS599 37
Figure C: B – new PR weight after Collusion200
Experiment result of Collusion200 (X)
04/22/23 USC CS599 38
Figure 6: W – new PR weight after Collusion200.
Experiment result of Collusion200 (V)
04/22/23 USC CS599 39
Correlation coefficient
04/22/23 USC CS599 40
Figure D: W – new PR rank after Collusion22.
Experiment result of Collusion22 (III)