adversarial information retrieval
DESCRIPTION
Adversarial Information Retrieval. The Manipulation of Web Content. Introduction. Examples TrustRank and Other Methods. What is Adversarial IR?. Gathering, Indexing, Retrieving and Ranking Information Subset of the information has been manipulated maliciously Financial Gain. - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Adversarial Information Retrieval](https://reader035.vdocuments.us/reader035/viewer/2022070411/568147aa550346895db4e5ac/html5/thumbnails/1.jpg)
Adversarial Information Retrieval
The Manipulation of Web Content
![Page 2: Adversarial Information Retrieval](https://reader035.vdocuments.us/reader035/viewer/2022070411/568147aa550346895db4e5ac/html5/thumbnails/2.jpg)
Introduction
• Examples• TrustRank and Other Methods
![Page 3: Adversarial Information Retrieval](https://reader035.vdocuments.us/reader035/viewer/2022070411/568147aa550346895db4e5ac/html5/thumbnails/3.jpg)
What is Adversarial IR?
• Gathering, Indexing, Retrieving and Ranking Information
• Subset of the information has been manipulated maliciously
• Financial Gain
![Page 4: Adversarial Information Retrieval](https://reader035.vdocuments.us/reader035/viewer/2022070411/568147aa550346895db4e5ac/html5/thumbnails/4.jpg)
What is the Goal of AIR?
• Detect the bad sites or communities
• Improve precision on search engines by eliminating the bad guys
![Page 5: Adversarial Information Retrieval](https://reader035.vdocuments.us/reader035/viewer/2022070411/568147aa550346895db4e5ac/html5/thumbnails/5.jpg)
Simplest form
• First generation engines relied heavily on tf/idf – The top-ranked pages for the query maui resort were the
ones containing the most maui’s and resort’s• SEOs responded with dense repetitions of chosen terms
– e.g., maui resort maui resort maui resort – Often, the repetitions would be in the same color as the
background of the web page• Repeated terms got indexed by crawlers• But not visible to humans on browsers
Pure word density cannot be trusted as an IR signal
![Page 6: Adversarial Information Retrieval](https://reader035.vdocuments.us/reader035/viewer/2022070411/568147aa550346895db4e5ac/html5/thumbnails/6.jpg)
Search Engine Spamming
• Link-spam• Link-bombing• Spam Blogs• Comment Spam• Keyword Spam• Malicious Tagging
![Page 7: Adversarial Information Retrieval](https://reader035.vdocuments.us/reader035/viewer/2022070411/568147aa550346895db4e5ac/html5/thumbnails/7.jpg)
Spamming
• Online tutorials for “search engine persuasion techniques”– “How to boost your PageRank”
• Artificial links and Web communities• Latest trend: “Google bombing”– a community of people create (genuine) links with
a specific anchor text towards a specific page. Usually to make a political point
![Page 8: Adversarial Information Retrieval](https://reader035.vdocuments.us/reader035/viewer/2022070411/568147aa550346895db4e5ac/html5/thumbnails/8.jpg)
Google Bombing
![Page 9: Adversarial Information Retrieval](https://reader035.vdocuments.us/reader035/viewer/2022070411/568147aa550346895db4e5ac/html5/thumbnails/9.jpg)
Our Focus
• Link Manipulation
![Page 10: Adversarial Information Retrieval](https://reader035.vdocuments.us/reader035/viewer/2022070411/568147aa550346895db4e5ac/html5/thumbnails/10.jpg)
Trust Rank
• Observation– Good pages tend to link good pages.– Human is the best spam detector
• Algorithm– Select a small subset of pages and let a human
classify them– Propagate goodness of pages
10
![Page 11: Adversarial Information Retrieval](https://reader035.vdocuments.us/reader035/viewer/2022070411/568147aa550346895db4e5ac/html5/thumbnails/11.jpg)
Propagation
• Trust function T – T(p) returns the propability that p is a good page
• Initial values– T(p) = 1, if p was found to be a good page– T(p) = 0, if p was found to be a spam page
• Iterations:– propagate Trust following out-links – only a fixed number of iteration M.
11
![Page 12: Adversarial Information Retrieval](https://reader035.vdocuments.us/reader035/viewer/2022070411/568147aa550346895db4e5ac/html5/thumbnails/12.jpg)
Propagation (2)
• Problem with propagation– Pages reachable from
good seeds might not be good
– the further away we are from good seed pages, the less certain we are that a page is good.
12
– solution: reduce trust as we move further away from the good seed pages (trust attenuation).
![Page 13: Adversarial Information Retrieval](https://reader035.vdocuments.us/reader035/viewer/2022070411/568147aa550346895db4e5ac/html5/thumbnails/13.jpg)
Trust attenuation – dampening
– Propagate a dampened trust score ß < 1 at first step
– At n-th step propagate a trust of ß^n
13
![Page 14: Adversarial Information Retrieval](https://reader035.vdocuments.us/reader035/viewer/2022070411/568147aa550346895db4e5ac/html5/thumbnails/14.jpg)
Trust attenuation – splitting
– Parent trust value is splittet among child nodes– Observation: the more the links the less the
care in choosing them– Mix damp and split? ß^n(splitted trust)
14
![Page 15: Adversarial Information Retrieval](https://reader035.vdocuments.us/reader035/viewer/2022070411/568147aa550346895db4e5ac/html5/thumbnails/15.jpg)
Selection – Inverse PageRank
• The seed set S should:– be as small as possible– cover a large part of the Web
• Covering is related to out-links in the very same way PageRank is related to in-link– Inverse PageRank !
• Perform PageRank on a graph with inverted links– G' = (V, E') where (p,q) E' (q, p) E.
15
![Page 16: Adversarial Information Retrieval](https://reader035.vdocuments.us/reader035/viewer/2022070411/568147aa550346895db4e5ac/html5/thumbnails/16.jpg)
Algorithm
1. Select seeds ( s ) and order by preference
2. Invoke oracle (human) on the first L seeds,
3. Initialize and normalize oracle response d
4. Compute TrustRank score (as in PageRank formula):t* = ß ·T·t*+(1−ß) ·d
T is the adjacency matrix of the Web Graph.
ß is the dampening factor. (usually .85)
16
![Page 17: Adversarial Information Retrieval](https://reader035.vdocuments.us/reader035/viewer/2022070411/568147aa550346895db4e5ac/html5/thumbnails/17.jpg)
Algorithm - example
– s = [0.08, 0.13, 0.08, 0.10, 0.09, 0.06, 0.02]– Ordering = [2, 4, 5, 1, 3, 6, 7]
– L=3 {2, 4, 5} d=[0, 0.5, 0, 0.5, 0, 0, 0]– ß=0.85 M=20– t* = [0, 0.18, 0.12, 0.15, 0.13, 0.05,
0.05]
– NB. max=0.18– Issues with page 1 and 5
17
![Page 18: Adversarial Information Retrieval](https://reader035.vdocuments.us/reader035/viewer/2022070411/568147aa550346895db4e5ac/html5/thumbnails/18.jpg)
Issues with TrustRank
• Coverage of the seed set may not be broad enough Many different topics exist, each with good pages
• TrustRank has a bias towards communities that are heavily represented in the seed set inadvertently helps spammers that fool these
communities
![Page 19: Adversarial Information Retrieval](https://reader035.vdocuments.us/reader035/viewer/2022070411/568147aa550346895db4e5ac/html5/thumbnails/19.jpg)
Bias towards larger partitions
• Divide the seed set into n partitions, each has mi nodes
• ti : TrustRank score calculated by using partition i as the seed set
• t : TrustRank score calculated by using all the partitions as one combined seed set
nn
i i
nn
i i
n
i i
tm
mt
m
mt
m
mt
1
2
1
21
1
1 ...
![Page 20: Adversarial Information Retrieval](https://reader035.vdocuments.us/reader035/viewer/2022070411/568147aa550346895db4e5ac/html5/thumbnails/20.jpg)
Basic ideas• Use pages labeled with topics as seed pages
Pages listed in highly regarded topic directories
• Trust should be propagated by topics link between two pages is usually created in a
topic specific context
![Page 21: Adversarial Information Retrieval](https://reader035.vdocuments.us/reader035/viewer/2022070411/568147aa550346895db4e5ac/html5/thumbnails/21.jpg)
Topical TrustRank• Topical TrustRank
Partition the seed set into topically coherent groups TrustRank is calculated for each topic Final ranking is generated by a combination of these topic
specific trust scores• Note
TrustRank is essentially biased PageRank Topical TrustRank is fundamentally the same as Topic-
Sensitive PageRank, but for demoting spam
![Page 22: Adversarial Information Retrieval](https://reader035.vdocuments.us/reader035/viewer/2022070411/568147aa550346895db4e5ac/html5/thumbnails/22.jpg)
Combination of trust scores
• Simple summation default mechanism just seen
• Quality bias Each topic weighted by a bias factor Summation of these weighted topic scores
One possible bias: Average PageRank value of the seed pages of the topic
ntttt ...21
nntwtwtwt ...2211
![Page 23: Adversarial Information Retrieval](https://reader035.vdocuments.us/reader035/viewer/2022070411/568147aa550346895db4e5ac/html5/thumbnails/23.jpg)
Further Improvements
• Seed Weighting Instead of assigning an equal weight to each seed page,
assign a weight proportional to its quality / importance• Seed Filtering
Filtering out low quality pages that may exist in topic directories
• Finer topics Lower layers of the topic directory