Internet Search Engine freshness by Web Server help
Presented by:
Barilari Alessandro
Mining di Dati Web, Alessandro Barilari
Introduction
Search engines are an important source of information and keeping them up-to-date will result in more accurate answers to search queries.
Search engines create their databases by probing web servers on a per-URL basis, with little help from the web servers.
Main Problem
There is no standard for facilitating the push of updates from servers to search engines:
– It takes up to six months for a new page to be indexed by popular web search engines;
– The data which is indexed by the search engines is often stale.
Solution…
Having web servers help keep search engines fresh is favorable for web sites, search engines and users alike.
…and its problems
The number of updates per second is very large.
Must balance between:
– The number of interactions between web sites and search engines, and
– The freshness of the search engines.
Page rank impact
Pages which are popular will have higher page ranks:
– Use popularity in addition to age and freshness to compute the mismatch between a web site and a search engine
Summary
– Definitions and Cost Model
– Algorithm
– Analysis
– Practical issues
Some definitions
Update: an update u to a file f is a modification to f that has been flushed to the disk;
Propagation of an update: an update is said to be propagated when the web site has informed the search engine about the update. The search engine may or may not retrieve that update;
Meta-update propagation: at any time t, let U(t) be the set of unpropagated updates. In a meta-update propagation, the web site informs the search engine about all the updates in U(t);
Some definitions (2)
Weight of a file: given a content file f, its weight $\omega_f$ (non-negative) denotes the importance of the file; the weights are chosen such that, with F the set of content files:
$$\sum_{f \in F} \omega_f = 1$$
Last_modification_time(u,t): the last time before t when the file f(u) was updated.
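One way to obtain weights that satisfy this constraint is to normalize raw importance scores. The sketch below assumes such scores (for instance popularity counts) are available per file. The function name `normalize_weights` and the sample file names are hypothetical and do not come from the slides.

```python
def normalize_weights(scores):
    """Scale non-negative importance scores so that the weights sum to 1."""
    if any(s < 0 for s in scores.values()):
        raise ValueError("scores must be non-negative")
    total = sum(scores.values())
    if total == 0:
        raise ValueError("at least one score must be positive")
    return {name: s / total for name, s in scores.items()}


# Hypothetical files and scores, for illustration only.
weights = normalize_weights({"index.html": 6, "news.html": 3, "about.html": 1})
```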
The Cost Model
Components:
– Communication cost;
– Opportunity cost: represents the staleness of the search engine data as compared to the data on the web server.
CPU cost is ignored
Opportunity cost (OC)
Given an unpropagated update u to a content file f(u) with weight $\omega_{f(u)}$, the opportunity cost for update u at time t is:
$$OC(u,t) = \omega_{f(u)} \times \bigl(t - \mathrm{last\_modification\_time}(u,t)\bigr)$$
Definition for meta-update propagation:
$$OC_{tot}(t) = \sum_{u \in U(t)} OC(u,t)$$
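The two opportunity-cost definitions can be sketched in a few lines. This is a minimal illustration, not the authors' code: the `Update` record and the sample values are assumptions, and time is treated as a plain number.

```python
from dataclasses import dataclass


@dataclass
class Update:
    weight: float         # weight of the updated file f(u)
    last_modified: float  # last_modification_time(u, t), assumed <= t


def opportunity_cost(u, t):
    """OC(u, t): file weight times the time elapsed since the last modification."""
    return u.weight * (t - u.last_modified)


def total_opportunity_cost(unpropagated, t):
    """OC_tot(t): sum of OC(u, t) over the unpropagated updates U(t)."""
    return sum(opportunity_cost(u, t) for u in unpropagated)


# Two hypothetical unpropagated updates, evaluated at t = 10.
pending = [
    Update(weight=0.5, last_modified=4.0),
    Update(weight=0.25, last_modified=8.0),
]
oc_tot = total_opportunity_cost(pending, 10.0)
```

The older update dominates the total, even though both files changed only once, because staleness grows linearly with elapsed time.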
Communication cost (CC)
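$\mathrm{size}_{f(u)}(t)$: the size of file f(u) at time t;
$$CC(u,t) = k \times \mathrm{size}_{f(u)}(t) \times \delta(u,t)$$
$$\delta(u,t) = \begin{cases} 1, & \text{if } u \text{ has been propagated before } t, \\ 0, & \text{if } u \text{ has not been propagated before } t. \end{cases}$$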
Potential Communication cost (PCC)
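Represents the communication cost which would need to be incurred in case update u were to be propagated after time t. With $\delta(u,t)$ equal to 1 if u has been propagated before t and 0 otherwise:
$$PCC(u,t) = k \times \mathrm{size}_{f(u)}(t) \times \bigl(1 - \delta(u,t)\bigr)$$
$$PCC_{tot}(t) = \sum_{u \in U(t)} PCC(u,t)$$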
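The communication cost and the potential communication cost are complementary: for any update exactly one of them is non-zero. The sketch below illustrates this. It assumes k is a constant cost per unit of file size (the slides do not say what k stands for), and the function names and sample sizes are hypothetical.

```python
def communication_cost(size, propagated, k=1.0):
    """CC(u, t): k * size if u was propagated before t, otherwise 0."""
    return k * size if propagated else 0.0


def potential_communication_cost(size, propagated, k=1.0):
    """PCC(u, t): k * size if u is still unpropagated at t, otherwise 0."""
    return 0.0 if propagated else k * size


def total_potential_communication_cost(sizes, k=1.0):
    """PCC_tot(t): sum over U(t); sizes holds the file sizes of unpropagated updates."""
    return sum(potential_communication_cost(s, False, k) for s in sizes)


# Two hypothetical unpropagated files of 2048 and 512 bytes, with k = 0.001 per byte.
pcc_tot = total_potential_communication_cost([2048, 512], k=0.001)
```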
The Cost Function
Given that an update u is unpropagated at time t, the cost function for that update at time t is given by:
$$Cost(u,t) = CC(u,t) + OC(u,t)$$
$$Cost_{tot}(t) = OC_{tot}(t) + CC_{tot}(t)$$
FreshFlow Algorithm
Whenever OC_tot(t) equals PCC_tot(t), the web server informs the search engine about all the unpropagated updates (a meta-update propagation).
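The trigger rule can be sketched as a small discrete-time simulation. This is an illustration under stated assumptions rather than the authors' implementation. Time advances in integer ticks, each update touches a distinct file whose size stays constant, k is a constant cost per unit size, and the rule fires on `>=` because discrete ticks can step over exact equality. The function name and sample data are hypothetical.

```python
def freshflow_propagation_times(updates, horizon, k=1.0):
    """Return the ticks at which a meta-update propagation is triggered.

    updates: list of (time, weight, size) tuples, one per file modification.
    horizon: last tick to simulate (ticks are the integers 0..horizon).
    """
    pending = []    # unpropagated updates U(t), as (time, weight, size)
    triggered = []
    by_time = sorted(updates)
    i = 0
    for t in range(horizon + 1):
        # Admit updates flushed to disk at or before tick t.
        while i < len(by_time) and by_time[i][0] <= t:
            pending.append(by_time[i])
            i += 1
        if not pending:
            continue
        oc_tot = sum(w * (t - t0) for t0, w, _ in pending)
        pcc_tot = sum(k * size for _, _, size in pending)
        # Assumption: trigger on >= since equality may fall between ticks.
        if oc_tot >= pcc_tot:
            triggered.append(t)
            pending.clear()  # everything in U(t) has now been announced
    return triggered


# Two hypothetical updates, at ticks 0 and 2, each with weight 0.5 and size 4.
times = freshflow_propagation_times([(0, 0.5, 4), (2, 0.5, 4)], horizon=20)
```

In this run the second update delays the announcement: it adds to the potential communication cost immediately but contributes no opportunity cost until it starts to age.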
Analysis
The cost of the FreshFlow algorithm (called FF) is compared with the cost of an optimal off-line algorithm (called ADV)
Analysis (2)
Lemma (1): OC(u,t) is monotonically non-decreasing;
Lemma (2): consider an update u to a file f that FF transmits but ADV does not. Then $OC_{ADV}(u,t) \ge OC_{FF}(u,t)$;
Lemma (3): if the update is transmitted by the adversary (ADV), then $CC_{ADV}(u,t) \ge CC_{FF}(u,t)$.
Theorem
FF is 2-competitive:
$$Cost_{FF}(u,t) \le 2 \times Cost_{ADV}(u,t)$$
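For intuition only, the bound can be checked in the simplest case. This is the classic rent-or-buy argument applied to a single update, not the authors' proof. Assume one update u made at time $t_0$ to a file of constant size s and weight $\omega > 0$, with k a constant cost per unit size. FF propagates at the time $t^{*}$ where the opportunity cost equals the potential communication cost:

$$\omega\,(t^{*} - t_0) = k\,s \quad\Longrightarrow\quad Cost_{FF} = \underbrace{\omega\,(t^{*} - t_0)}_{OC} + \underbrace{k\,s}_{CC} = 2\,k\,s$$

Any other strategy has either propagated u at some point, paying at least $k\,s$ in communication, or has still not propagated it at a time $t \ge t^{*}$, paying $\omega\,(t - t_0) \ge k\,s$ in staleness:

$$Cost_{ADV} \ge \min\bigl(k\,s,\ \omega\,(t - t_0)\bigr) \ge k\,s \quad \text{for } t \ge t^{*}$$

so $Cost_{FF} \le 2 \times Cost_{ADV}$ in this simplified setting. Before $t^{*}$ neither strategy gains by propagating, and the two costs coincide.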
Practical issues
There are multiple search engines:
– Synchronization effect: pushing the updates would put pressure on the last-hop link to the web server;
– Search engine load: some search engines might refuse to receive updates.
The middleman approach
Each web server contacts only one middleman for sending its updates;
There could be a group of middlemen.
Benefits
The middleman can solve some additional issues:
– Verifying trustworthiness of web servers;
– Restricting the rate at which updates get transmitted to search engines;
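A minimal sketch of such a middleman is shown below. It is purely illustrative. The slides do not describe an implementation, so the `Middleman` class, the allow-list used as a stand-in for trust verification, the per-tick forwarding limit and the host names are all assumptions.

```python
from collections import deque


class Middleman:
    """Hypothetical relay: accepts update notices from trusted servers only
    and forwards them to a search engine at a bounded rate."""

    def __init__(self, trusted_servers, max_per_tick):
        self.trusted = set(trusted_servers)  # assumption: trust is an allow-list
        self.max_per_tick = max_per_tick
        self.queue = deque()

    def receive(self, server, url):
        """Queue a notice; return False if the sender is not trusted."""
        if server not in self.trusted:
            return False
        self.queue.append((server, url))
        return True

    def forward(self):
        """Release at most max_per_tick queued notices, oldest first."""
        count = min(self.max_per_tick, len(self.queue))
        return [self.queue.popleft() for _ in range(count)]


relay = Middleman(trusted_servers={"www.example.org"}, max_per_tick=2)
for path in ("/a", "/b", "/c"):
    relay.receive("www.example.org", path)
rejected = relay.receive("unknown.example.net", "/spam")
first_batch = relay.forward()
second_batch = relay.forward()
```

Queuing at the middleman means a burst of notices from many servers reaches the search engine as a steady stream, which is the rate restriction the slide refers to.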
Limitations
The algorithm has not been used in practice;
The search engines need the cooperation of the web servers to keep track of updates to their URLs. Whether web servers will incorporate such a service remains to be seen.
Conclusions
The FreshFlow algorithm is a solution that improves the updating of search engine data while maintaining a high level of efficiency and performance;
The authors are planning to implement the algorithm in a real system (and to publish the results in a future paper).