Internet Search Engine freshness by Web Server help Presented by: Barilari Alessandro


Post on 05-Jan-2016



Page 1: Internet Search Engine freshness by Web Server help

Internet Search Engine freshness by Web Server help

Presented by:

Barilari Alessandro

Page 2: Internet Search Engine freshness by Web Server help

Mining di Dati Web - Alessandro Barilari

Introduction

Search engines are an important source of information and keeping them up-to-date will result in more accurate answers to search queries.

Search engines create their databases by probing web servers on a per-URL basis, with little help from the web servers.

Page 3: Internet Search Engine freshness by Web Server help


Main Problem

There is no standard for pushing updates from web servers to search engines:
– It takes up to six months for a new page to be indexed by popular web search engines;
– The data indexed by the search engines is often stale.

Page 4: Internet Search Engine freshness by Web Server help


Solution…

Having web servers help keep search engines fresh results in a favorable situation for web sites, search engines and users.

Page 5: Internet Search Engine freshness by Web Server help


…and its problems

The number of updates per second is very large.

A balance must be struck between:
– The number of interactions between web sites and search engines, and
– The freshness of the search engines.

Page 6: Internet Search Engine freshness by Web Server help


Page rank impact

Pages which are popular will have higher page ranks:
– Use popularity, in addition to age and freshness, to compute the mismatch between a web site and a search engine.

Page 7: Internet Search Engine freshness by Web Server help


Summary

– Definitions and Cost Model
– Algorithm
– Analysis
– Practical issues

Page 8: Internet Search Engine freshness by Web Server help


Some definitions

Update: an update u to a file f is a modification to f that has been flushed to disk;

Propagation of an update: an update is said to be propagated once the web site has informed the search engine (SE) about it. The SE may or may not then retrieve that update;

Meta-update propagation: let U(t) be the set of unpropagated updates at time t. In a meta-update propagation, the web site informs the search engine about all the updates in U(t).

Page 9: Internet Search Engine freshness by Web Server help


Some definitions (2)

Weight of a file: given a content file f, its weight $\omega_f$ (non-negative) denotes the importance of the file; over the set F of content files, the weights are chosen such that:

$$\sum_{f \in F} \omega_f = 1$$

Last_modification_time(u,t): the last time before t when the file f(u) was updated.

Page 10: Internet Search Engine freshness by Web Server help


The Cost Model

Components:
– Communication cost;
– Opportunity cost: represents the staleness of the search engine data compared with the data on the web server.

CPU cost is ignored.

Page 11: Internet Search Engine freshness by Web Server help


Opportunity cost (OC)

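Given an unpropagated update u to a content file f, the opportunity cost for update u at time t is:

$$OC(u,t) = \omega_{f(u)} \times \bigl(t - \mathrm{last\_modification\_time}(u,t)\bigr)$$

where $\omega_{f(u)}$ is the weight of the file f(u).

Definition for meta-update propagation:

$$OC_{tot}(t) = \sum_{u \in U(t)} OC(u,t)$$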
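The two opportunity-cost formulas can be illustrated with a minimal Python sketch. The `Update` class, its field names and the numeric values are assumptions made for this illustration only; they do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class Update:
    """Hypothetical record for one unpropagated update u."""
    weight: float         # weight of the file f(u); weights over all files sum to 1
    last_modified: float  # last_modification_time(u, t), in arbitrary time units


def opportunity_cost(u: Update, t: float) -> float:
    """OC(u, t) = weight of f(u) * (t - last_modification_time(u, t))."""
    return u.weight * (t - u.last_modified)


def total_opportunity_cost(unpropagated: list[Update], t: float) -> float:
    """OC_tot(t): sum of OC(u, t) over the unpropagated updates U(t)."""
    return sum(opportunity_cost(u, t) for u in unpropagated)


# Two pending updates, evaluated at time t = 50.
pending = [
    Update(weight=0.5, last_modified=10.0),
    Update(weight=0.25, last_modified=30.0),
]
oc_tot = total_opportunity_cost(pending, t=50.0)
print(oc_tot)  # 0.5 * 40 + 0.25 * 20 = 25.0
```

The cost grows linearly with the time an update stays unpropagated, scaled by how important its file is.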

Page 12: Internet Search Engine freshness by Web Server help


Communication cost (CC)

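$size_{f(u)}(t)$: the size of file f(u) at time t;

$$CC(u,t) = k \times size_{f(u)}(t) \times \delta(u,t)$$

$$\delta(u,t) = \begin{cases} 1, & \text{if } u \text{ has been propagated before } t, \\ 0, & \text{if } u \text{ has not been propagated before } t. \end{cases}$$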

Page 13: Internet Search Engine freshness by Web Server help


Potential Communication cost (PCC)

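Represents the communication cost that would be incurred if update u were propagated after time t:

$$PCC(u,t) = k \times size_{f(u)}(t) \times \bigl(1 - \delta(u,t)\bigr)$$

where $\delta(u,t)$ is 1 if u has been propagated before t and 0 otherwise.

$$PCC_{tot}(t) = \sum_{u \in U(t)} PCC(u,t)$$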
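As a small illustration of how the communication cost and the potential communication cost complement each other, here is a hedged Python sketch. The `Update` class and its field names are hypothetical, and treating `k` as a cost per byte is an assumption; the slides do not say what the constant k stands for.

```python
from dataclasses import dataclass


@dataclass
class Update:
    """Hypothetical record for one update u."""
    size_bytes: int   # size of the file f(u) at time t
    propagated: bool  # True if u has been propagated before t (indicator = 1)


def communication_cost(u: Update, k: float) -> float:
    """CC(u, t): k * size if u has already been propagated, else 0."""
    indicator = 1 if u.propagated else 0
    return k * u.size_bytes * indicator


def potential_communication_cost(u: Update, k: float) -> float:
    """PCC(u, t): k * size if u is still unpropagated, else 0."""
    indicator = 1 if u.propagated else 0
    return k * u.size_bytes * (1 - indicator)


def total_potential_communication_cost(updates: list[Update], k: float) -> float:
    """PCC_tot(t): sum of PCC(u, t) over the unpropagated updates U(t)."""
    return sum(potential_communication_cost(u, k) for u in updates if not u.propagated)


K = 0.5  # assumed cost per byte
updates = [Update(2000, False), Update(500, True), Update(1000, False)]
pcc_tot = total_potential_communication_cost(updates, K)
print(pcc_tot)  # 0.5 * (2000 + 1000) = 1500.0
```

For any single update exactly one of CC and PCC is non-zero, depending on whether the update has already been propagated.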

Page 14: Internet Search Engine freshness by Web Server help


The Cost Function

Given that an update u is unpropagated at time t, the cost function for that update at time t is given by:

$$Cost(u,t) = CC(u,t) + OC(u,t)$$

$$Cost_{tot}(t) = OC_{tot}(t) + CC_{tot}(t)$$

Page 15: Internet Search Engine freshness by Web Server help


Summary

– Definitions and Cost Model
– Algorithm
– Analysis
– Practical issues

Page 16: Internet Search Engine freshness by Web Server help


FreshFlow Algorithm

When OC_tot equals PCC_tot at any time t, the web server can inform the search engine about all the unpropagated updates.
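The trigger rule can be sketched as a small discrete-time simulation in Python. This is only an illustration under stated assumptions: time advances in integer steps, so the check uses `>=` rather than exact equality; `k` is treated as a cost per byte; and the names `PendingUpdate`, `should_propagate` and `simulate` are hypothetical, not taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class PendingUpdate:
    """Hypothetical record for one unpropagated update."""
    weight: float         # weight of the updated file
    size_bytes: int       # size of the updated file
    last_modified: float  # time of the update


def should_propagate(pending: list[PendingUpdate], t: float, k: float) -> bool:
    """FreshFlow-style trigger: OC_tot(t) has reached PCC_tot(t)."""
    if not pending:
        return False
    oc_tot = sum(u.weight * (t - u.last_modified) for u in pending)
    pcc_tot = sum(k * u.size_bytes for u in pending)
    return oc_tot >= pcc_tot  # '>=' because this simulation uses discrete steps


def simulate(arrivals: dict[int, list[PendingUpdate]], k: float, horizon: int) -> list[int]:
    """Return the time steps at which a meta-update propagation is sent."""
    pending: list[PendingUpdate] = []
    sent_at: list[int] = []
    for t in range(horizon + 1):
        pending.extend(arrivals.get(t, []))
        if should_propagate(pending, t, k):
            sent_at.append(t)
            pending.clear()  # all unpropagated updates are announced together
    return sent_at


arrivals = {
    0: [PendingUpdate(weight=0.5, size_bytes=10, last_modified=0)],
    25: [PendingUpdate(weight=0.5, size_bytes=4, last_modified=25)],
}
sent = simulate(arrivals, k=1.0, horizon=40)
print(sent)  # [20, 33]
```

In this run the first update is announced at t = 20, when its opportunity cost 0.5 × 20 reaches the potential communication cost 1.0 × 10; the second is announced at t = 33 for the same reason (0.5 × 8 = 1.0 × 4).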

Page 17: Internet Search Engine freshness by Web Server help


Summary

– Definitions and Cost Model
– Algorithm
– Analysis
– Practical issues

Page 18: Internet Search Engine freshness by Web Server help


Analysis

The cost of the FreshFlow algorithm (called FF) is compared with the cost of an optimal off-line algorithm (called ADV)

Page 19: Internet Search Engine freshness by Web Server help


Analysis (2)

Lemma (1): OC(u,t) is monotonically non-decreasing in t;

Lemma (2): consider an update u to a file f that FF transmits but ADV does not. Then $OC_{ADV}(u,t) \ge OC_{FF}(u,t)$;

Lemma (3): if the update is transmitted by the adversary (ADV), then $CC_{ADV}(u,t) \ge CC_{FF}(u,t)$.

Page 20: Internet Search Engine freshness by Web Server help


Theorem

FF is 2-competitive:

$$Cost_{FF}(u,t) \le 2 \times Cost_{ADV}(u,t)$$
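A rough sketch of how Lemmas (2) and (3) might combine to give this bound. It rests on simplifying assumptions that are not stated on the slides: a single update u is pending, FF propagates it at the time $t^*$ at which its opportunity cost equals its potential communication cost, the file size does not change afterwards, and the opportunity cost stops accruing once the update has been propagated. Under these assumptions, for $t \ge t^*$:

$$
\begin{aligned}
OC_{FF}(u,t) &= CC_{FF}(u,t) = k \times size_{f(u)}(t^*) \\
Cost_{FF}(u,t) &= CC_{FF}(u,t) + OC_{FF}(u,t) = 2\,CC_{FF}(u,t) = 2\,OC_{FF}(u,t) \\
\text{ADV transmits:}\quad Cost_{ADV}(u,t) &\ge CC_{ADV}(u,t) \ge CC_{FF}(u,t) \;\Rightarrow\; Cost_{FF}(u,t) \le 2\,Cost_{ADV}(u,t) \\
\text{ADV does not transmit:}\quad Cost_{ADV}(u,t) &\ge OC_{ADV}(u,t) \ge OC_{FF}(u,t) \;\Rightarrow\; Cost_{FF}(u,t) \le 2\,Cost_{ADV}(u,t)
\end{aligned}
$$

In both cases FF pays at most twice what the off-line algorithm pays, which is the shape of argument familiar from other "rent or buy" style online problems.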

Page 21: Internet Search Engine freshness by Web Server help


Summary

– Definitions and Cost Model
– Algorithm
– Analysis
– Practical issues

Page 22: Internet Search Engine freshness by Web Server help


Practical issues

There are multiple search engines:
– Synchronization effect: pushing the updates would put pressure on the last-hop link to the web server;
– Search engine load: some search engines might deny the receipt of updates.

Page 23: Internet Search Engine freshness by Web Server help


The middleman approach

Each web server sends its updates to a single middleman;

The middleman could itself be a group of middlemen.

Page 24: Internet Search Engine freshness by Web Server help


Benefits

The middleman can solve some additional issues:
– Verifying the trustworthiness of web servers;
– Restricting the rate at which updates get transmitted to search engines.
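The two benefits above can be pictured with a toy Python sketch of a middleman. Everything here is an assumption made for illustration: the `Middleman` class, the allow-list used as a stand-in for trust verification, and the fixed per-tick cap used as a stand-in for rate limiting are not described on the slides.

```python
from collections import deque


class Middleman:
    """Hypothetical middleman between web servers and search engines."""

    def __init__(self, trusted_servers, engines, max_per_tick):
        self.trusted = set(trusted_servers)  # stand-in for trust verification
        self.engines = list(engines)         # search engines to forward to
        self.max_per_tick = max_per_tick     # cap on updates forwarded per tick
        self.queue = deque()

    def receive(self, server, url):
        """Accept an update notification only from a trusted web server."""
        if server not in self.trusted:
            return False
        self.queue.append((server, url))
        return True

    def tick(self):
        """Forward at most max_per_tick queued updates to every search engine."""
        count = min(self.max_per_tick, len(self.queue))
        batch = [self.queue.popleft() for _ in range(count)]
        return {engine: list(batch) for engine in self.engines}


m = Middleman({"a.example"}, ["se1", "se2"], max_per_tick=2)
accepted = [m.receive("a.example", f"/page{i}") for i in range(3)]
rejected = m.receive("unknown.example", "/spam")
first = m.tick()   # two updates forwarded to each engine
second = m.tick()  # the remaining update
```

Each web server talks to the middleman once, and the middleman decides how quickly the search engines see the resulting notifications.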

Page 25: Internet Search Engine freshness by Web Server help


Limitations

The algorithm has not been used in practice;

The search engines need the cooperation of the web servers to keep track of updates to their URLs. Whether web servers will incorporate such a service remains to be seen.

Page 26: Internet Search Engine freshness by Web Server help


Conclusions

The FreshFlow algorithm is a solution that improves how search engine data is kept up to date, while maintaining a high level of efficiency and performance;

The authors are planning to implement the algorithm in a real system (and have a future publication!)