Internet Search Engine freshness by Web Server help Presented by: Barilari Alessandro

Post on 28-Mar-2015

Internet Search Engine freshness by Web Server help

Presented by:

Barilari Alessandro

Mining di Dati Web (Web Data Mining), Alessandro Barilari

Introduction

Search engines are an important source of information and keeping them up-to-date will result in more accurate answers to search queries.

Search engines currently build their databases by probing web servers on a per-URL basis, with little help from the web servers themselves.

Main Problem

There is no standard for pushing updates from web servers to search engines:
– It takes up to six months for a new page to be indexed by popular web search engines;
– The data indexed by the search engines is often stale.

Solution…

Having web servers help search engines stay fresh benefits web sites, search engines, and users alike.

…and its problems

The number of updates per second is very large.

A balance must be struck between:
– The number of interactions between web sites and search engines, and
– The freshness of the search engines.

Page rank impact

Popular pages have higher page ranks:
– Use popularity, in addition to age and freshness, to compute the mismatch between a web site and a search engine.

Summary

– Definitions and Cost Model
– Algorithm
– Analysis
– Practical issues

Some definitions

Update: an update u to a file f is a modification to f that has been flushed to the disk;

Propagation of an update: an update is said to be propagated when the web site has informed the search engine about it. The search engine (SE) may or may not retrieve that update;

Meta-update propagation: at any time t, let U(t) be the set of unpropagated updates. In a meta-update propagation, the web site informs the search engine about all the updates in U(t).

Some definitions (2)

Weight of a file: given a content file f, its weight $w_f$ (non-negative) denotes the importance of the file; the weights are chosen such that, over the set F of all content files:

$$\sum_{f \in F} w_f = 1$$

Last_modification_time(u,t): the last time before t when the file f(u) was updated.

The Cost Model

Components:
– Communication cost;
– Opportunity cost: represents the staleness of the search engine data compared with the data on the web server.

CPU cost is ignored.

Opportunity cost (OC)

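Given an unpropagated update u to a content file f(u), the opportunity cost for update u at time t is:

$$OC(u,t) = w_{f(u)} \times \bigl(t - \mathrm{last\_modification\_time}(u,t)\bigr)$$

where $w_{f(u)}$ is the weight of the file f(u).

Definition for meta-update propagation:

$$\mathrm{OC\_tot}(t) = \sum_{u \in U(t)} OC(u,t)$$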
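The opportunity cost (file weight times the time elapsed since the last modification) and its total over the unpropagated updates can be sketched in Python as follows. This is a minimal illustration, not code from the presentation: the `Update` class and its field names are hypothetical, and the last modification time is simplified to a fixed value per update.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    """An unpropagated update u (hypothetical structure, for illustration only)."""
    file_weight: float    # weight of the file f(u), non-negative
    last_modified: float  # last_modification_time(u, t), assumed fixed here


def opportunity_cost(u: Update, t: float) -> float:
    """OC(u, t): weight of f(u) times the time elapsed since the last modification."""
    return u.file_weight * (t - u.last_modified)


def total_opportunity_cost(unpropagated, t: float) -> float:
    """OC_tot(t): sum of OC(u, t) over all unpropagated updates U(t)."""
    return sum(opportunity_cost(u, t) for u in unpropagated)


# Two pending updates, evaluated at t = 50:
# 0.75 * (50 - 10) + 0.25 * (50 - 30) = 30 + 5
updates = [Update(0.75, 10.0), Update(0.25, 30.0)]
print(total_opportunity_cost(updates, 50.0))  # 35.0
```

The longer an important file stays unpropagated, the faster its contribution to the total grows, which is what makes staleness expensive in this model.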

Communication cost (CC)

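$size_{f(u)}(t)$: the size of file f(u) at time t;

$$CC(u,t) = k \times size_{f(u)}(t) \times \delta(u,t)$$

$$\delta(u,t) = \begin{cases} 1, & \text{if } u \text{ has been propagated before } t, \\ 0, & \text{if } u \text{ has not been propagated before } t, \end{cases}$$

where k is a constant factor.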

Potential Communication cost (PCC)

Represents the communication cost which would need to be incurred in case update u were to be propagated after time t:

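$$PCC(u,t) = k \times size_{f(u)}(t) \times \bigl(1 - \delta(u,t)\bigr)$$

$$\mathrm{PCC\_tot}(t) = \sum_{u \in U(t)} PCC(u,t)$$

where $\delta(u,t)$ is 1 if u has been propagated before t and 0 otherwise, and k is a constant factor.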
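The communication cost and the potential communication cost are complementary: for any update, exactly one of them is non-zero, depending on whether the update has already been propagated. The Python sketch below illustrates this under stated assumptions: the function names are hypothetical, and the constant k is set to 1.0 purely for the example.

```python
K = 1.0  # constant factor k; the value 1.0 is an assumption for illustration


def communication_cost(size: float, propagated: bool, k: float = K) -> float:
    """CC(u, t): k * size if u has been propagated before t, else 0."""
    return k * size if propagated else 0.0


def potential_communication_cost(size: float, propagated: bool, k: float = K) -> float:
    """PCC(u, t): k * size if u has NOT been propagated before t, else 0."""
    return 0.0 if propagated else k * size


def total_potential_communication_cost(sizes, k: float = K) -> float:
    """PCC_tot(t): sum of PCC(u, t) over the sizes of all unpropagated updates."""
    return sum(potential_communication_cost(s, False, k) for s in sizes)


# An unpropagated 2048-byte file has no communication cost yet,
# only a potential one.
print(communication_cost(2048.0, False))            # 0.0
print(potential_communication_cost(2048.0, False))  # 2048.0
print(total_potential_communication_cost([2048.0, 512.0]))  # 2560.0
```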

The Cost Function

Given that an update u is unpropagated at time t, the cost function for that update at time t is given by:

$$Cost(u,t) = CC(u,t) + OC(u,t)$$

$$\mathrm{Cost\_tot}(t) = \mathrm{OC\_tot}(t) + \mathrm{CC\_tot}(t)$$

Summary

– Definitions and Cost Model
– Algorithm
– Analysis
– Practical issues

FreshFlow Algorithm

As soon as OC_tot(t) equals PCC_tot(t) at some time t, the web server informs the search engine about all the unpropagated updates U(t).
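The trigger rule can be illustrated with a small discrete-time simulation in Python. This is only a sketch under several assumptions that are not in the slides: time advances in integer ticks, the comparison uses `>=` because a discrete clock may step over the exact point of equality, k defaults to 1.0, each update touches a distinct file, and the names `PendingUpdate`, `should_propagate` and `run_freshflow` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PendingUpdate:
    """An unpropagated update (hypothetical structure, for illustration only)."""
    weight: float       # weight of the updated file
    size: float         # size of the updated file
    modified_at: float  # time of the update


def should_propagate(pending, t, k=1.0):
    """True once the total opportunity cost has caught up with the total PCC."""
    oc_tot = sum(u.weight * (t - u.modified_at) for u in pending)
    pcc_tot = sum(k * u.size for u in pending)
    return bool(pending) and oc_tot >= pcc_tot


def run_freshflow(events, horizon, k=1.0):
    """Simulate ticks 0..horizon; events are (time, weight, size) tuples.

    Returns the ticks at which a meta-update propagation is sent.
    """
    queue = sorted(events)
    pending, sent_at = [], []
    for t in range(horizon + 1):
        # Register every update that has happened up to tick t.
        while queue and queue[0][0] <= t:
            time, weight, size = queue.pop(0)
            pending.append(PendingUpdate(weight, size, time))
        if should_propagate(pending, t, k):
            sent_at.append(t)
            pending.clear()  # all updates in U(t) are now propagated
    return sent_at


# Two updates: at tick 7, OC_tot = 0.5*7 + 0.5*5 = 6 reaches PCC_tot = 4 + 2 = 6.
print(run_freshflow([(0, 0.5, 4.0), (2, 0.5, 2.0)], horizon=20))  # [7]
```

Large or unimportant files are held back longer, because their potential communication cost takes longer to be matched by accumulated staleness.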

Summary

– Definitions and Cost Model
– Algorithm
– Analysis
– Practical issues

Analysis

The cost of the FreshFlow algorithm (called FF) is compared with the cost of an optimal off-line algorithm (called ADV, the adversary).

Analysis (2)

Lemma (1): $OC(u,t)$ is monotonically non-decreasing in t;

Lemma (2): consider an update u to a file f, and suppose FF transmits it but ADV does not. Then $OC_{ADV}(u,t) \ge OC_{FF}(u,t)$;

Lemma (3): if the update is transmitted by the adversary (ADV), then $CC_{ADV}(u,t) \ge CC_{FF}(u,t)$.

Theorem

FF is 2-competitive:

$$Cost_{FF}(u,t) \le 2 \times Cost_{ADV}(u,t)$$

Summary

– Definitions and Cost Model
– Algorithm
– Analysis
– Practical issues

Practical issues

There are multiple search engines:
– Synchronization effect: pushing the updates to all of them would put pressure on the last-hop link to the web server;
– Search engine load: some search engines might refuse to receive updates.

The middleman approach

Each web server sends its updates to a single middleman;

There could be a group of middlemen rather than just one.

Benefits

The middleman can solve some additional issues:
– Verifying the trustworthiness of web servers;
– Restricting the rate at which updates get transmitted to search engines.
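Both benefits can be pictured with a toy middleman in Python. This is a hedged sketch only: the slides do not specify how trust is verified or how the rate is restricted, so the allow-list of trusted servers, the fixed per-round forwarding cap, and the `Middleman` class with its method names are all assumptions made for illustration.

```python
from collections import deque


class Middleman:
    """Toy middleman: filters untrusted servers and caps the forwarding rate."""

    def __init__(self, trusted_servers, max_per_round):
        self.trusted = set(trusted_servers)  # assumed allow-list of web servers
        self.max_per_round = max_per_round   # assumed cap per forwarding round
        self.queue = deque()

    def submit(self, server, url):
        """Accept an update notification only from a trusted web server."""
        if server not in self.trusted:
            return False
        self.queue.append((server, url))
        return True

    def forward_round(self):
        """Forward at most max_per_round queued notifications, oldest first."""
        batch = []
        while self.queue and len(batch) < self.max_per_round:
            batch.append(self.queue.popleft())
        return batch


middleman = Middleman({"a.example"}, max_per_round=2)
middleman.submit("a.example", "/index.html")
middleman.submit("unknown.example", "/spam.html")  # rejected: not trusted
middleman.submit("a.example", "/news.html")
middleman.submit("a.example", "/about.html")
print(len(middleman.forward_round()))  # 2
print(len(middleman.forward_round()))  # 1
```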

Limitations

– The algorithm has not been used in practice;
– The search engines need the cooperation of the web servers to keep track of updates to their URLs. Whether web servers will incorporate such a service remains to be seen.

Conclusions

The FreshFlow algorithm is a solution that improves how search engine data is kept up to date, while maintaining a high level of efficiency and performance;

The authors are planning to implement the algorithm in a real system (and to have a future publication!)