The lifespan, accessibility and archiving of dynamic documents.

K. Wegrzyn-Wolska

ESIGETEL

Ecole Supérieure d’Ingénieurs en Informatique et Génie des Télécommunications

OSWIR 2005 - Workshop on Open Source Web Information Retrieval

in association with the 2005 IEEE/WIC/ACM International Conferences on Web Intelligence

15 October 2004, Katarzyna Wegrzyn-Wolska: Colloque EBSI-ENSSIB 2004

Presentation

Introduction
The Lifespan and Age of Dynamic Documents
Dynamic Document Categories
  News published on the Web
  Weblog sites
  Search Engines
Archiving
Statistical Evaluation
Conclusions

Introduction

Some questions:
  Real documents?
  Temporary presentation of data?
  Created automatically?
  Created as a response to the users' questions?
  HTML page with some dynamic parts (layers, scripts, etc.)?
  Created on-line by the Web server?

Dynamic document categories

Created on individual demand from the user:
  results from Search Engines, responses to data forms, etc.
Created automatically by a specialised application:
  various News sites, forums, etc.
The two categories have different behaviour and characteristics.

The Lifespan and Age of Dynamic Documents

Questions:
  How to evaluate the lifespan of dynamic documents?
    The documents disappear from the computer's memory immediately after consultation.
  How to determine the age of dynamic documents?
    From the HTTP headers Last-Modified and Expires, or from the value fixed in the HTML file with the META tag.

Answer:
  The lifespan is the period during which the response to the same question doesn't change.
  This period is the lifespan visible to the user.
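
As an illustration of the header-based approach, here is a minimal Python sketch that derives a document's age and remaining freshness from the Last-Modified and Expires headers. The function name and the sample header values are assumptions made for this example; they do not come from the presentation.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def document_age_and_freshness(headers, now=None):
    """Return (age, remaining) as timedeltas.

    age       = now - Last-Modified
    remaining = Expires - now
    Either value is None when the corresponding header is absent.
    """
    now = now or datetime.now(timezone.utc)
    age = remaining = None
    if "Last-Modified" in headers:
        age = now - parsedate_to_datetime(headers["Last-Modified"])
    if "Expires" in headers:
        remaining = parsedate_to_datetime(headers["Expires"]) - now
    return age, remaining


# Hypothetical response headers, for illustration only.
sample = {
    "Last-Modified": "Fri, 15 Oct 2004 08:00:00 GMT",
    "Expires": "Fri, 15 Oct 2004 08:20:00 GMT",
}
now = datetime(2004, 10, 15, 8, 5, tzinfo=timezone.utc)
age, remaining = document_age_and_freshness(sample, now)
print(age, remaining)
```

Many dynamic pages omit one or both headers, which is why the sketch returns None rather than guessing a value.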

The Lifespan and Age of Dynamic Documents

News sites
  Various information:
    general sites, world news, local news, etc.
  Characteristics:
    automatically created, updated instantaneously, archived

The Lifespan and Age of Dynamic Documents

News site        | URL                             | Updating        | Archiving
French Google    | http://news.google.fr           | ~20 min         | 30 days
Google           | http://news.google.com          | ~20 min         | 30 days
Voilà News       | http://actu.voila.fr/           | 1 day           | 1 week
Voila            | http://actu.voila.fr/Depeche/   | (~30 min)       | 1 week
CNN              | http://www.cnn.com/             |                 |
Yahoo!News       | http://fr.news.yahoo.com/       | instantaneously | 1 week
TF1 news         | http://news.tf1.fr/news/        | instantaneously |
News now         | http://www.newsnow.co.uk/       | 5 min           |
Les Infos        | http://www.lesinfos.com/        |                 | Since 2000
CategoryNet      | http://www.categorynet.com      | Every day       | Never ending
CompanynewsGroup | http://www.companynewsgroup.com | 40 per day      | 2003 & 2004 archived; 1999-2003 in the future

The Lifespan and Age of Dynamic Documents

Weblogs:
  updated daily
  form: modified Web page
  varied information
General:
  dynamic pages
  regularly updated

The Lifespan and Age of Dynamic Documents

Search Engines:
  Response: dynamic pages created on-line
  Lifespan, accessibility: the period during which the Search Engine's answer doesn't change, which depends on the updating frequency of the index databases
  Example: Google: 4 weeks

Archiving

Question: how to store (archive) dynamic documents?

Solution:
  simple: printed version, saved by the user
  put into special caching and archiving systems: applications which try to save an up-to-date image of the Web
  example: Wayback Machine
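
To make the Wayback Machine example concrete, the following Python sketch builds the address under which the Internet Archive serves the archived copy of a page closest to a given date. The helper name is hypothetical and the date is chosen only for illustration; whether a capture actually exists for a given page and date is not something this sketch checks.

```python
from datetime import datetime


def wayback_url(page_url, when):
    """Build the Wayback Machine address for the capture closest to `when`."""
    stamp = when.strftime("%Y%m%d%H%M%S")  # 14-digit timestamp used in archive URLs
    return f"https://web.archive.org/web/{stamp}/{page_url}"


# Illustrative call: one of the news URLs listed earlier, arbitrary date.
print(wayback_url("http://news.google.fr", datetime(2004, 10, 15)))
```

The archive redirects such an address to the nearest capture it holds, so the date acts as a hint rather than an exact key.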

Archiving (Google News by Wayback Machine)

Archiving (ActuVoila by Wayback Machine)

Statistical evaluation

Index-database updating frequency:
  frequency analysis of the visits of the indexing robots used by Search Engines

Statistical tests (news and Weblog sites)
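
One way such tests could be carried out is to poll a page at regular intervals and record when its content changes. The Python sketch below computes lifespans from a list of timestamped snapshots. The presentation does not say how its measurements were taken, so the polling approach, the function name and the sample data are all assumptions.

```python
import hashlib


def lifespans(snapshots):
    """Return the duration in seconds of each period of unchanged content.

    `snapshots` is a list of (timestamp_seconds, page_content) pairs in
    time order. The last period is still open, so it is not counted.
    """
    spans = []
    start = None
    last_digest = None
    for ts, content in snapshots:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if last_digest is None:
            # First observation: open the first period.
            start, last_digest = ts, digest
        elif digest != last_digest:
            # Content changed: close the current period, open a new one.
            spans.append(ts - start)
            start, last_digest = ts, digest
    return spans


# Hypothetical snapshots taken every 60 seconds.
observed = [(0, "A"), (60, "A"), (120, "B"), (180, "B"), (240, "B"), (300, "C")]
print(lifespans(observed))  # [120, 180]
```

The measured lifespans can only be as precise as the polling interval, and the first period is an underestimate because the page may have been unchanged before polling began.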

Statistical evaluation

4 categories of different sites:
  Sportstrategies, the sport news service: a very regular News site, modified with a constant update time (every hour)
  BBC, broadcast on-line: the lifespan is very irregular because the information is updated instantaneously, when available
  TF1: updated frequently during the day, no modifications during the night
  Weblog (Slashdot.org): changes here are very quick, new articles are published very often and the current discussions continue incessantly; the lifespan of these dynamically changed pages is extremely short

Statistical evaluation

News of the 2004 Olympic Games in Athens broadcast by the site Sportstrategies

Statistical evaluation

BBC News

Statistical evaluation

TF1 News : 24/24

The statistical evaluation

TF1 News : working hours

The statistical evaluation

Weblog Slashdot

The statistical evaluation

Lifespan Comparison

Tested service           | Lifespan: mean | min    | max
Slashdot.org             | 77 sec         | 10 sec | 22 min
BBC.News                 | 8.5 min        | 1 min  | 66 min
TF1.news (24/24)         | 19.5 min       | 1 min  | 502 min
TF1.news (working hours) | 6.3 min        | 1 min  | 49 min
Sportstrategies          | 56 min         | 9 min  | 61 min
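
The mean, min and max columns are simple summary statistics over a list of measured lifespans. A minimal Python sketch of that summary follows; the input values are hypothetical and are not the measurements behind the table.

```python
from statistics import mean


def summarise(lifespans_sec):
    """Return the mean, minimum and maximum of lifespans given in seconds."""
    return {
        "mean": mean(lifespans_sec),
        "min": min(lifespans_sec),
        "max": max(lifespans_sec),
    }


# Hypothetical lifespans in seconds, for illustration only.
measured = [30, 60, 90, 300]
summary = summarise(measured)
print(summary)
```

A single long period, such as a night without updates, can dominate the mean, which is consistent with the gap between the two TF1 rows.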

Conclusion

Dynamic documents:
  don't exist in reality: they disappear from the computer's memory directly after consultation
  their real lifespan is very short

But...
  they can be accessible for a long time and can be stored by special archiving systems

Management of the archived dynamic documents:
  their lifespan is identical to that of static documents, because the dynamic documents are stored in the same way as static ones