The lifespan, accessibility and archiving of dynamic documents.
K. Wegrzyn-Wolska [email protected]
ESIGETEL
Ecole Supérieure d’Ingénieurs en Informatique et Génie des Télécommunications
OSWIR 2005 - Workshop on Open Source Web Information Retrieval
in association with the 2005 IEEE/WIC/ACM International Conferences on Web Intelligence
15 October 2004, Katarzyna Wegrzyn-Wolska: Colloque EBSI-ENSSIB 2004
Presentation
- Introduction
- The Lifespan and Age of Dynamic Documents
- Dynamic Document Categories
  - News published on the Web
  - Weblog sites
  - Search Engines
- Archiving
- Statistical Evaluation
- Conclusions
Introduction
Some questions about dynamic documents:
- Real documents?
- Temporary presentation of data?
- Created automatically?
- Created as a response to the users' questions?
- HTML pages with some dynamic parts (layers, scripts, etc.)?
- Created online by the Web server?
Dynamic document categories
- Created on individual demand from the user: search engine results, responses to data forms, etc.
- Created automatically by a specialised application: various news sites, forums, etc.
- The two categories have different behaviour and characteristics
The Lifespan and Age of Dynamic Documents
Questions:
- How to evaluate the lifespan of dynamic documents? They disappear from the computer's memory immediately after consultation.
- How to determine the age of dynamic documents? From the HTTP headers Last-Modified and Expires, or from the value fixed in the HTML file with the META tag.
Answer:
- The lifespan is the period during which the response to the same question doesn't change. This is the lifespan visible to the user.
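As an illustration of the header-based approach, the sketch below derives a document's age and remaining freshness from its `Last-Modified` and `Expires` headers. The helper name `age_and_freshness` and the sample header values are assumptions made for this example, not material from the talk.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def age_and_freshness(headers, now):
    """Return (age, time_left) as timedeltas, or None where a header is missing.

    age       = now - Last-Modified
    time_left = Expires - now (negative once the document has expired)
    """
    age = time_left = None
    if "Last-Modified" in headers:
        age = now - parsedate_to_datetime(headers["Last-Modified"])
    if "Expires" in headers:
        time_left = parsedate_to_datetime(headers["Expires"]) - now
    return age, time_left


# Illustrative header values, not measurements from the talk.
sample_headers = {
    "Last-Modified": "Fri, 15 Oct 2004 10:00:00 GMT",
    "Expires": "Fri, 15 Oct 2004 10:20:00 GMT",
}
now = datetime(2004, 10, 15, 10, 5, 0, tzinfo=timezone.utc)
age, time_left = age_and_freshness(sample_headers, now)
print(age, time_left)  # prints 0:05:00 0:15:00
```

Many dynamically generated pages omit one or both headers, which is why the function returns `None` for a missing value rather than guessing.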
The Lifespan and Age of Dynamic Documents
News sites
- Various information: general sites, world news, local news, etc.
- Characteristics: automatically created, updated instantaneously, archived
The Lifespan and Age of Dynamic Documents
News site         URL                              Updating         Archiving
French Google     http://news.google.fr            ~20 min          30 days
Google            http://news.google.com           ~20 min          30 days
Voilà News        http://actu.voila.fr/            1 day            1 week
Voila             http://actu.voila.fr/Depeche/    (~30 min)        1 week
CNN               http://www.cnn.com/              -                -
Yahoo!News        http://fr.news.yahoo.com/        instantaneously  1 week
TF1 news          http://news.tf1.fr/news/         instantaneously  -
News now          http://www.newsnow.co.uk/        5 min            -
Les Infos         http://www.lesinfos.com/         -                since 2000
CategoryNet       http://www.categorynet.com       every day        never ending
CompanynewsGroup  http://www.companynewsgroup.com  40 per day       2003 and 2004 archived; 1999-2003 in the future
The Lifespan and Age of Dynamic Documents
Weblogs:
- Updated daily
- Form: modified Web page
- Varied information
In general:
- Dynamic pages
- Regularly updated
The Lifespan and Age of Dynamic Documents
Search engines:
- Response: dynamic pages created online
- Lifespan and accessibility: the period during which the search engine's answer doesn't change, set by the updating frequency of the index databases
- Example: Google, 4 weeks
Archiving
Question: how to store (archive) dynamic documents?
Solutions:
- Simple: a printed version saved by the user
- Special caching and archiving systems: applications that try to save an up-to-date image of the Web (example: the Wayback Machine)
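As a small illustration of how an archived copy can be addressed, the sketch below builds a Wayback Machine address from a page URL and a date, using the publicly visible `web.archive.org/web/<timestamp>/<url>` address form. The helper name `wayback_url` is hypothetical, and the code only builds the address; it does not check that a capture exists.

```python
from datetime import datetime


def wayback_url(url, when):
    """Build a Wayback Machine address asking for the capture closest to `when`."""
    timestamp = when.strftime("%Y%m%d%H%M%S")  # 14-digit YYYYMMDDhhmmss form
    return f"https://web.archive.org/web/{timestamp}/{url}"


print(wayback_url("http://news.google.fr", datetime(2004, 10, 15)))
# prints https://web.archive.org/web/20041015000000/http://news.google.fr
```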
Archiving (Google News by the Wayback Machine)
Archiving (ActuVoila by the Wayback Machine)
Statistical evaluation
- Index-database updating frequency: frequency analysis of the visits made by the indexing robots that search engines use
- Statistical tests (news and Weblog sites)
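One possible way to carry out such a frequency analysis is to extract a robot's visits from a Web server access log and measure the gaps between them. The sketch below assumes log lines in the Common/Combined Log Format and a robot identified by a substring of its user-agent string; the function name and the log lines are made up for illustration and are not data from the talk.

```python
import re
from datetime import datetime

# Timestamp field of a log line, e.g. [15/Oct/2004:10:00:00 +0000]
TIMESTAMP = re.compile(r"\[([^\]]+)\]")


def robot_visit_intervals(log_lines, robot="Googlebot"):
    """Return the gaps, in seconds, between consecutive visits of one robot."""
    visits = []
    for line in log_lines:
        if robot not in line:
            continue  # request from another client
        match = TIMESTAMP.search(line)
        if match:
            visits.append(datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z"))
    visits.sort()
    return [(later - earlier).total_seconds() for earlier, later in zip(visits, visits[1:])]


# Made-up log lines for illustration only.
sample_log = [
    '66.249.0.1 - - [15/Oct/2004:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
    '10.0.0.7 - - [15/Oct/2004:10:07:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/4.0"',
    '66.249.0.1 - - [15/Oct/2004:10:30:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
    '66.249.0.1 - - [15/Oct/2004:11:15:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
]
intervals = robot_visit_intervals(sample_log)
print(intervals)  # prints [1800.0, 2700.0]
```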
Statistical evaluation
Four categories of sites:
- Sportstrategies, a sports news service: a very regular news site with a constant update time (every hour)
- BBC, broadcast online: the lifespan is very irregular because the information is updated instantaneously, when available
- TF1: updated frequently during the day, with no modifications during the night
- Weblog (Slashdot.org): the content changes very quickly; new articles are published very often and the current discussions continue incessantly, so the lifespan of these dynamically changing pages is extremely short
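The lifespan of such pages could be measured by polling a page at a fixed interval and recording when its content changes. The sketch below is a minimal version of that idea, assuming the polls are already available as a time-ordered list of (timestamp, content) pairs; the function name and the sample data are hypothetical and do not reproduce the measurement method used for the talk.

```python
import hashlib


def lifespans(observations):
    """Return the lifespan, in seconds, of each version that was seen to be replaced.

    `observations` is a time-ordered list of (timestamp_seconds, page_content).
    A version's lifespan runs from the first poll that saw it to the first
    poll that saw different content. The last version is still live, so it
    is not counted.
    """
    spans = []
    current_digest = None
    first_seen = None
    for timestamp, content in observations:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest != current_digest:
            if current_digest is not None:
                spans.append(timestamp - first_seen)
            current_digest = digest
            first_seen = timestamp
    return spans


# Hypothetical polls every 60 seconds; letters stand in for page content.
polls = [(0, "A"), (60, "A"), (120, "B"), (180, "B"), (240, "B"), (300, "C")]
print(lifespans(polls))  # prints [120, 180]
```

The resolution of the result is bounded by the polling interval: a page that changes twice between two polls is counted as one change.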
Statistical evaluation
News of the 2004 Olympic Games in Athens broadcast by the Sportstrategies site
Statistical evaluation
BBC News
Statistical evaluation
TF1 News: 24 hours a day (24/24)
Statistical evaluation
TF1 News: working hours
Statistical evaluation
Weblog Slashdot
Statistical evaluation
Lifespan comparison
Tested service            Mean lifespan  Min lifespan  Max lifespan
Slashdot.org              77 sec         10 sec        22 min
BBC.News                  8.5 min        1 min         66 min
TF1.news (24/24)          19.5 min       1 min         502 min
TF1.news (working hours)  6.3 min        1 min         49 min
Sportstrategies           56 min         9 min         61 min
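Each row of the comparison reduces a series of observed lifespans to three figures. A minimal sketch of that reduction follows; the function name and the sample values are hypothetical and are not the measurements behind the table.

```python
from statistics import mean


def summarise(lifespans_sec):
    """Return (mean, min, max) for a list of lifespans given in seconds."""
    return mean(lifespans_sec), min(lifespans_sec), max(lifespans_sec)


# Hypothetical lifespans in seconds, for illustration only.
sample = [30, 60, 90]
print(summarise(sample))  # prints (60, 30, 90)
```

A large gap between the mean and the maximum, as in the TF1 (24/24) row, points to a few very long quiet periods rather than uniformly slow updates.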
Conclusion
Dynamic documents:
- Don't exist in reality: they disappear from the computer's memory directly after consultation
- Their real lifespan is very short
But:
- They can remain accessible for a long time and can be stored by special archiving systems
Management of archived dynamic documents:
- Their lifespan is identical to that of static documents, because archived dynamic documents are stored in the same way as static ones