Workshop AU
16.01.2020 netlab.dk
Workshop on Web Archiving
MODULE 1:
WEB ARCHIVING: Theory — and a Bit of Practice
Niels Brügger
Asger Harlung
Module 1: Web Archiving
• Introducing ourselves and NetLab
• Why archive the web
• The research process and research examples
• Project presentation round
• Three kinds of digital content
• WWW as technology, and data mining example
• What is web archiving?
• Methods of web archiving and web crawling
• Challenges for the web crawler
• Crawling — advantages/disadvantages
• Characteristics of the archived web
Introducing Ourselves and NetLab
Niels Brügger – Professor in Media and Internet Studies, Head of NetLab, and of the Centre for Internet Studies, specialising in internet research since 1997.
Asger Harlung – MA in ICT and learning, has previously worked with research in digital rhetoric, and supporting creativity development in learning processes.
NetLab Services
NetLab’s services are free for members of the DIGHUMLAB communities (KB, and the humanities faculties at AU, AAU, KU, SDU).
We offer different types of support, depending on the needs of the researcher.
Our focus is on the archived web, whether already archived or still to be archived.
Research project
The researcher can enter NetLab via several entry points, and can use one or more entries:
• Intro workshop: on demand, min. 6 participants, 3 modules
• PhD workshop: for PhD students, 1 ECTS, January and August
• Online course: own project, teacher, 6 assignments, 3 ECTS
• Ad hoc support: IT support (Ulrich), research support (Niels)
• Borrow an IT developer: applications May & Sep, 2-4 weeks
• NetLab Forum: open forum, researchers and web archive
• Tools & tutorials ... and much more
Why Archive the Web?
• To preserve the cultural heritage
• To preserve a stable research object
• To be able to document and illustrate a study
• Modern source references
• Documentation in general; legal claims
The Research Process
Close, middle and distant reading: dr.dk, FV11-15, entire .dk
Consider making a Research Data Management plan at: https://dmponline.deic.dk/
• data collection
• data cleaning
• selection/corpus creation
• analysis (computer supported)
• analysis (human supported)
• visualisation
• long term preservation
Legal challenges
NetLab projects with IT developer help
Let's have a look at the list of projects at:
http://www.netlab.dk/research/it-developer-projects/
Probing a Nation’s Web Domain: from Small Data to Big Data
The historical development of an entire national web: .dk 2005-2015
The project is a collaboration with Netarkivet.
[Figure: panels labelled 2006, 2009, 2012, 2015]
Probing a Nation’s Web Domain: from Small Data to Big Data
Gross list of 'probes':
• Size — e.g. bytes
• Space — e.g. geolocalisation
• Structure — e.g. network of hyperlinks
• Liveliness — e.g. domain names and updating
• Content — e.g. degrees of openness, files, software types,
language, website textual elements, semantics
Project Presentation Round
• Time to present yourselves and your projects
• Notes go on a whiteboard, and may be drawn upon for the
remainder of the day.
• We expect to return to some of these examples in the
afternoon, during the final part of the workshop.
Three kinds of digital content
• Digitised: formerly analog media, transferred to a digital form.
• Born Digital: has not previously existed in any form other than digital.
• Reborn Digital: born-digital content which has been gathered and preserved, and to some extent has been changed in the process.
WWW as Technology
How is a web page like this created?
WWW as Technology
The WWW is one among several services on the internet. It builds on:
• HTTP: Hypertext Transfer Protocol
• URL: Uniform Resource Locator (a type of URI, Uniform Resource Identifier)
• HTML: HyperText Markup Language
Constructing a URL on the WWW:
protocol://subdomain.domain.topdomain/path/page/
http://cc.au.dk/research/researchprograms/
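The URL pattern above can also be taken apart programmatically. The following is a minimal sketch using Python's standard `urllib.parse`; the helper name `split_url` and the naive splitting of the host name on dots are assumptions made for illustration (hosts under suffixes such as `.co.uk` would need a public-suffix list).

```python
from urllib.parse import urlsplit

def split_url(url):
    """Split a URL into the parts named in the pattern above (hypothetical helper)."""
    parts = urlsplit(url)
    labels = parts.hostname.split(".")
    return {
        "protocol": parts.scheme,
        "subdomain": ".".join(labels[:-2]),  # naive: everything before the domain
        "domain": labels[-2],
        "topdomain": labels[-1],
        "path": parts.path,
    }

info = split_url("http://cc.au.dk/research/researchprograms/")
print(info)
```

Here `cc` is the subdomain, `au` the domain, `dk` the top-level domain, and `/research/researchprograms/` the path.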
Web pages = patched together in an 'empty' shell (stylesheet) of material from databases
• Network of computers: the user's computer requests a page over HTTP from a webserver identified by a URL, e.g. dr.dk
• The webserver works as a database or CMS (Content Management System) and delivers web pages = HTML files (heading, words, images)
• Material can also come from other webservers working as databases, e.g. weather from dmi.dk
• The browser (Safari, Firefox...) translates HTML into writing, pictures etc.
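The idea of an 'empty' shell filled with material from a database can be sketched in a few lines. This is only an illustration, not how dr.dk or any particular CMS works: the template `SHELL`, the dictionary `DATABASE` and the function `build_page` are all made up for the example.

```python
from string import Template

# Hypothetical 'empty shell': a page template with placeholders.
SHELL = Template(
    "<html><body><h1>$heading</h1><p>$words</p>"
    "<img src=\"$image\"></body></html>"
)

# Hypothetical stand-in for the webserver's database / CMS.
DATABASE = {
    "front": {"heading": "News", "words": "Today's stories", "image": "top.jpg"},
}

def build_page(page_id):
    """Fill the shell with the stored material for one page."""
    return SHELL.substitute(DATABASE[page_id])

html = build_page("front")
print(html)
```

The HTML file only comes into existence when the page is requested, which is one reason a crawler can only capture the version it happened to be served.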
Small Exercise: Source Code
Viewing the page source allows you to access the underlying HTML code for the entire web page. It can be used, for example, to search for HTML tags or file types, or to backtrack content from other pages.
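Searching page source for HTML tags and file types can also be done with a short script. The sketch below uses Python's standard `html.parser`; the class name `SourceScanner` and the sample `PAGE_SOURCE` are invented for illustration, and in practice the source would be copied from the browser's "view source" window.

```python
from collections import Counter
from html.parser import HTMLParser
from os.path import splitext
from urllib.parse import urlsplit

class SourceScanner(HTMLParser):
    """Count tags and the file types referenced via src/href attributes."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.file_types = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        for name, value in attrs:
            if name in ("src", "href") and value:
                # Take the extension of the path part, e.g. ".jpg"
                ext = splitext(urlsplit(value).path)[1].lower()
                if ext:
                    self.file_types[ext] += 1

# Made-up page source for illustration.
PAGE_SOURCE = """
<html><body>
<h1>Heading</h1>
<img src="/images/flag.jpg"><img src="http://example.org/map.png">
<a href="history.html">History</a>
</body></html>
"""

scanner = SourceScanner()
scanner.feed(PAGE_SOURCE)
print(scanner.tags["img"], dict(scanner.file_types))
```

An image whose `src` points to another host (here `example.org`) is an example of content that can be backtracked to another site.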
Data Mining Example
• A researcher wanted to track how Danish enclaves in the U.S.A. presented themselves.
• Text and images were important.
• The example is authentic. What is needed is:
1) Knowledge of "web inspection",
2) Taking a closer look at existing data, and
3) A bit of persistence :-)
What is Web Archiving?
The International Internet Preservation Consortium’s definition:
"… the process of gathering up data that has been published on the World Wide Web, storing it, ensuring the data is preserved in an archive, and making the collected data available for future research."
(https://web.archive.org/web/20170606072544/http://netpreserve.org/about-us) (Removed over the summer of 2017, this definition can itself only be retrieved from web archives.)
"Any form of deliberate and purposive collection and preservation of web material."
Brügger, Niels (2018): The Archived Web: Doing History in the Digital Age. MIT Press, p. 79
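The Internet Archive link above shows how an archived version is addressed: the URL carries a 14-digit capture timestamp (YYYYMMDDhhmmss) followed by the original URL. A minimal sketch for reading these two parts out of such a link follows; the function name `parse_wayback_url` is an assumption, and the pattern only covers plain 14-digit timestamps.

```python
import re
from datetime import datetime

# Matches web.archive.org links with a plain 14-digit timestamp.
WAYBACK = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")

def parse_wayback_url(url):
    """Return (capture time, original URL) for a Wayback Machine link."""
    match = WAYBACK.match(url)
    if match is None:
        raise ValueError("not a Wayback Machine URL with a 14-digit timestamp")
    captured = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return captured, match.group(2)

captured, original = parse_wayback_url(
    "https://web.archive.org/web/20170606072544/http://netpreserve.org/about-us"
)
print(captured.isoformat(), original)
```

For the definition quoted above, this gives a capture made on 6 June 2017 of `http://netpreserve.org/about-us`.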
What is Web Archiving?
Macro archiving
• Cultural heritage institutions
• Preserve as much as possible
• Big and varied data
• IT expertise, advanced technology, computer power
Micro archiving
• Individual researcher/research group
• Stabilise a concrete research object, here-and-now
• No experience, no advanced technology or computer power
Methods of Web Archiving
• Web crawling (hyperlink crawling)
• Screen image
• Screen filming
• Harvesting via API
• (Delivery from producers)
Web Crawling
[Diagram: domain.com shown as a tree of linked pages. A crawler follows the hyperlinks from page to page; the link levels are numbered 0, 1, 2, 3.]
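Crawling by link level can be sketched as a breadth-first walk over hyperlinks that stops at a maximum level. This is a toy illustration, not the crawler used by any archive: the site in `LINKS` is made up and held in memory, and the function `crawl` is hypothetical.

```python
from collections import deque

# Made-up site: each page maps to the pages it links to.
LINKS = {
    "domain.com/": ["domain.com/a", "domain.com/b"],
    "domain.com/a": ["domain.com/a1", "domain.com/b"],
    "domain.com/b": ["domain.com/b1"],
    "domain.com/a1": ["domain.com/a2"],
    "domain.com/b1": [],
    "domain.com/a2": ["domain.com/a3"],
    "domain.com/a3": [],
}

def crawl(seed, max_level):
    """Breadth-first crawl; returns {page: level at which it was found}."""
    levels = {seed: 0}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        if levels[page] == max_level:
            continue  # do not follow links beyond the level limit
        for target in LINKS.get(page, []):
            if target not in levels:
                levels[target] = levels[page] + 1
                queue.append(target)
    return levels

found = crawl("domain.com/", max_level=2)
print(found)
```

With `max_level=2`, the pages `domain.com/a2` and `domain.com/a3` are never reached: they sit too deep in the link structure.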
[Diagram: crawlers work through a list of URLs (URL, URL, URL …) and harvest the pages of domain.dk and domain.com; the harvest is recorded under a JOB ID.]
Web Crawling
[Diagram: 'By-Harvest'. One crawler harvests domainX.com … (JOB ID 11) and another harvests domainY.com … (JOB ID 12).]
Challenges for the crawler
• JavaScript
• Content based on Flash
• Interactive pages
• Streamed content
• Websites with access limitations (password, CAPTCHA)
• Cookies, ads, plugins etc.
• Robots.txt
• Deep web (e.g. databases, FTP servers, password-protected content, hidden content, pages not linked to, dynamic content based on requests).
http://da.wikipedia.org/wiki/CAPTCHA
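Of the challenges listed, robots.txt is the easiest to demonstrate: a site publishes rules, and a crawler that respects them skips the excluded paths. The sketch below uses Python's standard `urllib.robotparser` on a made-up rule set; the agent name `example-crawler` and the helper `may_crawl` are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def may_crawl(url, agent="example-crawler"):
    """True if the rules above allow this agent to fetch the URL."""
    return parser.can_fetch(agent, url)

print(may_crawl("http://domain.com/news/"))
print(may_crawl("http://domain.com/private/notes.html"))
```

Whether a given web archive follows robots.txt is a harvesting decision, so it is worth checking the archive's documentation.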
Pages not being crawled
[Diagram: a domain tree in which crawled pages are ticked (✔). Pages left out are labelled with the reason: not crawled because too deep, password protected, robots.txt, or script.]
Elements not crawled: Netarkivet
Elements not crawled: Internet Archive
Characteristics of the Archived Web
What is archived is not a 1:1 copy of the material one attempted to archive.
The archived web consists of versions/reconstructions:
• Created in the process of archiving
• On the basis of a number of choices made by the archiver (harvesting strategy, settings, etc.)
• The choices made have consequences for what is archived
• The archived objects are re-assembled in the archive at 'replay'
Characteristics of the Archived Web
The archived version is deficient because of:
• Technical challenges
• Web’s specific characteristics: dynamic, unpredictable
• Potential asynchronicity between updating and archiving
→ archiving takes time
→ certain elements cannot be archived
It is an added challenge that we do not know what is missing:
• Not much documentation
• No baseline to compare with
Characteristics of the Archived Web
As scholars using the archived web as an object of study, it is important that we are aware of the pitfalls and sources of error inherent in the material.
Characteristics of the Archived Web
The archived web consists of versions/reconstructions:
• The archived objects are re-assembled in the archive at 'replay'
Do not expect to find this... but rather this.
[Illustrations not reproduced. Thanks to Emily Maemura for these illustrations.]
In contrast to digitised collections, the archived web is to a large extent already marked up: HTML, file names...
[Diagram labels: online web, archiving, HTML + files, link list, named entities, ?]
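Because archived pages are already marked up, a link list can be read straight out of the HTML. A minimal sketch with Python's standard library follows; the class `LinkLister`, the sample `ARCHIVED_HTML` and the base URL are made up for illustration and do not reflect how any archive stores its files.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

class LinkLister(HTMLParser):
    """Collect the targets of <a href="..."> as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL
                self.links.append(urljoin(self.base_url, href))

# Made-up archived page for illustration.
ARCHIVED_HTML = '<p><a href="/om/">Om</a> <a href="http://dmi.dk/vejr/">Vejr</a></p>'

lister = LinkLister("http://dr.dk/nyheder/")
lister.feed(ARCHIVED_HTML)
external = [u for u in lister.links if urlsplit(u).hostname != "dr.dk"]
print(lister.links, external)
```

Separating internal from external links in this way is one possible starting point for studying the network of hyperlinks between sites.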