Peter Webster: Interrogating the archived UK web
DESCRIPTION
Digital History seminar, 4 November 2014. Live stream: http://ihrdighist.blogs.sas.ac.uk/2014/10/28/tuesday-4-november-interrogating-the-archived-uk-web-historians-and-social-scientists-research-experiences/
TRANSCRIPT
![Page 1: Peter webster interrogating the archived uk web](https://reader033.vdocuments.us/reader033/viewer/2022060121/559494ba1a28abbe3e8b4792/html5/thumbnails/1.jpg)
Interrogating the archived
UK web
Peter Webster (British Library)
@pj_webster / @UKWebArchive
webarchive.org.uk
![Page 2: Peter webster interrogating the archived uk web](https://reader033.vdocuments.us/reader033/viewer/2022060121/559494ba1a28abbe3e8b4792/html5/thumbnails/2.jpg)
www.bl.uk 2
Big UK Domain Data for the Arts &
Humanities
• Led by Dr Jane Winters (IHR) (@jfwinters)
• In partnership with the British Library and the Oxford Internet Institute
• With the help of Niels Brügger (Aarhus University, Denmark) @NielsBr
• Co-investigators: Ralph Schroeder, Eric Meyer (@etmeyer), Helen
Hockx-Yu (@hhockx)
• The team includes: Jonathan Blaney, @JoshCowls, @anjacks0n
• Funded by the AHRC, Jan 2014 – March 2015
• http://buddah.projects.history.ac.uk/
![Page 3: Peter webster interrogating the archived uk web](https://reader033.vdocuments.us/reader033/viewer/2022060121/559494ba1a28abbe3e8b4792/html5/thumbnails/3.jpg)
Project aims
• To highlight the value of web archives as a source for the arts and humanities (A&H), and to
transform the way in which researchers interact with the data
• To establish a theoretical and methodological framework for the analysis
of web archives, focusing on the JISC UK Web Domain Dataset
• To explore the ethical implications of big data research, particularly
as they relate to the web
• To inform collection development and access arrangements for the UK
web archive at the British Library
![Page 4: Peter webster interrogating the archived uk web](https://reader033.vdocuments.us/reader033/viewer/2022060121/559494ba1a28abbe3e8b4792/html5/thumbnails/4.jpg)
Project outputs
• a suite of tools to support analysis of web archives by A&H researchers
• an enhanced interface through which researchers access the archived
material held by the British Library
• a history of the development of UK web space from 1996 to 2013,
analysing technical, social, organisational and cultural developments
and trends in the dataset
• a series of case studies across a range of A&H disciplines
• two project workshops, bringing together researchers, archivists,
technologists, and digital preservation professionals
• a free online training module illustrating the use of web archives and the
application of big data techniques and methods.
![Page 5: Peter webster interrogating the archived uk web](https://reader033.vdocuments.us/reader033/viewer/2022060121/559494ba1a28abbe3e8b4792/html5/thumbnails/5.jpg)
Forthcoming event
Web archives as big data
Wednesday, 3 December 2014 from 09:45 to 17:30
IHR, Senate House, United Kingdom
Booking at: http://tinyurl.com/webarchives
![Page 6: Peter webster interrogating the archived uk web](https://reader033.vdocuments.us/reader033/viewer/2022060121/559494ba1a28abbe3e8b4792/html5/thumbnails/6.jpg)
A new class of primary source?
Deswarte and Webster,
“Web Archives: A New Class of Primary Source for
Historians?”
IHR Digital History seminar, 2013, reporting on predecessor
project (AADDA)
http://tinyurl.com/qca3yy5
![Page 7: Peter webster interrogating the archived uk web](https://reader033.vdocuments.us/reader033/viewer/2022060121/559494ba1a28abbe3e8b4792/html5/thumbnails/7.jpg)
webarchivehistorians.org (@HistWebArchives)
![Page 8: Peter webster interrogating the archived uk web](https://reader033.vdocuments.us/reader033/viewer/2022060121/559494ba1a28abbe3e8b4792/html5/thumbnails/8.jpg)
The UK Web Archive: three archives in one
Open UK Web Archive (2004-)
• c.14,000 sites
• Curated, selective, permission-based
• webarchive.org.uk
Legal Deposit UK Web Archive (2013-)
• legal framework
• c.4-5 million hosts per year
• onsite only
JISC UK Web Domain Dataset
![Page 9: Peter webster interrogating the archived uk web](https://reader033.vdocuments.us/reader033/viewer/2022060121/559494ba1a28abbe3e8b4792/html5/thumbnails/9.jpg)
JISC UK Web Domain Dataset 1996-2013
• Funded by JISC to create a research collection of UK
websites
• Collaboration between the Internet Archive, JISC and the
British Library
• A copy of the subset of the Internet Archive’s web collection that
relates to the UK
• c.300 million resources, 60TB in total
• No local access to the resources; access is possible through the Internet Archive
• Can be used to generate secondary datasets
![Page 10: Peter webster interrogating the archived uk web](https://reader033.vdocuments.us/reader033/viewer/2022060121/559494ba1a28abbe3e8b4792/html5/thumbnails/10.jpg)
Use cases (generalised)
• Full-text/facet search -> individual resource
• Full-text/facet search -> analysis/visualisation
• Search -> corpus creation -> annotation/curation
• Corpus creation -> full-text search -> individual resource
• Corpus -> search -> analysis/visualisation
• Derived datasets -> take-away
• Direct access to WARC -> take-away
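As a rough illustration of the "full-text/facet search -> analysis/visualisation" path, the sketch below filters a handful of records by a search term and optional facets, then counts hits per crawl year. The records, the field names (`url`, `year`, `format`, `text`) and the helper functions are invented for the example; they are not the project's interface or data.

```python
from collections import Counter

# Hypothetical records standing in for indexed resources (not real archive data).
RECORDS = [
    {"url": "http://a.co.uk/", "year": 1999, "format": "text/html",
     "text": "church news and events"},
    {"url": "http://b.org.uk/", "year": 2004, "format": "text/html",
     "text": "Church history society"},
    {"url": "http://c.co.uk/doc.pdf", "year": 2004, "format": "application/pdf",
     "text": "church accounts"},
    {"url": "http://d.ac.uk/", "year": 2010, "format": "text/html",
     "text": "football results"},
]


def search(records, term, **facets):
    """Return records whose text contains `term` and whose fields match every facet."""
    term = term.lower()
    return [
        r for r in records
        if term in r["text"].lower()
        and all(r.get(name) == value for name, value in facets.items())
    ]


def hits_per_year(hits):
    """Aggregate a result set into counts per crawl year, ready for plotting."""
    return dict(sorted(Counter(r["year"] for r in hits).items()))


html_hits = search(RECORDS, "church", format="text/html")
print(hits_per_year(html_hits))
```

The same result set could instead be saved as a corpus for annotation, which corresponds to the "search -> corpus creation" path in the list above.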
![Page 11: Peter webster interrogating the archived uk web](https://reader033.vdocuments.us/reader033/viewer/2022060121/559494ba1a28abbe3e8b4792/html5/thumbnails/11.jpg)
What do we know about each resource?
From the crawl data
• crawl date
• URL (/page.html, host.domain.co.uk, domain.co.uk, .co.uk)
• file format
• file size
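The four URL levels listed above (page path, host, registered domain, suffix) can be derived from a crawled URL roughly as follows. This is a minimal sketch: the `UK_SUFFIXES` tuple is a hand-picked assumption for illustration, and `url_levels` is a hypothetical helper; real work would more likely rely on the full Public Suffix List.

```python
from urllib.parse import urlsplit

# Assumption: a small illustrative set of UK suffixes, not a complete list.
UK_SUFFIXES = ("co.uk", "org.uk", "ac.uk", "gov.uk")


def url_levels(url):
    """Split a URL into the levels at which a resource can be aggregated."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Find the first known suffix that the host ends with, if any.
    suffix = next((s for s in UK_SUFFIXES if host.endswith("." + s)), None)
    if suffix is None:
        domain = None
    else:
        # The registered domain is the label immediately left of the suffix.
        label = host[: -len(suffix) - 1].split(".")[-1]
        domain = label + "." + suffix
    return {
        "path": parts.path or "/",
        "host": host,
        "domain": domain,
        "suffix": "." + suffix if suffix else None,
    }


print(url_levels("http://host.domain.co.uk/page.html"))
```

Grouping by `host`, `domain` or `suffix` then gives progressively coarser views of the same crawl data.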
![Page 12: Peter webster interrogating the archived uk web](https://reader033.vdocuments.us/reader033/viewer/2022060121/559494ba1a28abbe3e8b4792/html5/thumbnails/12.jpg)
What do we know about each resource?
From the full-text index
• page title
• link destinations (host.domain.co.uk, domain.co.uk, .co.uk)
• author (sometimes)
• language (sometimes)
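To show how link destinations might be pulled out of an archived page, the sketch below collects the host of every `<a href>` target using only the Python standard library. The class and function names are hypothetical, and this is not a description of how the British Library's index is actually built.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkHostParser(HTMLParser):
    """Collect the destination host of each <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.hosts = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page's own URL.
        host = urlsplit(urljoin(self.base_url, href)).hostname
        if host:  # skips mailto: and other links without a host
            self.hosts.append(host.lower())


def link_hosts(html, base_url):
    """Return the hosts linked to from `html`, in document order."""
    parser = LinkHostParser(base_url)
    parser.feed(html)
    return parser.hosts


page = ('<a href="http://www.bl.uk/x">BL</a>'
        '<a href="/local">local</a>'
        '<a href="mailto:x@y.org">mail</a>')
print(link_hosts(page, "http://example.co.uk/page.html"))
```

The resulting hosts can be reduced to domain or suffix level in the same way as the crawled URLs themselves.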
![Page 13: Peter webster interrogating the archived uk web](https://reader033.vdocuments.us/reader033/viewer/2022060121/559494ba1a28abbe3e8b4792/html5/thumbnails/13.jpg)
What *don’t* we know?
• subject
• geographic scope
• publisher
• date of publication
• date of last amendment
![Page 14: Peter webster interrogating the archived uk web](https://reader033.vdocuments.us/reader033/viewer/2022060121/559494ba1a28abbe3e8b4792/html5/thumbnails/14.jpg)
Thank you!
@pj_webster / @UKWebArchive
britishlibrary.typepad.co.uk/webarchive
webarchive.org.uk
peterwebster.me