Interrogating the archived UK web

Peter Webster (British Library)

@pj_webster / @UKWebArchive

webarchive.org.uk

www.bl.uk 2

Big UK Domain Data for the Arts & Humanities

• Led by Dr Jane Winters (IHR) (@jfwinters)

• In partnership with the British Library and the Oxford Internet Institute

• With the help of Niels Brügger (Aarhus University, Denmark) @NielsBr

• Co-investigators: Ralph Schroeder, Eric Meyer (@etmeyer), Helen Hockx-Yu (@hhockx)

• The team includes: Jonathan Blaney, @JoshCowls, @anjacks0n

• Funded by the AHRC, Jan 2014 – March 2015

• http://buddah.projects.history.ac.uk/

Project aims

• To highlight the value of web archives as a source for the arts and humanities (A&H), and to transform the way in which researchers interact with the data

• To establish a theoretical and methodological framework for the analysis of web archives, focusing on the JISC UK Web Domain Dataset

• To explore the ethical implications of big data research, particularly as they relate to the web

• To inform collection development and access arrangements for the UK web archive at the British Library

Project outputs

• a suite of tools to support analysis of web archives by A&H researchers

• an enhanced interface through which researchers access the archived material held by the British Library

• a history of the development of UK web space from 1996 to 2013, analysing technical, social, organisational and cultural developments and trends in the dataset

• a series of case studies across a range of A&H disciplines

• two project workshops, bringing together researchers, archivists, technologists, and digital preservation professionals

• a free online training module illustrating the use of web archives and the application of big data techniques and methods

Forthcoming event

Web archives as big data

Wednesday, 3 December 2014 from 09:45 to 17:30

IHR, Senate House, United Kingdom

Booking at: http://tinyurl.com/webarchives

A new class of primary source?

Deswarte and Webster, “Web Archives: A New Class of Primary Source for Historians?”

IHR Digital History seminar, 2013, reporting on the predecessor project (AADDA)

http://tinyurl.com/qca3yy5

webarchivehistorians.org (@HistWebArchives)

The UK Web Archive: three archives in one

Open UK Web Archive (2004-)

• c.14,000 sites

• Curated, selective, permission-based

• webarchive.org.uk

Legal Deposit UK Web Archive (2013-)

• legal framework

• c.4-5 million hosts per year

• onsite only

JISC UK Web Domain Dataset

JISC UK Web Domain Dataset 1996-2013

• Funded by JISC to create a research collection of UK websites

• Collaboration between the Internet Archive, JISC and the British Library

• A copy of the subset of the Internet Archive’s web collection that relates to the UK

• c.300 million resources, 60TB in total

• No local access to the archived pages; access is possible through the Internet Archive

• Can be used to generate secondary datasets

Use cases (generalised)

• Full-text/facet search -> individual resource

• Full-text/facet search -> analysis/visualisation

• Search -> corpus creation -> annotation/curation

• Corpus creation -> full-text search -> individual resource

• Corpus -> search -> analysis/visualisation

• Derived datasets -> take-away

• Direct access to WARC -> take-away

What do we know about each resource?

From the crawl data

• crawl date

• URL (/page.html, host.domain.co.uk, domain.co.uk, .co.uk)

• file format

• file size
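
The four URL levels listed above (page, host, domain, suffix) can be illustrated with a minimal Python sketch. It is not the project's own tooling: the function name `url_levels` and the short `UK_SUFFIXES` list are assumptions made for illustration, and the list is deliberately incomplete. A real analysis would more likely rely on the full Public Suffix List.

```python
from urllib.parse import urlsplit

# Hypothetical, deliberately incomplete list of UK second-level suffixes.
UK_SUFFIXES = ("co.uk", "org.uk", "ac.uk", "gov.uk", "nhs.uk",
               "ltd.uk", "plc.uk", "me.uk", "net.uk", "sch.uk")


def url_levels(url):
    """Split a URL into the page path, host, domain and suffix levels."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if not host:
        raise ValueError(f"no host found in {url!r}")

    # Use a known UK suffix if the host ends with one,
    # otherwise fall back to the last label (e.g. "com").
    suffix = next((s for s in UK_SUFFIXES if host.endswith("." + s)), None)
    if suffix is None:
        suffix = host.rsplit(".", 1)[-1]

    # Everything to the left of the suffix, e.g. "host.domain".
    prefix = host[: -(len(suffix) + 1)] if host != suffix else ""
    # The registered domain is the last label of the prefix plus the suffix.
    label = prefix.rsplit(".", 1)[-1] if prefix else ""
    domain = f"{label}.{suffix}" if label else suffix

    return {
        "path": parts.path or "/",
        "host": host,
        "domain": domain,
        "suffix": "." + suffix,
    }


levels = url_levels("http://host.domain.co.uk/page.html")
print(levels)
```

Grouping resources by any one of these levels gives a different unit of analysis: individual pages, hosts, registered domains, or whole sectors such as .co.uk and .ac.uk.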

What do we know about each resource?

From the full-text index

• page title

• link destinations (host.domain.co.uk, domain.co.uk, .co.uk)

• author (sometimes)

• language (sometimes)
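
Because link destinations are recorded at host level, one possible derived dataset is a count of host-to-host links per crawl year. The Python sketch below assumes a hypothetical record layout of `(crawl_year, source_host, target_host)` tuples. Neither that layout nor the function name `count_links_by_year` comes from the dataset itself.

```python
from collections import Counter


def count_links_by_year(records):
    """Count links between distinct hosts, keyed by (year, source, target).

    `records` is an iterable of (crawl_year, source_host, target_host)
    tuples; this layout is an assumption made for illustration.
    """
    counts = Counter()
    for year, source, target in records:
        # Ignore links that stay within the same host.
        if source != target:
            counts[(year, source, target)] += 1
    return counts


# Invented sample records, not real crawl data.
sample = [
    (2005, "a.co.uk", "b.co.uk"),
    (2005, "a.co.uk", "b.co.uk"),
    (2005, "a.co.uk", "a.co.uk"),
    (2006, "b.co.uk", "a.co.uk"),
]
link_counts = count_links_by_year(sample)
for key, n in sorted(link_counts.items()):
    print(key, n)
```

The same aggregation could be run at domain or suffix level by first reducing each host to the level of interest.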

What *don’t* we know?

• subject

• geographic scope

• publisher

• date of publication

• date of last amendment

Thank you!

[email protected]

@pj_webster / @UKWebArchive

britishlibrary.typepad.co.uk/webarchive

webarchive.org.uk

peterwebster.me
