peter webster interrogating the archived uk web

Interrogating the archived UK web. Peter Webster (British Library), @pj_webster / @UKWebArchive, webarchive.org.uk

Upload: historyspot

Post on 02-Jul-2015

DESCRIPTION

Digital History seminar 4 November 2014 Live Stream: http://ihrdighist.blogs.sas.ac.uk/2014/10/28/tuesday-4-november-interrogating-the-archived-uk-web-historians-and-social-scientists-research-experiences/

TRANSCRIPT

Page 1: Peter webster   interrogating the archived uk web

Interrogating the archived UK web

Peter Webster (British Library)

@pj_webster / @UKWebArchive

webarchive.org.uk

Page 2:

www.bl.uk 2

Big UK Domain Data for the Arts & Humanities

• Led by Dr Jane Winters (IHR) (@jfwinters)

• In partnership with the British Library and the Oxford Internet Institute

• With the help of Niels Brügger (Aarhus University, Denmark) @NielsBr

• Co-investigators: Ralph Schroeder, Eric Meyer (@etmeyer), Helen Hockx-Yu (@hhockx)

• The team includes: Jonathan Blaney, @JoshCowls, @anjacks0n

• Funded by the AHRC, Jan 2014 – March 2015

• http://buddah.projects.history.ac.uk/

Page 3:

Project aims

• To highlight the value of web archives as a source for A&H, and to transform the way in which researchers interact with the data

• To establish a theoretical and methodological framework for the analysis of web archives, focusing on the JISC UK Web Domain Dataset

• To explore the ethical implications of big data research, particularly as they relate to the web

• To inform collection development and access arrangements for the UK web archive at the British Library

Page 4:

Project outputs

• a suite of tools to support analysis of web archives by A&H researchers

• an enhanced interface through which researchers access the archived material held by the British Library

• a history of the development of UK web space from 1996 to 2013, analysing technical, social, organisational and cultural developments and trends in the dataset

• a series of case studies across a range of A&H disciplines

• two project workshops, bringing together researchers, archivists, technologists, and digital preservation professionals

• a free online training module illustrating the use of web archives and the application of big data techniques and methods

Page 5:

Forthcoming event

Web archives as big data

Wednesday, 3 December 2014 from 09:45 to 17:30

IHR, Senate House, United Kingdom

Booking at: http://tinyurl.com/webarchives

Page 6:

A new class of primary source?

Deswarte and Webster, “Web Archives: A New Class of Primary Source for Historians?”, IHR Digital History seminar, 2013, reporting on predecessor project (AADDA)

http://tinyurl.com/qca3yy5

Page 7:

webarchivehistorians.org (@HistWebArchives)

Page 8:

The UK Web Archive: three archives in one

Open UK Web Archive (2004-)

• c.14,000 sites

• Curated, selective, permission-based

• webarchive.org.uk

Legal Deposit UK Web Archive (2013-)

• legal framework

• c.4-5 million hosts per year

• onsite only

JISC UK Web Domain Dataset

Page 9:

JISC UK Web Domain Dataset 1996-2013

• Funded by JISC to create a research collection of UK websites

• Collaboration between the Internet Archive, JISC and the British Library

• Copy of the subset of the Internet Archive’s web collection that relates to the UK

• c.300 million resources, 60TB in total

• No local access; access is possible through the Internet Archive

• Can be used to generate secondary datasets

Page 10:

Use cases (generalised)

• Full-text/facet search -> individual resource

• Full-text/facet search -> analysis/visualisation

• Search -> corpus creation -> annotation/curation

• Corpus creation -> full-text search -> individual resource

• Corpus -> search -> analysis/visualisation

• Derived datasets -> take-away

• Direct access to WARC -> take-away
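
One of the generalised use cases above (corpus -> search -> analysis) might look roughly like the following Python sketch. It is only an illustration: the record fields (host, crawl_year, text), the sample records and the function names are all hypothetical, and a real workflow would run against a full-text index rather than an in-memory list.

```python
from collections import Counter

# Hypothetical records; the field names are assumptions for illustration,
# not the schema of the UK Web Archive or the JISC dataset.
RECORDS = [
    {"host": "www.example.co.uk", "crawl_year": 1999, "text": "Parish history society"},
    {"host": "news.example.org.uk", "crawl_year": 2005, "text": "Local history news"},
    {"host": "shop.example.co.uk", "crawl_year": 2005, "text": "History books for sale"},
    {"host": "www.example.co.uk", "crawl_year": 2010, "text": "Contact us"},
]


def build_corpus(records, host_suffix):
    """Corpus creation: keep resources whose host ends with the given suffix."""
    return [r for r in records if r["host"].endswith(host_suffix)]


def search(corpus, term):
    """Full-text search: case-insensitive substring match on the page text."""
    needle = term.lower()
    return [r for r in corpus if needle in r["text"].lower()]


def hits_per_year(hits):
    """Analysis: count matching resources per crawl year, sorted by year."""
    return dict(sorted(Counter(r["crawl_year"] for r in hits).items()))


corpus = build_corpus(RECORDS, ".co.uk")
print(hits_per_year(search(corpus, "history")))
```

The per-year counts returned by hits_per_year are the kind of small derived dataset that could then be plotted or taken away for further analysis.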

Page 11:

What do we know about each resource?

From the crawl data

• crawl date

• URL (/page.html, host.domain.co.uk, domain.co.uk, .co.uk)

• file format

• file size
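
The four URL levels listed above (path, host, domain, public suffix) can be derived from a crawled URL along these lines. This is a minimal sketch under stated assumptions: the function name url_levels is hypothetical, and the short UK_SUFFIXES tuple stands in for the full Public Suffix List that a real analysis would need.

```python
from urllib.parse import urlsplit

# Assumption: a tiny illustrative suffix list. A real analysis would use
# the full Public Suffix List instead of this hard-coded tuple.
UK_SUFFIXES = ("co.uk", "org.uk", "ac.uk", "gov.uk")


def url_levels(url):
    """Split a URL into path, host, registered domain and public suffix."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()

    # Find a known multi-label suffix; otherwise fall back to the last label.
    suffix = next((s for s in UK_SUFFIXES if host.endswith("." + s)), None)
    if suffix is None:
        suffix = host.rsplit(".", 1)[-1]

    # The registered domain is the suffix plus one label to its left.
    prefix = host[: -(len(suffix) + 1)] if host != suffix else ""
    domain = prefix.rsplit(".", 1)[-1] + "." + suffix if prefix else suffix

    return {
        "path": parts.path or "/",
        "host": host,
        "domain": domain,
        "suffix": "." + suffix,
    }


print(url_levels("http://www.example.co.uk/page.html"))
```

Grouping resources by any one of these levels gives a different unit of analysis, from a single page up to the whole .co.uk space.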

Page 12:

What do we know about each resource?

From the full-text index

• page title

• link destinations (host.domain.co.uk, domain.co.uk, .co.uk)

• author (sometimes)

• language (sometimes)
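
As a rough illustration of how these index fields could be used, the Python sketch below models a resource whose author and language are optional (matching the "sometimes" above) and counts link destinations at host level. The class name IndexedResource, its field names and the sample pages are hypothetical and are not the archive's actual index schema.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional
from urllib.parse import urlsplit


@dataclass
class IndexedResource:
    """Hypothetical view of one resource in a full-text index."""
    title: str
    link_destinations: List[str] = field(default_factory=list)
    author: Optional[str] = None      # only sometimes available
    language: Optional[str] = None    # only sometimes available


def link_host_counts(resources):
    """Count outgoing links per destination host across all resources."""
    counts = Counter()
    for res in resources:
        for url in res.link_destinations:
            host = urlsplit(url).hostname
            if host:  # skip links with no recognisable host
                counts[host] += 1
    return counts


# Made-up sample pages for illustration only.
pages = [
    IndexedResource("Page A", ["http://www.bl.uk/", "http://www.bl.uk/catalogues"]),
    IndexedResource("Page B", ["http://example.org.uk/about"], language="en"),
]
print(link_host_counts(pages).most_common())
```

Counting at host level is one choice; the same loop could aggregate at domain or suffix level instead.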

Page 13:

What *don’t* we know?

• subject

• geographic scope

• publisher

• date of publication

• date of last amendment

Page 14:

Thank you!

[email protected]

@pj_webster / @UKWebArchive

britishlibrary.typepad.co.uk/webarchive

webarchive.org.uk

peterwebster.me