Ancient History of the UK Web

Josh Cowls, Scott A. Hale, Helen Margetts, Eric T. Meyer, Ralph Schroeder, Taha Yasseri

With support by and thanks to Ning Wang and Adham Tamer

Upload: scott-a-hale

Post on 08-May-2015


DESCRIPTION

Slides for a presentation on recent work with Web Archives at the Oxford Internet Institute (http://www.oii.ox.ac.uk/) given at WIRE2014 (http://wp.comminfo.rutgers.edu/nsfia/schedule/)

TRANSCRIPT

Page 1: Ancient History of the UK Web

Ancient History of the UK Web

With support by and thanks to Ning Wang and Adham Tamer

Josh Cowls, Scott A. Hale, Helen Margetts, Eric T. Meyer, Ralph Schroeder, Taha Yasseri

Page 2: Ancient History of the UK Web

Past Web Archive Activities at OII

• 2008-2009. JISC/NEH Transatlantic Digitisation Collaboration: World Wide Web of Humanities (Jisc & NEH funded)

– OII, Internet Archive, Hanzo Archives

– Meyer, E.T., Carpenter, K., Middleton, M. (2009). World Wide Web of Humanities: Final Report to JISC. Online: http://www.jisc.ac.uk/media/documents/programmes/digitisation/humanitiesfinalreport.pdf

• 2010. Researcher Engagement with Web Archives (Jisc funded)

– OII, VKS

– Dougherty, M., Meyer, E.T., Madsen, C., van den Heuvel, C., Thomas, A., Wyatt, S. (2010). Researcher Engagement with Web Archives: State of the Art. London: JISC. Online: http://ssrn.com/abstract=1714997 and http://ie-repository.jisc.ac.uk/544/

– Thomas, A., Meyer, E.T., Dougherty, M., van den Heuvel, C., Madsen, C., Wyatt, S. (2010). Researcher Engagement with Web Archives: Challenges and Opportunities for Investment. London: JISC. Online: http://ssrn.com/abstract=1715000 and http://ie-repository.jisc.ac.uk/543/

– Dougherty, M., Meyer, E.T. (2014). Community, Tools, and Practices in Web Archiving: The state of the art in relation to social science and humanities research needs. Journal of the American Society of Information Science & Technology. http://onlinelibrary.wiley.com/doi/10.1002/asi.23099/abstract

• 2011. Using Web Archives: A Futures Perspective (IIPC funded)

– OII

– Meyer, E.T., Thomas, A.J., Schroeder, R. (2011). Web Archives: The Future(s). London: IIPC. Online: http://ssrn.com/abstract=1830025

Page 3: Ancient History of the UK Web

Recent Web Archive Activities at OII

• 2013-2015: Jisc Big Data project (Jisc funded)

– OII, British Library

– Prepare and release hyperlink corpus

• 2014-2015: Big UK Domain Data for the Arts and Humanities (AHRC funded)

– IHR, OII, British Library

– Supporting researchers in Arts & Humanities to use web archive data

– Producing edited book of empirical studies concerning the history of the UK web

• First paper from these combined projects

– Hale, S.A., Yasseri, T., Cowls, J., Meyer, E.T., Schroeder, R., Margetts, H. (2014, July). Mapping the UK webspace: Fifteen years of British universities on the web. ACM WebSci’14, Bloomington, Indiana. http://papers.ssrn.com/abstract=2435481 or http://arxiv.org/abs/1405.2856

Page 4: Ancient History of the UK Web

Big Data: Demonstrating the Value of the UK Web Domain Dataset for Social Science Research

This project aims to enhance JISC's UK Web Domain archive, a 30 TB archive of the .uk country-code top-level domain collected from 1996 to 2010. It will extract link graphs from the data and disseminate social science research using the collection.

February 2012 - February 2014

Page 5: Ancient History of the UK Web

Taming a mammoth: Web Archive Dataset Preparation

30 TB compressed data

6.2 TB metadata and links

2.5 TB temporal links

Page 6: Ancient History of the UK Web

30 TB compressed data in (w)arc format

– Approx. 4.5 million files

– Mix of binary and plain text payloads along with header data

– Two formats: old arc and newer warc

Housed at the BL, access restrictions

Page 7: Ancient History of the UK Web

WARC/1.0

WARC-Type: response

WARC-Target-URI: http://hits.guardian.co.uk/b/ss/guardiangu-blogs,guardiangu-news,guardiangu-network/1/H.22.2/56938?ns=guardian&pageName=Prisoner+of+war+camps+in+the+UK+mapped+and+listed.+Download+the+data%3AGraphic%3A1476560&ch=News&c3=GU.co.uk&c4=History+%28Books+genre%29%2CBooks%2CSecond+world+war+%28News%29%2CGermany%2CUK+news%2CTechnology&c5=Not+commercially+useful%2CCorporate+IT&c6=Simon+Rogers&c7=10-Nov-08&c8=1476560&c9=Graphic&c10=Blogpost&c11=News&c13=&c25=Datablog&c30=content&h2=GU%2FNews%2Fblog%2FDatablog&c2=GUID:(none)

WARC-Date: 2010-12-05T02:58:00Z

WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ

WARC-IP-Address: 66.235.138.18

WARC-Record-ID: <urn:uuid:7d5ce147-9b4b-46cb-8975-ee93b4d0dda8>

Content-Type: application/http; msgtype=response

Content-Length: 740

HTTP/1.1 302 Found

Date: Sun, 05 Dec 2010 02:58:00 GMT

Server: Omniture DC/2.0.0

X-C: ms-4.3.1

Expires: Sat, 04 Dec 2010 02:58:00 GMT

Last-Modified: Mon, 06 Dec 2010 02:58:00 GMT

Cache-Control: no-cache, no-store, must-revalidate, max-age=0, proxy-revalidate, no-transform, private

Pragma: no-cache

ETag: "4CFAFFB8-0E4C-7443902F"

Vary: *

P3P: policyref="/w3c/p3p.xml", CP="NOI DSP COR NID PSA OUR IND COM NAV STA"

Location: http://b.scorecardresearch.com/r?c2=6035250&d.c=gif&d.o=guardiangu-network&d.x=243551159&d.t=page&d.u=http%3A%2F%2Fwww.guardian.co.uk%2Fnews%2Fdatablog%2F2010%2Fnov%2F08%2Fprisoner-of-war-camps-uk

xserver: www422

Content-Length: 0

Keep-Alive: timeout=15

Connection: close

Content-Type: text/plain
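A record like the one above has a version line, a block of WARC headers, a blank line, and then the captured HTTP response. The following is a minimal sketch of reading those parts with the Python standard library only. The function names and the shortened sample record (with a made-up target URI) are assumptions for illustration, not the tooling used in the project.

```python
def parse_warc_headers(record_text):
    """Return (version_line, headers) for one WARC record held as text."""
    lines = record_text.splitlines()
    if not lines or not lines[0].startswith("WARC/"):
        raise ValueError("not a WARC record")
    headers = {}
    for line in lines[1:]:
        if line == "":
            break  # a blank line ends the WARC header block
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip()] = value.strip()
    return lines[0], headers


def http_status(record_text):
    """Return the HTTP status code of the captured response, or None."""
    text = record_text.replace("\r\n", "\n")
    _, sep, payload = text.partition("\n\n")
    if not sep:
        return None
    parts = payload.split("\n", 1)[0].split()
    if len(parts) >= 2 and parts[0].startswith("HTTP/") and parts[1].isdigit():
        return int(parts[1])
    return None


# Shortened sample record; the target URI is hypothetical.
SAMPLE = "\r\n".join([
    "WARC/1.0",
    "WARC-Type: response",
    "WARC-Target-URI: http://www.example.co.uk/page",
    "WARC-Date: 2010-12-05T02:58:00Z",
    "Content-Type: application/http; msgtype=response",
    "",
    "HTTP/1.1 302 Found",
    "Server: Omniture DC/2.0.0",
    "",
])

version, headers = parse_warc_headers(SAMPLE)
status = http_status(SAMPLE)
```

The status code read this way is what a later filtering step would need in order to drop error pages.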

Page 8: Ancient History of the UK Web

Extract metadata and links (WAT format)

– Approx. 4.5 million files

– 6.2 TB compressed on disk

– Housed at OII

– Structured JSON

– Different formats for arc and warc files

Page 9: Ancient History of the UK Web

{
  "Container": {
    "Filename": "DOTUK-HISTORICAL-1996-2010-GROUP-AA-XAAAAA-20110428000000-00000.arc.gz",
    "Offset": "88937",
    "Compressed": true,
    "Gzip-Metadata": {
      "Header-Length": "10",
      "Inflated-CRC": "-1223265901",
      "Inflated-Length": "26073",
      "Deflate-Length": "4463",
      "Footer-Length": "8"
    }
  },
  "Envelope": {
    "ARC-Header-Length": "102",
    "ARC-Header-Metadata": {
      "Date": "20080509081524",
      "Target-URI": "http://www.ukhomeinteriors.co.uk/content/ext_corbels.php",
      "Content-Length": "25970",
      "Content-Type": "text/html",
      "IP-Address": "83.223.106.10"
    },
    "Payload-Metadata": {
      "Actual-Content-Type": "application/http; msgtype=response",
      "Block-Digest": "sha1:MCCZNOKBJHTZ5MMMCUJGBPE25C2TVUWF",
      "HTTP-Response-Metadata": {
        "Headers-Length": "591",
        "HTML-Metadata": {
          "Head": {
            "Title": "Exterior Corbels",
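Because the metadata is structured JSON, individual fields can be pulled out with nested key lookups. Here is a minimal sketch using the key names visible in the excerpt above. The helper names are hypothetical, and the sample string is a cut-down copy of the excerpt with closing braces added, since the slide truncates the record.

```python
import json


def get_path(obj, *keys):
    """Follow a chain of dictionary keys, returning None if any is missing."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def summarize(record):
    """Pick out the capture date, target URI, and page title of one record."""
    return {
        "date": get_path(record, "Envelope", "ARC-Header-Metadata", "Date"),
        "uri": get_path(record, "Envelope", "ARC-Header-Metadata", "Target-URI"),
        "title": get_path(record, "Envelope", "Payload-Metadata",
                          "HTTP-Response-Metadata", "HTML-Metadata",
                          "Head", "Title"),
    }


# Cut-down record; the closing braces are an assumption (the slide is truncated).
SAMPLE = """
{"Envelope": {
   "ARC-Header-Metadata": {
     "Date": "20080509081524",
     "Target-URI": "http://www.ukhomeinteriors.co.uk/content/ext_corbels.php"},
   "Payload-Metadata": {
     "HTTP-Response-Metadata": {
       "HTML-Metadata": {"Head": {"Title": "Exterior Corbels"}}}}}}
"""

summary = summarize(json.loads(SAMPLE))
```

Returning None for missing keys, rather than raising, is a design choice that suits records whose layout differs between arc and warc sources.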

Page 10: Ancient History of the UK Web

Plain text lists

Build own ad hoc Hadoop cluster, fix incompatibilities, divide into smaller batches

– Build plain text lists of pages and hyperlinks

– Remove error pages (e.g., 404 Not Found)

– Remove pages not in .uk

– Standardize dates (many formats)

– Standardize hyperlinks (trailing /, etc.)

– Fix/remove tons of invalid hyperlinks (whitespace, invalid characters, etc.)

Load results into Apache Hive (2.5 TB)
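The cleaning steps listed above could look roughly like the following for a single hyperlink. This is a sketch, not the project's pipeline: the function names and the normalization rules are assumptions. It handles only two date formats, the 14-digit timestamps and ISO 8601 dates seen in the sample records, whereas the real data had many more.

```python
from datetime import datetime, timezone
from urllib.parse import urlsplit, urlunsplit


def to_unix_time(raw):
    """Convert a 14-digit or ISO 8601 UTC timestamp to Unix time, else None."""
    raw = raw.strip()
    for fmt in ("%Y%m%d%H%M%S", "%Y-%m-%dT%H:%M:%SZ"):
        try:
            parsed = datetime.strptime(raw, fmt)
        except ValueError:
            continue
        return int(parsed.replace(tzinfo=timezone.utc).timestamp())
    return None


def normalize_url(raw):
    """Lower-case scheme and host, ensure a path, drop fragments; None if invalid."""
    url = raw.strip()
    if not url or any(ch.isspace() for ch in url):
        return None  # embedded whitespace marks an invalid hyperlink
    try:
        parts = urlsplit(url)
        port = parts.port
    except ValueError:
        return None
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return None
    netloc = parts.hostname if port is None else f"{parts.hostname}:{port}"
    return urlunsplit((parts.scheme, netloc, parts.path or "/", parts.query, ""))


def in_uk(url):
    """True if the URL's host is under the .uk top-level domain."""
    host = urlsplit(url).hostname or ""
    return host.endswith(".uk")


def clean_link(source, destination, raw_date, status):
    """Return (source, destination, unix_time), or None if the link is dropped."""
    if status >= 400:
        return None  # error page, e.g. 404 Not Found
    src, dst = normalize_url(source), normalize_url(destination)
    when = to_unix_time(raw_date)
    if src is None or dst is None or when is None or not in_uk(src):
        return None
    return (src, dst, when)
```

Only the source page is required to be in .uk here, so links pointing outside the domain are kept.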

Page 11: Ancient History of the UK Web

Source | Destination | Time | LinkText

http://octopus.well.ox.ac.uk:80/ | http://octopus.well.ox.ac.uk:80/links.html | 1032758438 | Links

http://octopus.well.ox.ac.uk:80/ | http://octopus.well.ox.ac.uk:80/projects.html | 1001793436 | Projects

http://octopus.well.ox.ac.uk:80/computing.shtml | http://debian.org/ | 1075794060 | Debian/GNU

Page 12: Ancient History of the UK Web

Overall Statistics

Third-level domains, e.g. ox.ac.uk
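Grouping hosts by domain level comes down to keeping the last few dot-separated labels of the host name. The sketch below assumes the convention that ac.uk is a second-level domain and ox.ac.uk a third-level one. The function name is hypothetical, and the host list reuses names that appear elsewhere in these slides.

```python
from collections import Counter


def domain_at_level(host, level):
    """Return the last `level` labels of a host name, or None if it is shorter."""
    labels = host.lower().rstrip(".").split(".")
    if len(labels) < level:
        return None
    return ".".join(labels[-level:])


hosts = [
    "octopus.well.ox.ac.uk",
    "www.oii.ox.ac.uk",
    "www.guardian.co.uk",
]

# Count how many hosts fall under each third-level domain.
third_level_counts = Counter(domain_at_level(h, 3) for h in hosts)
```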

Page 13: Ancient History of the UK Web

Relative size of second-level-domains

Page 14: Ancient History of the UK Web

Number of links within SLD per node

Page 15: Ancient History of the UK Web

Cross-domain links (2010)

Panels: absolute; normalized to target size

Page 16: Ancient History of the UK Web

Case of ac.uk

Mapping the UK Webspace: Fifteen Years of British Universities on the Web

Hale et al., WebSci'14, available: http://arxiv.org/abs/1405.2856

121 UK universities' websites and links

1) League table ranking

2) Group affiliation

3) Geographical location

Page 17: Ancient History of the UK Web

Group Affiliations

Page 18: Ancient History of the UK Web

League table ranking

Page 19: Ancient History of the UK Web

Geography

Colour ~ intensity

Page 20: Ancient History of the UK Web

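Gravity Law

$$\sigma_{ij} = \frac{s_{ij}}{s_i^{\mathrm{out}} \, s_j^{\mathrm{in}}}$$

$$s_{ij} = \frac{s_i^{\mathrm{out}} \, s_j^{\mathrm{in}}}{r^{0.28}}$$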
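The two gravity-law expressions together imply that the normalized link strength falls off with distance as a power law with exponent 0.28. The sketch below shows how such an exponent could be estimated by a least-squares fit on log-log values. It assumes one reading of the symbols, which the slide does not define: s_ij is the link weight from i to j, s_i^out and s_j^in are the total outgoing and incoming weights, and r is the distance between the two institutions. The data are synthetic, and this is not necessarily the fitting procedure used in the paper.

```python
import math


def normalized_strength(s_ij, s_out_i, s_in_j):
    """Link weight divided by the product of source out- and target in-strength."""
    return s_ij / (s_out_i * s_in_j)


def fit_distance_exponent(distances, sigmas):
    """Estimate a in sigma ~ r**(-a) by least squares on log-log values."""
    xs = [math.log(r) for r in distances]
    ys = [math.log(s) for s in sigmas]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # negate the slope: strength decays with distance


# Synthetic example: weights generated exactly from the gravity law.
s_out, s_in = 200.0, 300.0
distances = [10.0, 50.0, 100.0, 400.0]
weights = [s_out * s_in / r ** 0.28 for r in distances]
sigmas = [normalized_strength(w, s_out, s_in) for w in weights]
exponent = fit_distance_exponent(distances, sigmas)
```

On real link data the points would scatter around the line, so the fitted value would be an estimate rather than an exact recovery.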

Page 21: Ancient History of the UK Web

Big UK Domain Data for the Arts and Humanities

Primary aim: developing a methodological and theoretical framework within which to study over 15 years of UK domain data – with lessons for the future study of web archives more generally

Page 22: Ancient History of the UK Web

Big UK Domain Data for the Arts and Humanities

The dataset:

– Crawled from 1996 to 2013

– Approximately 65 TB, billions of words

– Building interface to allow search by retrieval date, target domain of links, sentiment

– Allow qualitative and quantitative analysis – and iteration between multiple research techniques

Page 23: Ancient History of the UK Web

Big UK Domain Data for the Arts and Humanities

Key outputs:

– Ten bursary projects using web archive data to investigate a broad range of topics, for example…

• Armed services recruitment online

• The accessibility of the web for disabled users

• Online discussions of ‘Beat’ poetry

– An edited book of empirical studies concerning the history of the UK web, featuring chapters on, for example…

• Constitutional and institutional change in UK government

• The BBC’s online presence

• The ‘web of faith’ online

Page 24: Ancient History of the UK Web

Next

● Studies underway at OII, BL, IHR

● Book and articles

– Study overall growth of .uk

– Case study of .gov.uk

– Study of media and select committee visibility

● Releasing data open source