Ancient History of the UK Web
With support from and thanks to Ning Wang and Adham Tamer
Josh Cowls, Scott A. Hale, Helen Margetts, Eric T. Meyer, Ralph Schroeder, Taha Yasseri
Past Web Archive Activities at OII
• 2008-2009. JISC/NEH Transatlantic Digitisation Collaboration: World Wide Web of Humanities (Jisc & NEH funded)
– OII, Internet Archive, Hanzo Archives
– Meyer, E.T., Carpenter, K., Middleton, M. (2009). World Wide Web of Humanities: Final Report to JISC. Online: http://www.jisc.ac.uk/media/documents/programmes/digitisation/humanitiesfinalreport.pdf
• 2010. Researcher Engagement with Web Archives (Jisc funded)
– OII, VKS
– Dougherty, M., Meyer, E.T., Madsen, C., van den Heuvel, C., Thomas, A., Wyatt, S. (2010). Researcher Engagement with Web Archives: State of the Art. London: JISC. Online: http://ssrn.com/abstract=1714997 and http://ie-repository.jisc.ac.uk/544/
– Thomas, A., Meyer, E.T., Dougherty, M., van den Heuvel, C., Madsen, C., Wyatt, S. (2010). Researcher Engagement with Web Archives: Challenges and Opportunities for Investment. London: JISC. Online: http://ssrn.com/abstract=1715000 and http://ie-repository.jisc.ac.uk/543/
– Dougherty, M., Meyer, E.T. (2014). Community, Tools, and Practices in Web Archiving: The state of the art in relation to social science and humanities research needs. Journal of the American Society of Information Science & Technology. http://onlinelibrary.wiley.com/doi/10.1002/asi.23099/abstract
• 2011. Using Web Archives: A Futures Perspective (IIPC funded)
– OII
– Meyer, E.T., Thomas, A.J., Schroeder, R. (2011). Web Archives: The Future(s). London: IIPC. Online: http://ssrn.com/abstract=1830025
Recent Web Archive Activities at OII
• 2013-2015: Jisc Big Data project (Jisc funded)
– OII, British Library
– Prepare and release hyperlink corpus
• 2014-2015: Big UK Domain Data for the Arts and Humanities (AHRC funded)
– IHR, OII, British Library
– Supporting researchers in Arts & Humanities to use web archive data
– Producing edited book of empirical studies concerning the history of the UK web
• First paper from these combined projects
– Hale, S.A., Yasseri, T., Cowls, J., Meyer, E.T., Schroeder, R., Margetts, H. (2014, July). Mapping the UK webspace: Fifteen years of British universities on the web. ACM WebSci’14, Bloomington, Indiana. http://papers.ssrn.com/abstract=2435481 or http://arxiv.org/abs/1405.2856
Big Data: Demonstrating the Value of the UK Web Domain Dataset for Social Science Research
This project aims to enhance JISC's UK Web Domain archive, a 30 TB archive of the .uk country-code top-level domain collected from 1996 to 2010. It will extract link graphs from the data and disseminate social science research that uses the collection.
February 2012 - February 2014
Taming a mammoth: Web Archive Dataset Preparation
30 TB compressed data
6.2 TB metadata and links
2.5 TB temporal links
30 TB compressed data in (w)arc format
– Approx. 4.5 million files
– Mix of binary and plain text payloads along with header data
– Two formats: old arc and newer warc
Housed at the BL, access restrictions
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://hits.guardian.co.uk/b/ss/guardiangu-blogs,guardiangu-news,guardiangu-network/1/H.22.2/56938?ns=guardian&pageName=Prisoner+of+war+camps+in+the+UK+mapped+and+listed.+Download+the+data%3AGraphic%3A1476560&ch=News&c3=GU.co.uk&c4=History+%28Books+genre%29%2CBooks%2CSecond+world+war+%28News%29%2CGermany%2CUK+news%2CTechnology&c5=Not+commercially+useful%2CCorporate+IT&c6=Simon+Rogers&c7=10-Nov-08&c8=1476560&c9=Graphic&c10=Blogpost&c11=News&c13=&c25=Datablog&c30=content&h2=GU%2FNews%2Fblog%2FDatablog&c2=GUID:(none)
WARC-Date: 2010-12-05T02:58:00Z
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC-IP-Address: 66.235.138.18
WARC-Record-ID: <urn:uuid:7d5ce147-9b4b-46cb-8975-ee93b4d0dda8>
Content-Type: application/http; msgtype=response
Content-Length: 740
HTTP/1.1 302 Found
Date: Sun, 05 Dec 2010 02:58:00 GMT
Server: Omniture DC/2.0.0
X-C: ms-4.3.1
Expires: Sat, 04 Dec 2010 02:58:00 GMT
Last-Modified: Mon, 06 Dec 2010 02:58:00 GMT
Cache-Control: no-cache, no-store, must-revalidate, max-age=0, proxy-revalidate, no-transform, private
Pragma: no-cache
ETag: "4CFAFFB8-0E4C-7443902F"
Vary: *
P3P: policyref="/w3c/p3p.xml", CP="NOI DSP COR NID PSA OUR IND COM NAV STA"
Location: http://b.scorecardresearch.com/r?c2=6035250&d.c=gif&d.o=guardiangu-network&d.x=243551159&d.t=page&d.u=http%3A%2F%2Fwww.guardian.co.uk%2Fnews%2Fdatablog%2F2010%2Fnov%2F08%2Fprisoner-of-war-camps-uk
xserver: www422
Content-Length: 0
Keep-Alive: timeout=15
Connection: close
Content-Type: text/plain
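As a rough illustration of working with records like the one above, the sketch below parses the WARC header block (the lines before the first blank line) into a dictionary. It is a minimal sketch, not the project's actual tooling: the function name `parse_warc_headers` and the shortened `SAMPLE` string are assumptions made for this example, and a real pipeline would more likely rely on an existing WARC library.

```python
def parse_warc_headers(block: str) -> dict:
    """Parse the header block of one WARC record into a dict.

    Hypothetical helper for illustration only; it handles the simple
    'Name: value' layout shown on the slide, not folded header lines.
    """
    lines = block.strip().splitlines()
    version = lines[0].strip()
    if not version.startswith("WARC/"):
        raise ValueError(f"not a WARC record: {version!r}")
    headers = {"version": version}
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the WARC headers
        # Split on the first colon only, so values may contain colons.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


# Shortened copy of the record shown above (target URI omitted for brevity).
SAMPLE = """WARC/1.0
WARC-Type: response
WARC-Date: 2010-12-05T02:58:00Z
WARC-IP-Address: 66.235.138.18
Content-Type: application/http; msgtype=response
Content-Length: 740
"""

record = parse_warc_headers(SAMPLE)
print(record["WARC-Type"], record["WARC-Date"])
```

Header values are kept as strings here; converting `Content-Length` to an integer or `WARC-Date` to a timestamp is left to later steps.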
Extract metadata and links (WAT format)
– Approx. 4.5 million files
– 6.2 TB on disk compressed
– Housed at OII
– Structured JSON
– Different formats for ARC and WARC records
{ "Container": { "Filename": "DOTUK-HISTORICAL-1996-2010-GROUP-AA-XAAAAA-20110428000000-00000.arc.gz", "Offset": "88937", "Compressed": true, "Gzip-Metadata": { "Header-Length": "10", "Inflated-CRC": "-1223265901", "Inflated-Length": "26073", "Deflate-Length": "4463", "Footer-Length": "8" } }, "Envelope": { "ARC-Header-Length": "102", "ARC-Header-Metadata": { "Date": "20080509081524", "Target-URI": "http://www.ukhomeinteriors.co.uk/content/ext_corbels.php", "Content-Length": "25970", "Content-Type": "text/html", "IP-Address": "83.223.106.10" }, "Payload-Metadata": { "Actual-Content-Type": "application/http; msgtype=response", "Block-Digest": "sha1:MCCZNOKBJHTZ5MMMCUJGBPE25C2TVUWF", "HTTP-Response-Metadata": { "Headers-Length": "591", "HTML-Metadata": { "Head": { "Title": "Exterior Corbels",
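Because the JSON layout differs between ARC-derived and WARC-derived records, code that reads the metadata has to check which envelope it was given. The sketch below shows one way to do that. The ARC field names are taken from the excerpt above; the WARC field names (`WARC-Header-Metadata`, `WARC-Target-URI`, `WARC-Date`) are an assumption based on the usual WAT layout, and the function name and the second sample record are hypothetical.

```python
def target_and_date(wat_record: dict) -> tuple:
    """Return (target URI, capture date string) from a WAT-style record.

    Illustrative only: the WARC branch assumes the common WAT key names,
    which are not shown in the slide excerpt.
    """
    envelope = wat_record["Envelope"]
    if "ARC-Header-Metadata" in envelope:
        meta = envelope["ARC-Header-Metadata"]
        return meta["Target-URI"], meta["Date"]
    meta = envelope["WARC-Header-Metadata"]
    return meta["WARC-Target-URI"], meta["WARC-Date"]


# ARC-derived record, reduced to the fields visible in the excerpt above.
arc_record = {
    "Envelope": {
        "ARC-Header-Metadata": {
            "Date": "20080509081524",
            "Target-URI": "http://www.ukhomeinteriors.co.uk/content/ext_corbels.php",
            "Content-Type": "text/html",
        }
    }
}

# Hypothetical WARC-derived record, for contrast.
warc_record = {
    "Envelope": {
        "WARC-Header-Metadata": {
            "WARC-Date": "2010-12-05T02:58:00Z",
            "WARC-Target-URI": "http://www.example.co.uk/",
        }
    }
}

print(target_and_date(arc_record))
print(target_and_date(warc_record))
```

Note that the two record types also use different date formats, which is one reason dates need to be standardised later.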
Plain text lists
Build own ad hoc Hadoop cluster, fix incompatibilities, divide into smaller batches
– Build plain text lists of pages and hyperlinks
– Remove error pages (e.g., 404 Not Found)
– Remove pages not in .uk
– Standardize dates (many formats)
– Standardize hyperlinks (trailing /, etc.)
– Fix/remove tons of invalid hyperlinks (whitespace, invalid characters, etc.)
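The cleaning steps listed above can be sketched roughly as follows. This is a minimal single-machine illustration, not the Hadoop jobs the project actually ran: the function names, the choice of validity checks, and the assumption that all dates are UTC are made up for this example. The two date formats handled are the ones that appear in the ARC and WARC samples earlier in the deck.

```python
from datetime import datetime, timezone
from urllib.parse import urlsplit, urlunsplit

# Date layouts seen in the ARC ("20080509081524") and WARC
# ("2010-12-05T02:58:00Z") samples; real data has many more.
DATE_FORMATS = ("%Y%m%d%H%M%S", "%Y-%m-%dT%H:%M:%SZ")


def to_epoch(raw: str):
    """Standardise a date string to Unix seconds (assumed UTC), or None."""
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(raw, fmt)
        except ValueError:
            continue
        return int(parsed.replace(tzinfo=timezone.utc).timestamp())
    return None


def is_error_page(status: int) -> bool:
    """Treat HTTP 4xx/5xx responses (e.g. 404 Not Found) as error pages."""
    return status >= 400


def normalise_link(url: str):
    """Return a cleaned .uk URL, or None if it is invalid or outside .uk."""
    url = url.strip()
    if not url or any(ch.isspace() for ch in url):
        return None  # embedded whitespace: treat as invalid
    try:
        parts = urlsplit(url)
        host = parts.hostname
    except ValueError:
        return None
    if parts.scheme not in ("http", "https") or not host:
        return None
    if not host.endswith(".uk"):
        return None  # keep only pages in the .uk domain
    path = parts.path or "/"  # add the trailing slash to bare hosts
    # Drop the fragment and lower-case the host part.
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))


print(normalise_link("http://www.ox.ac.uk"))
print(normalise_link("http://debian.org/"))
print(to_epoch("20080509081524") == to_epoch("2008-05-09T08:15:24Z"))
```

Returning `None` for anything that cannot be repaired keeps the filtering decision in one place; a production job would probably also log what was discarded and why.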
Load results into Apache Hive (2.5 TB)
Source                                           Destination                                    Time        LinkText
http://octopus.well.ox.ac.uk:80/                 http://octopus.well.ox.ac.uk:80/links.html     1032758438  Links
http://octopus.well.ox.ac.uk:80/                 http://octopus.well.ox.ac.uk:80/projects.html  1001793436  Projects
http://octopus.well.ox.ac.uk:80/computing.shtml  http://debian.org/                             1075794060  Debian/GNU
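Each row above has four whitespace-separated fields. The sketch below reads one such row into a small record and converts the Time column to a UTC datetime. It assumes the Time values are Unix timestamps in seconds, which fits their magnitude but is not stated on the slide; the `Link` type and `parse_row` function are hypothetical names for this example.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Link(NamedTuple):
    source: str
    destination: str
    time: datetime
    text: str


def parse_row(row: str) -> Link:
    """Split one link row into its four fields.

    The split is limited to three cuts so that link text containing
    spaces stays in one piece. Time is assumed to be Unix seconds.
    """
    source, destination, stamp, text = row.split(None, 3)
    when = datetime.fromtimestamp(int(stamp), tz=timezone.utc)
    return Link(source, destination, when, text.strip())


link = parse_row(
    "http://octopus.well.ox.ac.uk:80/ "
    "http://octopus.well.ox.ac.uk:80/links.html 1032758438 Links"
)
print(link.destination, link.time.isoformat(), link.text)
```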
Overall Statistics
Third-level domains, e.g. ox.ac.uk
Relative size of second-level-domains
Number of links within SLD per node
Cross-domain links (2010): absolute, and normalized to target size
Case of ac.uk
Mapping the UK Webspace: Fifteen Years of British Universities on the Web
Hale et al., WebSci'14, available: http://arxiv.org/abs/1405.2856
121 UK university websites and links
1) League table ranking
2) Group affiliation
3) Geographical location
Group Affiliations
League table ranking
Geography
Colour ~ intensity
Gravity Law
$$\sigma_{ij} = \frac{s_{ij}}{s_i^{\mathrm{out}}\, s_j^{\mathrm{in}}}$$
$$s_{ij} = \frac{s_i^{\mathrm{out}}\, s_j^{\mathrm{in}}}{r^{0.28}}$$
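To make the normalisation on this slide concrete, the sketch below computes a normalised link strength for each pair of sites as the observed link count divided by the product of the source's total outgoing links and the target's total incoming links. This reading of the slide's symbols is an assumption (the slide does not define them), and the site names and link counts are invented purely for illustration.

```python
from collections import defaultdict


def normalised_strengths(links: dict) -> dict:
    """Compute sigma_ij = s_ij / (s_i_out * s_j_in) for every linked pair.

    `links` maps (source, target) to a link count s_ij. Out- and
    in-strengths are taken as row and column sums, an assumption
    about how the slide defines them.
    """
    out_strength = defaultdict(int)
    in_strength = defaultdict(int)
    for (source, target), count in links.items():
        out_strength[source] += count
        in_strength[target] += count
    return {
        (source, target): count / (out_strength[source] * in_strength[target])
        for (source, target), count in links.items()
    }


# Invented link counts between three hypothetical university sites.
example_links = {
    ("uni_a", "uni_b"): 30,
    ("uni_a", "uni_c"): 10,
    ("uni_b", "uni_a"): 20,
}

sigma = normalised_strengths(example_links)
for pair, value in sorted(sigma.items()):
    print(pair, round(value, 4))
```

Under the gravity law on the slide, these normalised values would then be compared against the distance between each pair of institutions.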
Big UK Domain Data for the Arts and Humanities
Primary aim: developing a methodological and theoretical framework within which to study over 15 years of UK domain data – with lessons for the future study of web archives more generally
The dataset:
– Crawled from 1996 – 2013
– Approximately 65 TB, billions of words
– Building interface to allow search by retrieval date, target domain of links, sentiment
– Allow qualitative and quantitative analysis – and iteration between multiple research techniques
Key outputs:
– Ten bursary projects using web archive data to investigate a broad range of topics, for example…
• Armed services recruitment online
• The accessibility of the web for disabled users
• Online discussions of ‘Beat’ poetry
– An edited book of empirical studies concerning the history of the UK web, featuring chapters on, for example…
• Constitutional and institutional change in UK government
• The BBC’s online presence
• The ‘web of faith’ online
Next
● Studies underway at OII, BL, IHR
● Book and articles
– Study overall growth of .uk
– Case study of .gov.uk
– Study of media and select committee visibility
● Releasing data open source