Mapping the UK Webspace: Fifteen Years of British Universities on the Web
Scott A. Hale, Taha Yasseri, Josh Cowls, Eric T. Meyer, Ralph Schroeder, Helen Margetts
@computermacgyve, @etmeyer
With our thanks to Ning Wang, Adham Tamer, Andreas Kaltenbrunner, and our reviewers.
Background, Motivation
• Web archives under-used
• Few longitudinal studies of the Web
• Clear division of sites by second level domain within .uk
– Academic websites (.ac.uk)
– Government websites (.gov.uk)
– Commercial websites (.co.uk)
Web Archive Dataset Preparation
30 TB compressed data
6.2 TB metadata and links
2.5 TB temporal links
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://hits.guardian.co.uk/b/ss/guardiangu-blogs,guardiangu-news,guardiangu-network/1/H.22.2/56938?ns=guardian&pageName=Prisoner+of+war+camps+in+the+UK+mapped+and+listed.+Download+the+data%3AGraphic%3A1476560&ch=News&c3=GU.co.uk&c4=History+%28Books+genre%29%2CBooks%2CSecond+world+war+%28News%29%2CGermany%2CUK+news%2CTechnology&c5=Not+commercially+useful%2CCorporate+IT&c6=Simon+Rogers&c7=10-Nov-08&c8=1476560&c9=Graphic&c10=Blogpost&c11=News&c13=&c25=Datablog&c30=content&h2=GU%2FNews%2Fblog%2FDatablog&c2=GUID:(none)
WARC-Date: 2010-12-05T02:58:00Z
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC-IP-Address: 66.235.138.18
WARC-Record-ID: <urn:uuid:7d5ce147-9b4b-46cb-8975-ee93b4d0dda8>
Content-Type: application/http; msgtype=response
Content-Length: 740
HTTP/1.1 302 Found
Date: Sun, 05 Dec 2010 02:58:00 GMT
Server: Omniture DC/2.0.0
X-C: ms-4.3.1
Expires: Sat, 04 Dec 2010 02:58:00 GMT
Last-Modified: Mon, 06 Dec 2010 02:58:00 GMT
Cache-Control: no-cache, no-store, must-revalidate, max-age=0, proxy-revalidate, no-transform, private
Pragma: no-cache
ETag: "4CFAFFB8-0E4C-7443902F"
Vary: *
P3P: policyref="/w3c/p3p.xml", CP="NOI DSP COR NID PSA OUR IND COM NAV STA"
Location: http://b.scorecardresearch.com/r?c2=6035250&d.c=gif&d.o=guardiangu-network&d.x=243551159&d.t=page&d.u=http%3A%2F%2Fwww.guardian.co.uk%2Fnews%2Fdatablog%2F2010%2Fnov%2F08%2Fprisoner-of-war-camps-uk
xserver: www422
Content-Length: 0
Keep-Alive: timeout=15
Connection: close
Content-Type: text/plain
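To make the record layout above concrete, here is a minimal sketch of reading the WARC header block into a dictionary. The function name `parse_warc_headers` and the shortened `sample` string are illustrative assumptions, not part of the authors' pipeline; the sketch ignores header continuation lines and the HTTP payload, which a dedicated WARC library would handle.

```python
def parse_warc_headers(block):
    """Parse the header block of one WARC record into (version, fields).

    Illustrative only: assumes one 'Name: value' pair per line and stops
    at the first blank line, which ends the WARC header block.
    """
    lines = block.strip().splitlines()
    version, fields = lines[0].strip(), {}
    for line in lines[1:]:
        if not line.strip():
            break
        # Split on the first colon only, so values such as timestamps
        # keep their own colons.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return version, fields


# Shortened copy of the record shown above (hypothetical subset of fields).
sample = """WARC/1.0
WARC-Type: response
WARC-Date: 2010-12-05T02:58:00Z
WARC-IP-Address: 66.235.138.18
Content-Length: 740
"""

version, fields = parse_warc_headers(sample)
```

Fields such as `WARC-Target-URI` and `WARC-Date` are the ones the later link and date clean-up steps would draw on.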
Processing on a Hadoop cluster to produce plain text lists: address incompatibilities, divide into smaller batches
– Build plain text lists of pages and hyperlinks
– Remove error pages (e.g., 404 Not Found)
– Remove pages not in .uk
– Standardize dates (many formats)
– Standardize hyperlinks (trailing /, etc.)
– Fix/remove invalid hyperlinks (whitespace, invalid characters, etc.)
Load results into Apache Hive (2.5 TB)
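The hyperlink clean-up steps listed above might look roughly like the following sketch. It is not the authors' code: the function name `normalize_link` and the specific rules (lower-casing the host, dropping fragments, ports, and trailing slashes, rejecting any link containing whitespace or control characters) are assumptions chosen to illustrate the kind of standardization the slide describes.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_link(url):
    """Return a cleaned URL, or None if it is invalid or outside .uk.

    Assumed rules, for illustration only.
    """
    url = url.strip()
    # Reject links with embedded whitespace or control characters.
    if not url or any(ch.isspace() or ord(ch) < 32 for ch in url):
        return None
    try:
        parts = urlsplit(url)
    except ValueError:
        return None
    host = (parts.hostname or "").lower().rstrip(".")
    if parts.scheme.lower() not in ("http", "https") or not host:
        return None
    # Keep only hosts under the .uk top-level domain.
    if not host.endswith(".uk"):
        return None
    # Treat "/page/" and "/page" as the same resource; keep "/" for the root.
    path = parts.path.rstrip("/") or "/"
    # Rebuilding from the bare host drops any port or user info;
    # the empty last element drops the fragment.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))
```

For instance, `normalize_link(" http://WWW.Ox.ac.uk/about/ ")` and `normalize_link("http://www.ox.ac.uk/about")` both yield the same string, so the two spellings count as one link target.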
Network construction
• Grouped to 3rd level domain (e.g., ox.ac.uk)
• Grouped pages crawled at similar times (within 1,000 seconds)
• Edge weight between any two domains for a given year is the largest number of hyperlinks between those two domains for any group that year
Example edge between hmrc.gov.uk and ox.ac.uk, as (year, weight) pairs: (2005, 2), (2006, 8), ..., (2010, 13)
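The network construction rules above can be sketched as follows. This is one possible reading, not the authors' implementation: the input tuple layout, the function names, and the grouping rule (a group covers at most `window` seconds counted from its first capture, applied per domain pair and year) are all assumptions.

```python
from collections import defaultdict


def third_level_domain(host):
    """Reduce a host such as 'www.maths.ox.ac.uk' to 'ox.ac.uk'.

    Assumes the host sits under a second-level domain like .ac.uk;
    hosts registered directly under .uk would need separate handling.
    """
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-3:])


def yearly_edge_weights(links, window=1000):
    """Compute yearly edge weights between third-level domains.

    links: iterable of (crawl_time_seconds, year, source_host, target_host),
    one tuple per observed hyperlink (hypothetical input layout).
    Returns {(source_domain, target_domain): {year: weight}}.
    """
    times_by_pair_year = defaultdict(list)
    for t, year, src, dst in links:
        pair = (third_level_domain(src), third_level_domain(dst))
        times_by_pair_year[(pair, year)].append(t)

    weights = defaultdict(dict)
    for (pair, year), times in times_by_pair_year.items():
        times.sort()
        best = count = 0
        group_start = None
        for t in times:
            # Start a new group once a capture falls outside the window.
            if group_start is None or t - group_start > window:
                group_start, count = t, 0
            count += 1
            # The edge weight is the largest group seen that year.
            best = max(best, count)
        weights[pair][year] = best
    return dict(weights)
```

Taking the maximum over groups, rather than the sum, avoids counting the same hyperlink repeatedly when a page is captured several times within one year.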
Limitations
• Boundary effects (.uk)
– Not really an issue for .ac.uk
• Variable timing of captures
• Completeness
Overall Statistics
Third-level domains (e.g., ox.ac.uk)
Relative size of second-level-domains
Number of links within SLD per node
Cross-domain links (2010): absolute, and normalized to target size
Case of ac.uk
121 UK university websites and links
1) League table ranking
2) Group affiliation
3) Geographical location
Group Affiliations
League table ranking
Geography
Colour ~ intensity
Gravity Law
$$\sigma_{ij} = \frac{s_{ij}}{s_i^{\mathrm{out}} s_j^{\mathrm{in}}}, \qquad s_{ij} = \frac{s_i^{\mathrm{out}} s_j^{\mathrm{in}}}{r^{0.28}}$$
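As a rough illustration of how the gravity law could be checked against data, the sketch below computes the normalized weight σ_ij for each edge and estimates the distance exponent by a least-squares fit in log-log space. The function names, the input dictionaries, and the definition of out- and in-strength as sums of edge weights are assumptions; the slide does not say how the 0.28 exponent was actually fitted.

```python
import math
from collections import defaultdict


def normalized_weights(edges):
    """edges: {(i, j): s_ij} with positive weights. Returns {(i, j): sigma_ij}.

    Assumes s_i^out and s_j^in are the total outgoing weight of i and
    the total incoming weight of j.
    """
    s_out, s_in = defaultdict(float), defaultdict(float)
    for (i, j), w in edges.items():
        s_out[i] += w
        s_in[j] += w
    return {(i, j): w / (s_out[i] * s_in[j]) for (i, j), w in edges.items()}


def fit_distance_exponent(sigma, distance):
    """Estimate a in sigma ~ r^(-a) from the slope of log(sigma) vs log(r).

    sigma and distance are dicts keyed by the same (i, j) pairs;
    distances must be positive.
    """
    xs = [math.log(distance[p]) for p in sigma]
    ys = [math.log(sigma[p]) for p in sigma]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    # The law decays with distance, so the exponent is the negated slope.
    return -slope
```

A fitted exponent near 0.28 would correspond to the decay shown on the slide; an exponent near zero would indicate that distance plays no role.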
Summary
• University affiliations weakly reflected
• Correlation between network centrality and league table rankings increasing
• Physical distance still important