
Upload: scott-a-hale

Post on 13-Jul-2015


Page 1: Mapping the UK Webspace: Fifteen Years of British Universities on the Web

Mapping the UK Webspace: Fifteen Years of British Universities on the Web

Scott A. Hale, Taha Yasseri, Josh Cowls, Eric T. Meyer, Ralph Schroeder, Helen Margetts

@computermacgyve, @etmeyer

With our thanks to Ning Wang, Adham Tamer, Andreas Kaltenbrunner, and our reviewers.

Page 2

Background, Motivation

• Web archives under-used

• Few longitudinal studies of the Web

• Clear division of sites by second level domain within .uk

– Academic websites (.ac.uk)

– Government websites (.gov.uk)

– Commercial websites (.co.uk)

Page 3

Web Archive Dataset Preparation

30 TB compressed data

6.2 TB metadata and links

2.5 TB temporal links

Page 4

WARC/1.0

WARC-Type: response

WARC-Target-URI: http://hits.guardian.co.uk/b/ss/guardiangu-blogs,guardiangu-news,guardiangu-network/1/H.22.2/56938?ns=guardian&pageName=Prisoner+of+war+camps+in+the+UK+mapped+and+listed.+Download+the+data%3AGraphic%3A1476560&ch=News&c3=GU.co.uk&c4=History+%28Books+genre%29%2CBooks%2CSecond+world+war+%28News%29%2CGermany%2CUK+news%2CTechnology&c5=Not+commercially+useful%2CCorporate+IT&c6=Simon+Rogers&c7=10-Nov-08&c8=1476560&c9=Graphic&c10=Blogpost&c11=News&c13=&c25=Datablog&c30=content&h2=GU%2FNews%2Fblog%2FDatablog&c2=GUID:(none)

WARC-Date: 2010-12-05T02:58:00Z

WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ

WARC-IP-Address: 66.235.138.18

WARC-Record-ID: <urn:uuid:7d5ce147-9b4b-46cb-8975-ee93b4d0dda8>

Content-Type: application/http; msgtype=response

Content-Length: 740

HTTP/1.1 302 Found

Date: Sun, 05 Dec 2010 02:58:00 GMT

Server: Omniture DC/2.0.0

X-C: ms-4.3.1

Expires: Sat, 04 Dec 2010 02:58:00 GMT

Last-Modified: Mon, 06 Dec 2010 02:58:00 GMT

Cache-Control: no-cache, no-store, must-revalidate, max-age=0, proxy-revalidate, no-transform, private

Pragma: no-cache

ETag: "4CFAFFB8-0E4C-7443902F"

Vary: *

P3P: policyref="/w3c/p3p.xml", CP="NOI DSP COR NID PSA OUR IND COM NAV STA"

Location: http://b.scorecardresearch.com/r?c2=6035250&d.c=gif&d.o=guardiangu-network&d.x=243551159&d.t=page&d.u=http%3A%2F%2Fwww.guardian.co.uk%2Fnews%2Fdatablog%2F2010%2Fnov%2F08%2Fprisoner-of-war-camps-uk

xserver: www422

Content-Length: 0

Keep-Alive: timeout=15

Connection: close

Content-Type: text/plain
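The record above has two header blocks: WARC headers first, then the captured HTTP response. A minimal sketch of splitting such a record into its parts follows. The function names `parse_header_block` and `parse_warc_response`, and the shortened `SAMPLE` record with its example.ac.uk URLs, are illustrative assumptions and not the tooling used for this study.

```python
def parse_header_block(block):
    """Return (first_line, headers) for one block of 'Name: value' lines."""
    lines = [ln for ln in block.strip().split("\n") if ln.strip()]
    first_line = lines[0].strip()
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, so URLs and urn:uuid values stay whole.
        name, sep, value = line.partition(":")
        if sep:
            # Repeated header names keep the last value seen.
            headers[name.strip().lower()] = value.strip()
    return first_line, headers


def parse_warc_response(record):
    """Split a WARC 'response' record into WARC headers and HTTP headers."""
    text = record.replace("\r\n", "\n").strip()
    # A blank line separates the WARC headers from the HTTP response.
    warc_block, _, rest = text.partition("\n\n")
    # A second blank line, if present, separates HTTP headers from the body.
    http_block = rest.partition("\n\n")[0]
    version, warc_headers = parse_header_block(warc_block)
    status_line, http_headers = parse_header_block(http_block)
    status_code = int(status_line.split()[1])
    return {
        "version": version,
        "warc": warc_headers,
        "status": status_code,
        "http": http_headers,
    }


# Hypothetical, shortened record in the same shape as the one on the slide.
SAMPLE = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: http://www.example.ac.uk/\r\n"
    "WARC-Date: 2010-12-05T02:58:00Z\r\n"
    "Content-Type: application/http; msgtype=response\r\n"
    "\r\n"
    "HTTP/1.1 302 Found\r\n"
    "Location: http://www.example.ac.uk/home\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)

record = parse_warc_response(SAMPLE)
```

Fields such as `record["warc"]["warc-date"]`, `record["status"]`, and `record["http"]["location"]` are the kind of metadata a link-extraction step would keep.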

Page 5

Plain text lists

Hadoop cluster: address incompatibilities, divide into smaller batches

– Build plain text lists of pages and hyperlinks

– Remove error pages (e.g., 404 Not Found)

– Remove pages not in .uk

– Standardize dates (many formats)

– Standardize hyperlinks (trailing /, etc.)

– Fix/remove invalid hyperlinks (whitespace, invalid characters, etc.)

Load results into Apache Hive (2.5 TB)
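The cleaning steps listed above can be sketched in a few small functions. This is an illustration under stated assumptions, not the Hadoop jobs used in the study: the three date formats, the choice to strip trailing slashes and drop ports and fragments, and all function names are hypothetical.

```python
from datetime import datetime, timezone
from urllib.parse import urlsplit, urlunsplit

# Assumed sample of formats; the slide only says "many formats".
DATE_FORMATS = (
    "%Y-%m-%dT%H:%M:%SZ",         # WARC-Date style
    "%Y%m%d%H%M%S",               # compact 14-digit timestamp
    "%a, %d %b %Y %H:%M:%S GMT",  # HTTP Date header style
)


def standardize_date(text):
    """Return a UTC datetime, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
        return parsed.replace(tzinfo=timezone.utc)
    return None


def standardize_link(url):
    """Return a normalized http(s) URL, or None if the link is invalid."""
    url = url.strip()
    if not url or any(ch.isspace() for ch in url):
        return None  # embedded whitespace: treat as invalid
    try:
        parts = urlsplit(url)
    except ValueError:
        return None
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return None
    host = parts.hostname.lower().rstrip(".")
    # Treat "/about/" and "/about" as the same page; keep "/" for the root.
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, host, path, parts.query, ""))


def in_uk(url):
    """True if the URL's host is under the .uk top-level domain."""
    host = urlsplit(url).hostname or ""
    return host.lower().endswith(".uk")


def keep_page(status_code, url):
    """Drop error pages (4xx/5xx) and pages outside .uk."""
    cleaned = standardize_link(url)
    return cleaned is not None and status_code < 400 and in_uk(cleaned)
```

For example, `standardize_link("http://WWW.Ox.ac.uk/about/")` and `standardize_link("http://www.ox.ac.uk/about")` yield the same string, so both count as one link target.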

Page 6

Network construction

• Grouped to 3rd level domain (e.g., ox.ac.uk)

• Grouped pages crawled at similar times (within 1,000 seconds)

• Edge weight between any two domains for a given year is the largest number of hyperlinks between those two domains for any group that year

Example: hmrc.gov.uk and ox.ac.uk, edge weights by year:

(2005, 2), (2006, 8), ..., (2010, 13)
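One possible reading of the three construction rules, as a sketch. Several details are assumptions: crawl times are bucketed into fixed 1,000-second windows (the slide does not say how "similar times" were grouped), edges are treated as directed, and the input is a hypothetical list of `(unix_timestamp, source_host, target_host)` tuples.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone


def third_level_domain(host):
    """Reduce a host name to its last three labels, e.g. www.ox.ac.uk -> ox.ac.uk."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-3:])


def yearly_edge_weights(links, window=1000):
    """Map (year, source, target) to the largest per-group hyperlink count.

    links: iterable of (unix_timestamp, source_host, target_host).
    A "group" here is a fixed window of `window` seconds (an assumption).
    """
    per_group = Counter()
    for ts, src_host, dst_host in links:
        src = third_level_domain(src_host)
        dst = third_level_domain(dst_host)
        year = datetime.fromtimestamp(ts, tz=timezone.utc).year
        per_group[(year, ts // window, src, dst)] += 1

    weights = defaultdict(int)
    for (year, _group, src, dst), count in per_group.items():
        # Keep the largest group count seen for this pair in this year.
        weights[(year, src, dst)] = max(weights[(year, src, dst)], count)
    return dict(weights)


# Hypothetical links: two in one crawl window, one in a later window.
base = int(datetime(2005, 6, 1, tzinfo=timezone.utc).timestamp())
base -= base % 1000  # align to a window boundary for a predictable example
links = [
    (base, "www.hmrc.gov.uk", "www.ox.ac.uk"),
    (base + 10, "www.hmrc.gov.uk", "users.ox.ac.uk"),
    (base + 5000, "hmrc.gov.uk", "www.ox.ac.uk"),
]
weights = yearly_edge_weights(links)
```

Taking the maximum over groups, instead of the sum, avoids counting the same hyperlinks again each time a site was re-crawled within a year.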

Page 7

Limitations

• Boundary effects (.uk)

– Not really an issue for .ac.uk

• Variable timing of captures

• Completeness

Page 8

Overall Statistics

Third-level domains, e.g., ox.ac.uk

Page 9

Relative size of second-level-domains

Page 10

Number of links within SLD per node

Page 11

Cross-domain links (2010)

Panels: absolute; normalized to target size

Page 12

Case of ac.uk

121 UK university websites and links

1) League table ranking

2) Group affiliation

3) Geographical location

Page 13

Group Affiliations

Page 14

League table ranking

Page 15

Geography

Colour ~ intensity

Page 16

Gravity Law

$\sigma_{ij} = \dfrac{s_{ij}}{s_i^{\mathrm{out}} \, s_j^{\mathrm{in}}}$

$s_{ij} = \dfrac{s_i^{\mathrm{out}} \, s_j^{\mathrm{in}}}{r^{0.28}}$
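A small sketch of how the normalized link strength and the distance exponent could be computed. The slide does not define its symbols, so this assumes that s_ij is the link weight from university i to j, that s_i^out and s_j^in are total outgoing and incoming weights, and that r is the distance between the two. The log-log least-squares fit is a generic choice and not necessarily the estimation method behind the 0.28 figure.

```python
import math
from collections import defaultdict


def normalized_strengths(weights):
    """sigma_ij = s_ij / (s_i_out * s_j_in) for each directed pair (i, j)."""
    s_out = defaultdict(float)
    s_in = defaultdict(float)
    for (i, j), w in weights.items():
        s_out[i] += w
        s_in[j] += w
    return {(i, j): w / (s_out[i] * s_in[j]) for (i, j), w in weights.items()}


def fit_distance_exponent(sigma, distance):
    """Estimate a in sigma ~ r**(-a) by least squares on log-log values."""
    pairs = [p for p in sigma if sigma[p] > 0 and distance.get(p, 0) > 0]
    xs = [math.log(distance[p]) for p in pairs]
    ys = [math.log(sigma[p]) for p in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # the slope is -a, so negate it


# Hypothetical link weights between three sites.
toy_weights = {("a", "b"): 2.0, ("a", "c"): 2.0, ("b", "c"): 4.0}
toy_sigma = normalized_strengths(toy_weights)

# Synthetic check: data generated with exponent 0.28 should recover it.
toy_distance = {("a", "b"): 10.0, ("a", "c"): 100.0, ("b", "c"): 50.0}
synthetic_sigma = {p: r ** -0.28 for p, r in toy_distance.items()}
exponent = fit_distance_exponent(synthetic_sigma, toy_distance)
```

Under the gravity law as written on the slide, dividing s_ij by the product of strengths leaves only the distance term, so sigma_ij would fall off as r to the power -0.28.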

Page 17

Summary

• University affiliations weakly reflected

• Correlation between network centrality and league table rankings increasing

• Physical distance still important
