edinburgh research explorer · williams, o & farquhar, a 2017, 'enabling complex analysis...

12
Edinburgh Research Explorer Enabling complex analysis of large-scale digital collections Citation for published version: Terras, M, Baker, J, Hetherington, J, Beavan, D, Zaltz Austwick, M, Welsh, A, O'Neill, H, Finley, W, Duke- Williams, O & Farquhar, A 2017, 'Enabling complex analysis of large-scale digital collections: Humanities research, high-performance computing, and transforming access to British Library digital collections', Digital Scholarship in the Humanities. https://doi.org/10.1093/llc/fqx020 Digital Object Identifier (DOI): 10.1093/llc/fqx020 Link: Link to publication record in Edinburgh Research Explorer Document Version: Publisher's PDF, also known as Version of record Published In: Digital Scholarship in the Humanities General rights Copyright for the publications made accessible via the Edinburgh Research Explorer is retained by the author(s) and / or other copyright owners and it is a condition of accessing these publications that users recognise and abide by the legal requirements associated with these rights. Take down policy The University of Edinburgh has made every reasonable effort to ensure that Edinburgh Research Explorer content complies with UK legislation. If you believe that the public display of this file breaches copyright please contact [email protected] providing details, and we will remove access to the work immediately and investigate your claim. Download date: 07. Jun. 2020

Upload: others

Post on 01-Jun-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Edinburgh Research Explorer · Williams, O & Farquhar, A 2017, 'Enabling complex analysis of large-scale digital collections: Humanities research, high-performance computing, and

Edinburgh Research Explorer

Enabling complex analysis of large-scale digital collections

Citation for published versionTerras M Baker J Hetherington J Beavan D Zaltz Austwick M Welsh A ONeill H Finley W Duke-Williams O amp Farquhar A 2017 Enabling complex analysis of large-scale digital collections Humanitiesresearch high-performance computing and transforming access to British Library digital collections DigitalScholarship in the Humanities httpsdoiorg101093llcfqx020

Digital Object Identifier (DOI)101093llcfqx020

LinkLink to publication record in Edinburgh Research Explorer

Document VersionPublishers PDF also known as Version of record

Published InDigital Scholarship in the Humanities

General rightsCopyright for the publications made accessible via the Edinburgh Research Explorer is retained by the author(s)and or other copyright owners and it is a condition of accessing these publications that users recognise andabide by the legal requirements associated with these rights

Take down policyThe University of Edinburgh has made every reasonable effort to ensure that Edinburgh Research Explorercontent complies with UK legislation If you believe that the public display of this file breaches copyright pleasecontact openaccessedacuk providing details and we will remove access to the work immediately andinvestigate your claim

Download date 07 Jun 2020

Enabling complex analysis of large-scale digital collections humanitiesresearch high-performance com-puting and transforming access toBritish Library digital collections

Melissa Terras

DepartmentofInformationStudiesUniversityCollegeLondonUKand

UCL Centre for Digital Humanities University College London UK

James Baker

School of History Art History and Philosophy University of

Sussex UK

James Hetherington

Research Software Development Group Research IT Services

University College London UK

David Beavan

UCL Centre for Digital Humanities University College London UK

Martin Zaltz Austwick

Centre for Advanced Spatial Analysis University College London UK

Anne Welsh

Department of Information Studies University College London UK

Helen OrsquoNeill

Department of Information Studies University College London

UK The London Library UK

Will Finley

Department of History University of Sheffield UK

Oliver Duke-Williams

Department of Information Studies University College London UK

Adam Farquhar

Digital Scholarship British Library UK

Digital Scholarship in the Humanities The Author 2017 Published by Oxford University Press on behalf of EADHThis is an Open Access article distributed under the terms of the Creative Commons Attribution License (httpcreativecommonsorglicensesby40) which permits unrestricted reuse distribution and reproduction in any medium provided the original work isproperly cited

1 of 11

doi101093llcfqx020Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

AbstractAlthough there has been a drive in the cultural heritage sector to provide large-scale open data sets for researchers we have not seen a commensurate rise inhumanities researchers undertaking complex analysis of these data sets for theirown research purposes This article reports on a pilot project at UniversityCollege London working in collaboration with the British Library to scopeout how best high-performance computing facilities can be used to facilitatethe needs of researchers in the humanities Using institutional data-processingframeworks routinely used to support scientific research we assisted four huma-nities researchers in analysing 60000 digitized books and we present two result-ing case studies here This research allowed us to identify infrastructural andprocedural barriers and make recommendations on resource allocation to bestsupport non-computational researchers in undertaking lsquobig datarsquo research Werecommend that research software engineer capacity can be most efficiently de-ployed in maintaining and supporting data sets while librarians can provide anessential service in running initial routine queries for humanities scholars Atpresent there are too many technical hurdles for most individuals in the huma-nities to consider analysing at scale these increasingly available open data setsand by building on existing frameworks of support from research computing andlibrary services we can best support humanities scholars in developing methodsand approaches to take advantage of these research opportunities

1 Introduction

How best can humanities researchers access andanalyse large-scale digital data sets available frominstitutions in the cultural and heritage sectorWhat barriers remain in place for those from thehumanities wishing to use high-performance com-puting (HPC) to provide insights into historicaldata sets using lsquobig-datarsquo analytical techniquesThis article describes a pilot project that workedin collaboration with non-computationally trainedhumanities researchers to identify and overcomebarriers to complex analysis of large-scale digitalcollections It used institutional university frame-works that routinely support the processing oflarge-scale data sets for research purposes in thesciences The project brought together humanitiesresearchers research software engineers (Hettrick2016) and information professionals from theBritish Library Digital Scholarship Department1

University College London (UCL) Centre forDigital Humanities2 UCL Centre for AdvancedSpatial Analysis3 and UCL Research IT Services(UCL RITS)4 to analyse an open-licenced large-scale data set from the British Library While

useful research results were generated undertakingthis project clarified the technical and proceduralbarriers that exist when humanities researchers at-tempt to utilize computational research infrastruc-tures in the pursuit of their own research questions

2 Overview

The drive in the gallery library archive andmuseum (GLAM) sector towards opening up collec-tions data5 as well as the growth in data publishedby publicly funded research projects means huma-nities researchers have a wealth of large-scale digitalcollections available to them (Lui 2015 Terras2015) Many of these data sets are released underopen licences that permit uninhibited use by anyonewith an Internet collection and modest storage cap-acity A few humanities researchers have exploitedthese resources and their interpretations makeclaims that change our understanding of culturalphenomena (Smith et al 2013 Schmidt 2014Smith et al 2015 Huber 2007 Leetaru 2015)Nevertheless there remain major barriers to thewidespread uptake of these data sets and related

Correspondence

Melissa Terras UCL

Department of Information

Studies Foster Court

University College London

Gower Street London

WC1E 6BT UK

E-mail

mterrasuclacuk

M Terras et al

2 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

computational approaches by humanities re-searchers which risks diminishing the relevance ofthe humanities in lsquobig datarsquo analysis (Wynne 2015)These barriers include

fragmentation of communities resources andtools

lack of interoperability complexity and incompleteness of heterogeneous

cultural heritage data sets (Terras 2009) and lack of technical skills lsquomainstream researchers

in the humanities and social sciences often donrsquotknow what the new possibilities arersquo (ibid) andseldom have the technical experience to experi-ment (Hughes 2009 Mahony and Pierazzo2012)

A common response to this lack of awarenessand computational skills is to build Web-basedinterfaces to data6 or federated services and infra-structures7 While these interfaces play a positiverole in introducing humanities researchers tolarge-scale digital collections they rarely fulfil thecomplex needs of humanities research which con-stantly questions received approaches and results orallow researchers to tailor analysis without beinglimited by shared assumptions and methods(Wynne 2013)

3 Method

We explored the challenges associated with deploy-ing and working with large-scale digital collectionssuitable for humanities research using a publicdomain digital collection provided by the BritishLibrary8 This circa 60000-book data set covers fic-tion and non-fiction publications from the 17th18th and 19th centuries ormdashseen as datamdash224 GB of compressed ALTO XML that includesboth content (captured using an Optical CharacterRecognition (OCR) process) and the location ofthat content on a page9 Using UCLrsquos centrallyfunded computing facilities10 we worked fromMarchndashJuly 2015 with UCL RITS and a cohort offour humanities researchers (from doctoral candi-dates to mid-career scholars) to ask queries thatcould not be satisfied by search- and discovery-

orientated graphical user interfaces Working in col-laboration we turned their research questions intocomputational queries explored ways in which thereturned data could be visualized and capturedtheir thoughts on the process through semi-struc-tured interviews

4 Results

We successfully ran queries across the data set thattracked linguistic change identified core phrasesplotted the placement of illustrations and mappedlocations mentioned within core texts The semi-structured interviews conducted with non-compu-tationally trained humanities researchers at variousstages during the collaborative work supported fourkey findings First that breaking down a researchquestion into a series of more defined computa-tional queries was time-consuming and challengingSecondly that the iterative nature of this researchmethodology puts pressure on the time taken toexecute queries and that long processing timeswere frustrating Thirdly that full comprehensionof the programming code was not necessary to pro-cess data and use their outputs in research thoughunderstanding the inputs outputs and effects ofparameters was required Fourthly that creatingderived data sets of a size manageable by desktopPCs11 opened up further investigation using estab-lished methods Indeed we found that buildingqueries that generate derived data sets from large-scale digital collections (small enough to be workedon locally with familiar tools) is an effective meansof empowering non-computationally trained huma-nities researchers to develop the skill sets required toundertake complex analysis of humanities data12

Our case studies deepen and add nuance to thesefindings Two of our case studies were interested inlooking at instances of particular words or phrasesin the corpus (for example lsquoprofessorrsquo) or particu-lar combinations of phrases within the corpus(lsquohigher educationrsquo) to identify a particular institu-tion and group of persons across time The require-ments from the researchers were to return thecomplete page of text that surrounded each ex-ample This was found to be technically quite

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 3 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

straightforward and resulted in a text file being de-livered to the Humanities Scholars which they couldthen lsquoclose readrsquo to analyse each instance of thesearch term within a given page of the book in thecorpus Analysis in this case entails finding instancesof the search term in question however there arefurther possibilities that can interrogate the data setfurther in procedurally and methodologically novelways We present here two more ambitious casestudies that allowed for further visualization andanalysis

41 Case Study 1 history of medicineDuke-Williams is a senior lecturer in DigitalInformation Studies in the Department ofInformation Studies at UCL13 and his researchinterests include the presentation of spatial dataand dissemination of demographic data and thepast present and future of demographic data

capture in the UK Visualization of these kinds ofdata can be used to explore issues around the spreadof diseases and the research questions were howdoes the occurrence of diseases in published litera-ture compare to known epidemics in the 19th cen-tury Can we see any correlation between theoccurrence of infectious diseases in society and ref-erence to these diseases in both fiction and non-fiction

Variations in the number of mentions of cholera(Fig 1 continuous black line) were compared torecorded epidemics (shaded bars on Fig 1)A sharp rise in mentions coincides with the firstcholera epidemic in the UK of 1831ndash32 a similarbut less pronounced rise is coincident with the1848ndash49 epidemic A more volatile pattern of men-tions is observed after this point subsequent spikesmay be associated with epidemics within andbeyond the UK or may be less directly related to

0

05

1

15

2

25

3

1790 1810 1830 1850 1870 1890 1910

Num

ber

of re

fere

nces

per

100

0 w

ords

Date of Publication

Disease References

year_cholera_epidemics Whooping Cholera Consumpon Measles

Fig 1 A search for mentions of various infectious diseases (cholera whooping cough consumption and measles)across the 60000-book data set We compared the profound spikes for cholera in the data set with known dataregarding epidemics in the UK (Chadwick 1842 Wall 1893) which appear as the bars on the graph showing arelationship between the first major UK outbreak of cholera and its appearance within the written record of thetime (in 1831ndash32) and again with the second UK epidemic (1848ndash49) Later outbreaks (1853ndash54 and 1863) do notsee this same correlation There are further pronounced spikes for mentions of cholera in the1870s and 1880s these arenot associated with UK epidemics but there were outbreaks in the USA and elsewhere Identifying the texts that refer tothese outbreaks allows us to look more closely at these clusters and to understand the relationship between publichealth epidemiology and the published historical record

M Terras et al

4 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

disease incidence Identifying the range and type oftexts (whether epidemiological reports or worksaimed at a wider audience) may help to informand understand the cultural response to disease

This work opens up possibilities for our under-standing of trends in both fiction and non-fictionand could be linked into further data sets (for ex-ample of digitized historical newspaper data)In the case of our pilot project it demonstratedthat we could graph and visualize searches basedon the corpus to present overviews that wereuseful to our researcher but only in conjunctionwith both our research software engineer and ourinformation visualization expert this service thenmdashas a result of the person hours requiredmdashdoes notscale in practice demonstrating both the potentialin the data set and the current limited opportunitieshistorians epidemiologists and historians of sci-ence have to generate such visualizations fromopen-licenced data sets

42 Case Study 2 the history of imagesFinley is a doctoral candidate on the British Libraryand University of Sheffield Collaborative PhDStudentship lsquoThe Printed Image 1750ndash1850 towardsa Digital History of Printed Book Illustrationrsquo14

Between 1750 and 1850 changes in printing tech-nology enabled several kinds of image to proliferateand for image and text to be brought together innovel and unexpected ways Existing printing tech-nologiesmdashsuch as woodcutsmdashcontinued alongsidenew printing technologies shaping the dissemin-ation reuse and meaning of the designs they con-veyed (Stijnman 2012 Maidment 2013) Tounderstand these changes scholars have so farsampled small hand-crafted collections of imagesan approach repeated in the fields of art and culturalhistory (Donald 1996 Thomas 2004) Yet digitalsources allow us to study these changes with a muchlarger sample to use visual content as well as meta-data to grapple with past phenomena at scale

Finleyrsquos research focuses on the digital imagesfrom the same 60000-book data set our projectuses The research addresses questions such asHow did changes in image techniques and the sizeof images map onto the different genres over timeWhat do quantitative findings reveal about the

changing meanings of images from one genre tothe next How do the findings made possibleusing digital humanities techniques and digitalsources compare to those using traditional methodsand small hand-crafted collections

To support these research questions we queriedthe book XML to extract the coordinates of theboundary boxes put around each area the OCRprocess defined as an image The resulting deriveddata lists the title author place of publication andreference number for each book For each of the 1million images in these books the derived data liststhe page number it appears on x-position of thetop left corner y-position of the top left corner itswidth its height and its overall size as a percentageof the page We then took two approaches to turnFinleyrsquos research questions into computationalqueries

First we used the data derived from the HPC togenerate a graph of the instances of images by theirsize as a percentage of the page over time (Fig 2)This enabled Finley to observe the dominance of fullpage and very small images (lt15 of the page)between the 1750s and 1810s after which timemdashdriven by novel deployment of woodcuts andlithographs in booksmdashthe range of figure sizes

Fig 2 A search for figures between 1750 and 1850plotted according to the size of each figure in relationto the size of the page

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 5 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

diversified Although the graph is not normalized bythe number of images in the data set for each yearand is therefore dominated by the greater volume ofbooks in the data set after 1800 it has proved auseful reference and a new way into the macro pat-terns of book illustration

Secondly we wrote a script that could be runlocally in R to create graphs based on image datafor single books Plotting the page number on thex-axis and figures as a percentage of the page on they-axis the script generated a visual representation ofthe size of illustrations in a book Finley selectedbooks for analysis to observe patterns in the use ofillustrations in books on history geology and top-ography (the subjects of his doctoral research) Here(Fig 3) we see this for the 1817 A new and completeSystem of Modern Geography a two-volume workpublished in Newcastle upon Tyne by Mackenzie(1817) The discrepancies in use of illustrations be-tween the two volumes took Finley back to thephysical books to assess how the placement andsize of images changed the reading experience be-tween volumes and to compare the findings withsimilar multi-volume works

Subsequent to the project Finley has continuedto use the project data and scripts in his researchFor example he has used the image location data toplot and compare the average position images inbooks This has underscored the value of generatingderived data that can be used locally by a researcheroutside the context of a HPC facility and a fundedproject

5 Infrastructuralrecommendations

From a technical perspective this pilot highlightedvarious sticking points when using infrastructuredeveloped predominantly for scientific researchThe combined data input and output volumeundertaken during our work (less than 300 GB) isonly moderately large by comparison to the scien-tific data sets UCL RITS usually encounters for al-though there are shared assumptions betweenresearch infrastructures (adoption of technicalstandards and the sharing of tools approachesand research outputs (Wynne 2015)) most of theUKrsquos university eScience15 infrastructure has beenconstructed specifically to run scientific and engin-eering simulations not for search and analysis of thetypes of heterogeneous data sets we see emanatingfrom cultural heritage institutions We had a largetextual input (224 GB) a simple calculation and asmall output summary of only a few KB By com-parison the typical engineering simulationaddresses moderately sized numerical input dataruns a long complicated calculation and producesa large output (multiple TBs) The average data sizeof project using the UCL data storage service is44 TB (Hetherington 2017) For example thework of the UCL Centre for ComputationalScience16 on brain blood flow simulations takes aninput file of around 1 GB and for a full productionsimulation recording a snapshot once every 200time steps produces 20 TB of output (Groen etal 2013) Poor uptake in the Arts andHumanities (Atkins et al 2010 Voss et al 2010)has meant that these computational systems havenot been optimized for Arts and Humanities work-loads The file system and network configuration of

0

25

50

75

100

7505002500

Imag

e si

ze (

of p

age)

Page number

Illustration Area by Page for Book ID 002322267

002322267_1 002322267_2

Fig 3 A search for figures in A new and complete Systemof Modern Geography two volumes (Mackenzie 1817)plotted by page (x-axis) percentage of page the eachfigure occupies (y-axis) and separated by volume

M Terras et al

6 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

LegionmdashUCL RITSrsquos centrally funded resource forrunning complex and large computational scientificqueries across a large number of coresmdashdid notmatch the way that the data set in question wasstructured (a large number of small zipped XMLfiles)

The complexities associated with redeployingarchitectures designed to work with scientific data(massive yet very structured) to the processing ofhumanities data (not massive but more unstruc-tured) should not be understated and are a majorfinding of this project Relevant libraries (such as anefficient XML processor) were needed to be installedand optimized for the hardware Also the dataneeded to be transformed to a structure that the par-allel file system (Lustre) could address efficiently(that is fewer larger files) We found that the archi-tecture at UCL which was configured for effectivecompute of scientific data was inputoutput limitedfor our processing requirements rather than compu-tationally limited Understanding the needs of ouruser community has already fed into the procure-ment and development of HPC facilities at UCL toensure that the systemsmdashwhich are available to allresearchersmdashcan deal with the variety and type ofdata that digital humanists wish to analyse in future

Best practice recommendations for similar pro-jects emerged from this work the need to buildmultiple derived data sets (counts of books andwords per year words and pages per book etc) tonormalize results and maintain statistical validitythe necessity of documenting decisions taken whenprocessing data and metadata and the value ofhaving fixed definable data for researchers to ex-plain results in relation to (and in turn the risksassociated with iterating data sets) We also dis-covered that a core set of four or five queries gavemost of the humanities researchers the type of in-formation they required to take a subset of dataaway to process effectively themselves searches forall variants of a word searches that return keywordsin context traced over time NOT searches for aword or phrase that ignored another word orphrase searches for a word when in close proximityto a second word and searches based on imagemetadata It is the subset of the data set that mosthumanities scholars required and were happy to be

presented with for further analysis (with most re-searchers wishing to see their search term in contextpresented with the complete page of the text it wasfound within to allow informed understanding)

A main finding of this pilot was given mosthumanities researchers have a research problemthat can be facilitated by a standard set of queriesacross large-scale textual data that it would be moreefficient to train a focussed group of service pro-viders to be able to generate the results needed byresearchers than providing widespread training ofhumanities academics in this area HigherEducation already employs librarians to assist insearching and training for searching (informationliteracy) and providing this professional groupwith adjustable lsquorecipesrsquo for defined computationalqueries and background training on their use wouldsituate access to infrastructure in the resource towhich humanists already turn for assistancemdashtheirsubject librarianmdashand thereby normalize such com-putational work within the general humanitiesworkflow17 In turn research software engineerscould be invoked as collaborators for their expertisesuch as for developing more complex searchesbeyond the basic recipes rather than having torepeat the defined searches across data for differentresearchers which would allow limited resources tobe used efficiently and to build on existing frame-works of support from both the library and comput-ing services

Given issues in resourcing such facilities at everyUniversity it may be more efficient for multipleHigher Education Institutions to support a specialistservice perhaps under the umbrella of the likes ofJisc Historical Texts (httphistoricaltextsjiscacuk)or national or legal deposit libraries Expertise andapproaches if not the service itself could also befacilitated through the likes of Digital ResearchInfrastructure for the Arts and Humanities(DARIAH) (httpwwwdariaheu) and CommonLanguage Resources and Technology Infrastructure(CLARIN) (httpswwwclarineu) However localsupport for researchers wishing to utilize existingeScience technologies within the Higher Educationsector should be possible Such support enables Artsand Humanities researchers to develop ongoingmutually beneficial relationships with research

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 7 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

computing within their own institution This inturn can encourage other researchers to use theseresources (rather than them being only available as aspecialist service which users have to seek out)Research computing infrastructure across the uni-versity sector will not meet the needs of ArtsHumanities and Social Sciences researchers unlessacademics in these fields becomes active users ofthe systems and their requirements can be takeninto account going forward

6 Conclusion

We successfully mounted large-scale humanitiesdata on HPC University infrastructure in an inter-disciplinary project that required input from manyprofessionals to aid the humanities scholars intheir research tasks The collaborative approachwe undertook in this project is labour-intensiveand does not scale This should not however dis-courage the sector from taking this work forwardWe found that many research questions can beexpressed with similar computational queriesalbeit with parameters adjusted to suit We recom-mend therefore that Higher EducationInstitutions or HEI clusters looking to build cap-acity for enabling complex analysis of large-scaledigital collections by their non-computationallytrained humanities researchers should considerthe following activities

(1) Invest in research software engineer capacityto deploy and maintain openly licensed large-scale digital collections from across the GLAMsector to facilitate research in the arts huma-nities and social and historical sciences

(2) Invest in training library staff to run theseinitial queries in collaboration with huma-nities faculty to support work with subsetsof data that are produced and to documentand manage resulting code and derived data

Our pilot project demonstrates that there are atpresent too many technical hurdles for most indi-viduals in the arts and humanities to consider ana-lysing large-scale open data sets Those hurdles canbe removed with initial help in ingest and

deployment of the data and the provision of spe-cific structured training and support which willallow humanities researchers to get to a subset ofuseful data they can comfortably and more simplyprocess themselves without the need for extensivesupport While we together with our partners haveplans to continue expanding the range and depth ofresearch carried out on our chosen data set thisproject has signposted many of the barriers to en-courage greater uptake of lsquobig datarsquo research acrossthe Arts and Humanities These findings should beof use to researchers wishing to use comparableapproaches and to service providers in researchcomputing aiming to encourage the use of sharedcomputational facilities by the Arts and Humanitiescommunity

AcknowledgementsThis project was funded by a pilot grant from theJisc Research Data Spring (JISC 3548) between Apriland July 2015 See httpswwwjiscacukrdpro-jectsresearch-data-spring for further details Theauthors thank our colleagues in the British Libraryand UCL RITS (particularly Clare Gryce) for theirsupport Geoffrey Rockwell and Stefan Sinclair fordiscussions on the current limitations of text ana-lysis for individual humanities scholars and Scott BWeingart for his assistance

ReferencesAtkins D E Borgman C L Bindhoff N Ellisman

M Felman S Foster I and Heck A (2010)

lsquolsquoRCUK Review of e-Science 2009rsquorsquo Research Councils

UK httpswwwepsrcacuknewseventspubsrcuk-review-of-e-science-2009-building-a-uk-foundation-

for-the-transformative-enhancement-of-research-and-

innovation

Bates M J (1996) The Getty end-user online searching

project in the humanities report no 6 overview and

conclusions College and Research Libraries 57(6)514ndash23

Bedi S and Walde C (2016) Transforming roles

Canadian academic librarians embedded in faculty re-

search projects College and Research Libraries http

M Terras et al

8 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

crlacrlorgcontentearly20160322crl16-871

abstract

Bradigan P S and Mularski C A (1989) End-user

searching in a medical school curriculum an evaluated

modular approach Bulletin of the Medical Library

Association 77(4) 348ndash56

Brettle A Maden-Jenkins M Anderson L McNally

R Pratchett T Tancock J Thornton D and

Webb A (2011) Evaluating clinical librarian services

A systematic review Health Information and Libraries

Journal 28(1) 3ndash22

Burke J and Tumbleson B (2016) LMS embedded li-

brarianship and the educational role of librarians

Library Technology Reports 52(2) 5ndash9

Chadwick E (1842) Report on the sanitary condition of

the labouring population of Great Britain supplemen-

tary report on the results of special inquiry into the

practice of interment in towns (Vol 1) HM

Stationery Office

Donald D (1996) The Age of Caricature Satirical Prints

in the Reign of George III New Haven Published for the

Paul Mellon Centre for Studies in British Art by Yale

University Press

Farber M and Shoham S (2002) Users end-users

and end-user searchers of online information a his-

torical overview Online Information Review 26(2)

92ndash100

Feliu V and Frazer H (2012) Embedded librarians

teaching legal research as a lawyering skill Journal of

Legal Education 61(4) 540ndash59

Groen D Hetherington J Carver H B Nash R

W Bernabeu M O and Coveney P V (2013)

Analysing and modelling the performance of the

HemeLB lattice-Boltzmann simulation environ-

ment Journal of Computational Science 4(5) 412ndash

22

Hetherington J (2017) Question about data capacity

and use at UCL Response to Melissa Terras via email

27 January 2017

Hettrick S (2016) A not-so-brief history of Research

Software Engineers Software Sustainability Institute

httpswwwsoftwareacukblog2016-08-19-not-so-

brief-history-research-software-engineers

Huber M (2007) The Old Bailey Proceedings 1674-1834

Evaluating and annotating a corpus of 18th - and 19th-

century spoken English Studies in Variation Contacts and

Change in English 1 Annotating Variation and Change

httpwwwhelsinkifivariengseriesvolumes01huber

Hughes A (2009) Higher Education in a Web 20 WorldJISC httpwwwwebarchiveorgukwaybackarchive20140614042502httpwwwjiscacukpublicationsgeneralpublications2009heweb2aspx

Ikeda N R and Schwartz D G (1992) Impact ofend-user training on Pharmacy students a four-yearfollow-up study Bulletin of the Medical LibraryAssociation 80(2) 124ndash30

Lancaster F W Elzy C Zeter M J Metzler L andLow Y M (1994) Searching databases on CD-ROMcomparison of the results of end-user searching withresults from two modes of searching by skilled inter-mediaries RQ 33(3) 370ndash86

Leetaru K (2015) History as Big Data 500 Years of BookImages and Mapping Millions of Books Forbes Techhttpwwwforbescomsiteskalevleetaru20150916history-as-big-data-500-years-of-book-images-and-mapping-millions-of-books

Ludwig L Mixter J K and Emanuele M A (1988)User attitudes toward end-user literature searchingBulletin of the Medical Library Association 76(1)7ndash13

Lui A (2015) Data Collections and Datasets Curated byAlan Liu httpdhresourcesforprojectbuildingpbworkscomwpage69244469Data20Collections20and20Datasets

Mackenzie E (1817) A new and complete system ofmodern geography containing an accurate delineationof the world 2 Vols Printed and published byMackenzie and Dent Newcastle-upon-Tyne

Mahony S and Pierazzo E (2012) Teaching skills orteaching methodology In Hirsch B D (ed) DigitalHumanities Pedagogy Practices Principles and PoliticsOpen Book Publishers httpwwwopenbookpubl-isherscomproduct161digital-humanities-pedagogyndashpracticesndashprinciples-and-politics

Maidment B (2013) Comedy Caricature and the SocialOrder 1820-1850 Manchester Manchester UniversityPress

Markey K (2007a) Twenty-five years of end-user search-ing Part 1 research findings Journal of the AmericanSociety for Information Science and Technology 58(8)1071ndash81

Markey K (2007b) Twenty-five years of end-usersearching Part 2 future research directions Journal ofthe American Society for Information Science andTechnology 58(8) 1123ndash30

Murray T (2016)The forecast for special librariesJournal of Library Administration 56(2) 188ndash98

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 9 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

Rethlefsen M Farrell A Osterhaus Trzasko L and

Brigham T (2015) Librarian co-authors correlated

with higher quality reported search strategies in general

internal medicine systematic reviews Journal of Clinical

Epidemiology 68(6) 617ndash26

Rockwell G (2017) Question about maximum capacity

of Voyant tools Response to Melissa Terras via email 27

January 2017

Rockwell G and Sinclair S 2016 Hermeneutica

Computer-Assisted Interpretation in the Humanities

Cambridge MIT Press

Roper T (2015) The impact of the clinical librarian a

review Journal of EAHIL 11(4) 19ndash22

Schmidt B (2014) Shipping maps and how states see

Sapping Attention Blog httpsappingattentionblogspot

couk201403shipping-maps-and-how-states-seehtml

Sinclair S (2017) Question about maximum capacity of

Voyant tools Response to Melissa Terras via email 27

January 2017

Smith D A Cordell R and Dillon E M (2013)

Infectious texts modeling text reuse in nineteenth-

century newspapers In IEEE International Conference

on Big Data October 6-9 2013 Santa Clara

California IEEE pp 86ndash94 101109

BigData20136691675

Smith D Cordell R and Mullen A (2015)

Computational methods for uncovering reprinted

texts in Antebellum newspapers American Literary

History 27(3)

Starr S S and Renford B L (1987) Evaluation of a

program to teach health professionals to search

MEDLINE Bulletin of the Medical Library Association

75(3) 193ndash201

Stijnman A (2012) Engraving and Etching 1400-2000 A

History of the Development of Manual Intaglio

Printmaking Processes London Houten Netherlands

Archetype Publications in association with HES and

DE GRAAF Publishers

Straus S Eisinga A and Sackett D (2016) What

drove the evidence cart Bringing the library to the

bedside Journal of the Royal Society of Medicine

109(6) 241ndash7

Streipe T and Talley M (2013) Embedded librarian-

ship In Kroski E (ed) Law Librarianship in the Digital

Age Plymouth Scarecrow pp 13ndash30

Terras M (2009) Potentials and Problems in Applying High

Performance Computing for Research in the Arts and

Humanities Researching e-Science Analysis of Census

Holdings Digital Humanities Quarterly httpwwwdigi-

talhumanitiesorgdhqvol34000070000070html

Terras M (2015) Opening access to collections the

making and using of open digitised cultural content

Online Information Review 39(5) 733ndash52 httpwww

emeraldinsightcomdoifull101108OIR-06-2015-0193

Thomas J (2004) Pictorial Victorians The Inscription of

Values in Word and Image Athens Ohio University

Press

Voss A Asgari-Targhi M Procter R and

Fergusson D (2010) Adoption of e-Infrastructure

services configurations of practice Philosophical

Transactions Series A Mathematical Physical and

Engineering Sciences 368 4161ndash76 doi 101098

rsta20100162

Wall A J (1893) Asiatic cholera its history pathology

and modern treatment London H K Lewis

Wittek S Sinclair S and Milner M (2015) DREaM

Distant Reading Early Modernity Digital Humanities

2015 Global Digital Humanities Sydney Australia 29

Junendash3 July 2015 httpdh2015orgabstractsxml

WITTEK_Stephen_DREaM__Distant_Reading_Early_

ModerWITTEK_Stephen_DREaM__Distant_Reading_

Early_Modernityhtml

Wynne M (2013) The role of Clarin in digital trans-

formations in the humanities International Journal of

Humanities and Arts Computing 7(1-2) 89

Wynne M (2015) Big Data and Digital Transformations in

the Humanities are we there yet Textual Digital

Humanities and Social Sciences Data Aberdeen 21ndash22

September 2015 httpwwwslidesharenetmartinwynne

big-data-and-digital-transformations-in-the-humanities

Notes1 httpswwwbluksubjectsdigital-scholarship2 httpwwwuclacukdh3 httpwwwbartlettuclacukcasa4 httpwwwuclacukisdservicesresearch-it5 httpopenglamorg an initiative to promote free

and open access to digital cultural heritage data sets6 For example Mining the History of Medicine (http

nactemacukhom) or Language of the State of the

Union (httpwwwtheatlanticcompoliticsarchive

201501the-language-of-the-state-of-the-union

384575)7 For example CLARIN (httpclarineu) and

DARIAH (httpswwwdariaheu)

M Terras et al

10 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

8 The British Library has various digital data sets

including (but not limited to) 7 million pages of his-

toric newspapers 1 million out of copyright book il-

lustrations 100000s of scientific articles text from

over 60000 books 1000s of UK theses and various

digitized medieval manuscripts We chose here just

one of its large-scale data sets to work with in this

pilot phase For the terms under which the British

Library makes collections available see httpwww

blukaboutustermscopyright9 The books cover a wide range of subject areas includ-

ing philosophy history poetry and literature most

are in English For a full list of metadata for this book

collection see doi 1021250DB2110 httpswikircuclacukwikiLegion just one of the

HPC facilities available at UCL for researchers see

httpwwwuclacukisdservicesresearch-itresearch-

computing11 It is difficult to put a figure on what a manageable text

file size would be for a researcher if we take that as

being able to search through it on their own machine

without further technical assistance This is dependent

on access to processing power which varies consider-

ably between desktop and laptop computers and the

complexity of the search the researcher is intending to

carry out as well as the format and granularity of the

source text Most humanities researchers should now

be able to comfortably manipulate text files of around

100 MB if they have access to a modern machine the

amount of text those files will contain is dependent on

how complex the file formats are Programs and tools

available to assist in text analysis include Voyant Tools

(httpsvoyant-toolsorg) and R (httpswwwr-pro-

jectorg) The upper limit to using server-based

Voyant is 20 MB of text which should correlate to

around 1 million words depending on format

(Sinclair 2017) Voyant Server can be downloaded

and used locally with an upper limit of around

100 MB on a standard PCMac (Rockwell 2017)

Indexing engines such as Lucene (httplucene

apacheorg) can search larger amounts of text how-

ever visualization of results will require further pro-

cessing and can be complex For further exploration of

lsquocomputer-assisted interpretation in the humanitiesrsquo

(particularly using Voyant) see Rockwell and

Sinclair (2016) and an example of the approach of

generating manageable smaller corpora from a large-

scale Early Modern Sources can be found in Wittek et

al (2015)12 All code data visualizations and other outputs from

this pilot project are freely available at httpsgithub

comUCL-dataspring13 httpswwwuclacukdispeopleoliverdukewilliams14 httpswwwshefacukhistoryresearchstudentswil-

liam-finley15 For more on the UKrsquos eScience infrastructure see the

work of the eScience Institute httpwwwesiacuk

Plan-Europe is the Platform of National eScience

Centres in Europe (httpplan-europeeu) In the

USA the equivalent of eScience is known as

Cyberinfrastructure see the National Science

Foundationrsquos guides httpwwwnsfgovdivindex

jspdivfrac14ACI

16 httpccschemuclacuk17 This approach is similar to that employed in the 1980s

when computerized searching of specialist databases

was first available (Farber and Shoham 2002

Markey 2007a 2007b) At that time evaluations car-

ried out on programs to cascade train end users via

librarians reported increased user uptake and effi-

ciency as compared to direct end-user training by

database providers (see Starr and Renford 1987

Bradigan and Mularski 1989 Ikeda and Schwartz

1992 on the library-based training of medical staff to

use MEDLINE) Significantly studies showed that

while end users wanted to be able to search databases

themselves they also desired access to searches

mediated by librarians andor a team approach to

search utilizing the end userrsquos specific knowledge of

the language and jargon of their discipline and the

information professionalrsquos understanding of

BOOLEAN and other advanced search techniques

(Ludwig et al 1988 Lancaster et al 1994 Bates

1996) The move towards embedded librarians in

law practices (Feliu and Frazer 2012 Streipe and

Talley 2013 Murray 2016) and clinical librarians in

hospitals (Brettle et al 2011 Roper 2015 Straus et

al 2016) can be seen to be a modern iteration of a

team approach to search In Higher Education subject

librarians are well-positioned to form part of a research

team using computational methods andor provide

information literacy training that upskills humanists

in computational methods (Rethlefsen et al 2015

Bedi and Walde 2016 Burke and Tumbleson 2016)

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 11 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

Page 2: Edinburgh Research Explorer · Williams, O & Farquhar, A 2017, 'Enabling complex analysis of large-scale digital collections: Humanities research, high-performance computing, and

Enabling complex analysis of large-scale digital collections humanitiesresearch high-performance com-puting and transforming access toBritish Library digital collections

Melissa Terras

DepartmentofInformationStudiesUniversityCollegeLondonUKand

UCL Centre for Digital Humanities University College London UK

James Baker

School of History Art History and Philosophy University of

Sussex UK

James Hetherington

Research Software Development Group Research IT Services

University College London UK

David Beavan

UCL Centre for Digital Humanities University College London UK

Martin Zaltz Austwick

Centre for Advanced Spatial Analysis University College London UK

Anne Welsh

Department of Information Studies University College London UK

Helen OrsquoNeill

Department of Information Studies University College London

UK The London Library UK

Will Finley

Department of History University of Sheffield UK

Oliver Duke-Williams

Department of Information Studies University College London UK

Adam Farquhar

Digital Scholarship British Library UK

Digital Scholarship in the Humanities The Author 2017 Published by Oxford University Press on behalf of EADHThis is an Open Access article distributed under the terms of the Creative Commons Attribution License (httpcreativecommonsorglicensesby40) which permits unrestricted reuse distribution and reproduction in any medium provided the original work isproperly cited

1 of 11

doi101093llcfqx020Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

AbstractAlthough there has been a drive in the cultural heritage sector to provide large-scale open data sets for researchers we have not seen a commensurate rise inhumanities researchers undertaking complex analysis of these data sets for theirown research purposes This article reports on a pilot project at UniversityCollege London working in collaboration with the British Library to scopeout how best high-performance computing facilities can be used to facilitatethe needs of researchers in the humanities Using institutional data-processingframeworks routinely used to support scientific research we assisted four huma-nities researchers in analysing 60000 digitized books and we present two result-ing case studies here This research allowed us to identify infrastructural andprocedural barriers and make recommendations on resource allocation to bestsupport non-computational researchers in undertaking lsquobig datarsquo research Werecommend that research software engineer capacity can be most efficiently de-ployed in maintaining and supporting data sets while librarians can provide anessential service in running initial routine queries for humanities scholars Atpresent there are too many technical hurdles for most individuals in the huma-nities to consider analysing at scale these increasingly available open data setsand by building on existing frameworks of support from research computing andlibrary services we can best support humanities scholars in developing methodsand approaches to take advantage of these research opportunities

1 Introduction

How best can humanities researchers access andanalyse large-scale digital data sets available frominstitutions in the cultural and heritage sectorWhat barriers remain in place for those from thehumanities wishing to use high-performance com-puting (HPC) to provide insights into historicaldata sets using lsquobig-datarsquo analytical techniquesThis article describes a pilot project that workedin collaboration with non-computationally trainedhumanities researchers to identify and overcomebarriers to complex analysis of large-scale digitalcollections It used institutional university frame-works that routinely support the processing oflarge-scale data sets for research purposes in thesciences The project brought together humanitiesresearchers research software engineers (Hettrick2016) and information professionals from theBritish Library Digital Scholarship Department1

University College London (UCL) Centre forDigital Humanities2 UCL Centre for AdvancedSpatial Analysis3 and UCL Research IT Services(UCL RITS)4 to analyse an open-licenced large-scale data set from the British Library While

useful research results were generated undertakingthis project clarified the technical and proceduralbarriers that exist when humanities researchers at-tempt to utilize computational research infrastruc-tures in the pursuit of their own research questions

2 Overview

The drive in the gallery library archive andmuseum (GLAM) sector towards opening up collec-tions data5 as well as the growth in data publishedby publicly funded research projects means huma-nities researchers have a wealth of large-scale digitalcollections available to them (Lui 2015 Terras2015) Many of these data sets are released underopen licences that permit uninhibited use by anyonewith an Internet collection and modest storage cap-acity A few humanities researchers have exploitedthese resources and their interpretations makeclaims that change our understanding of culturalphenomena (Smith et al 2013 Schmidt 2014Smith et al 2015 Huber 2007 Leetaru 2015)Nevertheless there remain major barriers to thewidespread uptake of these data sets and related

Correspondence

Melissa Terras UCL

Department of Information

Studies Foster Court

University College London

Gower Street London

WC1E 6BT UK

E-mail

mterrasuclacuk

M Terras et al

2 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

computational approaches by humanities re-searchers which risks diminishing the relevance ofthe humanities in lsquobig datarsquo analysis (Wynne 2015)These barriers include

fragmentation of communities resources andtools

lack of interoperability complexity and incompleteness of heterogeneous

cultural heritage data sets (Terras 2009) and lack of technical skills lsquomainstream researchers

in the humanities and social sciences often donrsquotknow what the new possibilities arersquo (ibid) andseldom have the technical experience to experi-ment (Hughes 2009 Mahony and Pierazzo2012)

A common response to this lack of awarenessand computational skills is to build Web-basedinterfaces to data6 or federated services and infra-structures7 While these interfaces play a positiverole in introducing humanities researchers tolarge-scale digital collections they rarely fulfil thecomplex needs of humanities research which con-stantly questions received approaches and results orallow researchers to tailor analysis without beinglimited by shared assumptions and methods(Wynne 2013)

3 Method

We explored the challenges associated with deploy-ing and working with large-scale digital collectionssuitable for humanities research using a publicdomain digital collection provided by the BritishLibrary8 This circa 60000-book data set covers fic-tion and non-fiction publications from the 17th18th and 19th centuries ormdashseen as datamdash224 GB of compressed ALTO XML that includesboth content (captured using an Optical CharacterRecognition (OCR) process) and the location ofthat content on a page9 Using UCLrsquos centrallyfunded computing facilities10 we worked fromMarchndashJuly 2015 with UCL RITS and a cohort offour humanities researchers (from doctoral candi-dates to mid-career scholars) to ask queries thatcould not be satisfied by search- and discovery-

orientated graphical user interfaces Working in col-laboration we turned their research questions intocomputational queries explored ways in which thereturned data could be visualized and capturedtheir thoughts on the process through semi-struc-tured interviews

4 Results

We successfully ran queries across the data set thattracked linguistic change identified core phrasesplotted the placement of illustrations and mappedlocations mentioned within core texts The semi-structured interviews conducted with non-compu-tationally trained humanities researchers at variousstages during the collaborative work supported fourkey findings First that breaking down a researchquestion into a series of more defined computa-tional queries was time-consuming and challengingSecondly that the iterative nature of this researchmethodology puts pressure on the time taken toexecute queries and that long processing timeswere frustrating Thirdly that full comprehensionof the programming code was not necessary to pro-cess data and use their outputs in research thoughunderstanding the inputs outputs and effects ofparameters was required Fourthly that creatingderived data sets of a size manageable by desktopPCs11 opened up further investigation using estab-lished methods Indeed we found that buildingqueries that generate derived data sets from large-scale digital collections (small enough to be workedon locally with familiar tools) is an effective meansof empowering non-computationally trained huma-nities researchers to develop the skill sets required toundertake complex analysis of humanities data12

Our case studies deepen and add nuance to thesefindings Two of our case studies were interested inlooking at instances of particular words or phrasesin the corpus (for example lsquoprofessorrsquo) or particu-lar combinations of phrases within the corpus(lsquohigher educationrsquo) to identify a particular institu-tion and group of persons across time The require-ments from the researchers were to return thecomplete page of text that surrounded each ex-ample This was found to be technically quite

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 3 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

straightforward and resulted in a text file being de-livered to the Humanities Scholars which they couldthen lsquoclose readrsquo to analyse each instance of thesearch term within a given page of the book in thecorpus Analysis in this case entails finding instancesof the search term in question however there arefurther possibilities that can interrogate the data setfurther in procedurally and methodologically novelways We present here two more ambitious casestudies that allowed for further visualization andanalysis

41 Case Study 1 history of medicineDuke-Williams is a senior lecturer in DigitalInformation Studies in the Department ofInformation Studies at UCL13 and his researchinterests include the presentation of spatial dataand dissemination of demographic data and thepast present and future of demographic data

capture in the UK Visualization of these kinds ofdata can be used to explore issues around the spreadof diseases and the research questions were howdoes the occurrence of diseases in published litera-ture compare to known epidemics in the 19th cen-tury Can we see any correlation between theoccurrence of infectious diseases in society and ref-erence to these diseases in both fiction and non-fiction

Variations in the number of mentions of cholera(Fig 1 continuous black line) were compared torecorded epidemics (shaded bars on Fig 1)A sharp rise in mentions coincides with the firstcholera epidemic in the UK of 1831ndash32 a similarbut less pronounced rise is coincident with the1848ndash49 epidemic A more volatile pattern of men-tions is observed after this point subsequent spikesmay be associated with epidemics within andbeyond the UK or may be less directly related to

0

05

1

15

2

25

3

1790 1810 1830 1850 1870 1890 1910

Num

ber

of re

fere

nces

per

100

0 w

ords

Date of Publication

Disease References

year_cholera_epidemics Whooping Cholera Consumpon Measles

Fig 1 A search for mentions of various infectious diseases (cholera whooping cough consumption and measles)across the 60000-book data set We compared the profound spikes for cholera in the data set with known dataregarding epidemics in the UK (Chadwick 1842 Wall 1893) which appear as the bars on the graph showing arelationship between the first major UK outbreak of cholera and its appearance within the written record of thetime (in 1831ndash32) and again with the second UK epidemic (1848ndash49) Later outbreaks (1853ndash54 and 1863) do notsee this same correlation There are further pronounced spikes for mentions of cholera in the1870s and 1880s these arenot associated with UK epidemics but there were outbreaks in the USA and elsewhere Identifying the texts that refer tothese outbreaks allows us to look more closely at these clusters and to understand the relationship between publichealth epidemiology and the published historical record

M Terras et al

4 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

disease incidence Identifying the range and type oftexts (whether epidemiological reports or worksaimed at a wider audience) may help to informand understand the cultural response to disease

This work opens up possibilities for our under-standing of trends in both fiction and non-fictionand could be linked into further data sets (for ex-ample of digitized historical newspaper data)In the case of our pilot project it demonstratedthat we could graph and visualize searches basedon the corpus to present overviews that wereuseful to our researcher but only in conjunctionwith both our research software engineer and ourinformation visualization expert this service thenmdashas a result of the person hours requiredmdashdoes notscale in practice demonstrating both the potentialin the data set and the current limited opportunitieshistorians epidemiologists and historians of sci-ence have to generate such visualizations fromopen-licenced data sets

42 Case Study 2 the history of imagesFinley is a doctoral candidate on the British Libraryand University of Sheffield Collaborative PhDStudentship lsquoThe Printed Image 1750ndash1850 towardsa Digital History of Printed Book Illustrationrsquo14

Between 1750 and 1850 changes in printing tech-nology enabled several kinds of image to proliferateand for image and text to be brought together innovel and unexpected ways Existing printing tech-nologiesmdashsuch as woodcutsmdashcontinued alongsidenew printing technologies shaping the dissemin-ation reuse and meaning of the designs they con-veyed (Stijnman 2012 Maidment 2013) Tounderstand these changes scholars have so farsampled small hand-crafted collections of imagesan approach repeated in the fields of art and culturalhistory (Donald 1996 Thomas 2004) Yet digitalsources allow us to study these changes with a muchlarger sample to use visual content as well as meta-data to grapple with past phenomena at scale

Finleyrsquos research focuses on the digital imagesfrom the same 60000-book data set our projectuses The research addresses questions such asHow did changes in image techniques and the sizeof images map onto the different genres over timeWhat do quantitative findings reveal about the

changing meanings of images from one genre tothe next How do the findings made possibleusing digital humanities techniques and digitalsources compare to those using traditional methodsand small hand-crafted collections

To support these research questions we queriedthe book XML to extract the coordinates of theboundary boxes put around each area the OCRprocess defined as an image The resulting deriveddata lists the title author place of publication andreference number for each book For each of the 1million images in these books the derived data liststhe page number it appears on x-position of thetop left corner y-position of the top left corner itswidth its height and its overall size as a percentageof the page We then took two approaches to turnFinleyrsquos research questions into computationalqueries

First we used the data derived from the HPC togenerate a graph of the instances of images by theirsize as a percentage of the page over time (Fig 2)This enabled Finley to observe the dominance of fullpage and very small images (lt15 of the page)between the 1750s and 1810s after which timemdashdriven by novel deployment of woodcuts andlithographs in booksmdashthe range of figure sizes

Fig 2 A search for figures between 1750 and 1850plotted according to the size of each figure in relationto the size of the page

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 5 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

diversified Although the graph is not normalized bythe number of images in the data set for each yearand is therefore dominated by the greater volume ofbooks in the data set after 1800 it has proved auseful reference and a new way into the macro pat-terns of book illustration

Secondly we wrote a script that could be runlocally in R to create graphs based on image datafor single books Plotting the page number on thex-axis and figures as a percentage of the page on they-axis the script generated a visual representation ofthe size of illustrations in a book Finley selectedbooks for analysis to observe patterns in the use ofillustrations in books on history geology and top-ography (the subjects of his doctoral research) Here(Fig 3) we see this for the 1817 A new and completeSystem of Modern Geography a two-volume workpublished in Newcastle upon Tyne by Mackenzie(1817) The discrepancies in use of illustrations be-tween the two volumes took Finley back to thephysical books to assess how the placement andsize of images changed the reading experience be-tween volumes and to compare the findings withsimilar multi-volume works

Subsequent to the project Finley has continuedto use the project data and scripts in his researchFor example he has used the image location data toplot and compare the average position images inbooks This has underscored the value of generatingderived data that can be used locally by a researcheroutside the context of a HPC facility and a fundedproject

5 Infrastructuralrecommendations

From a technical perspective this pilot highlightedvarious sticking points when using infrastructuredeveloped predominantly for scientific researchThe combined data input and output volumeundertaken during our work (less than 300 GB) isonly moderately large by comparison to the scien-tific data sets UCL RITS usually encounters for al-though there are shared assumptions betweenresearch infrastructures (adoption of technicalstandards and the sharing of tools approachesand research outputs (Wynne 2015)) most of theUKrsquos university eScience15 infrastructure has beenconstructed specifically to run scientific and engin-eering simulations not for search and analysis of thetypes of heterogeneous data sets we see emanatingfrom cultural heritage institutions We had a largetextual input (224 GB) a simple calculation and asmall output summary of only a few KB By com-parison the typical engineering simulationaddresses moderately sized numerical input dataruns a long complicated calculation and producesa large output (multiple TBs) The average data sizeof project using the UCL data storage service is44 TB (Hetherington 2017) For example thework of the UCL Centre for ComputationalScience16 on brain blood flow simulations takes aninput file of around 1 GB and for a full productionsimulation recording a snapshot once every 200time steps produces 20 TB of output (Groen etal 2013) Poor uptake in the Arts andHumanities (Atkins et al 2010 Voss et al 2010)has meant that these computational systems havenot been optimized for Arts and Humanities work-loads The file system and network configuration of

0

25

50

75

100

7505002500

Imag

e si

ze (

of p

age)

Page number

Illustration Area by Page for Book ID 002322267

002322267_1 002322267_2

Fig 3 A search for figures in A new and complete Systemof Modern Geography two volumes (Mackenzie 1817)plotted by page (x-axis) percentage of page the eachfigure occupies (y-axis) and separated by volume

M Terras et al

6 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

LegionmdashUCL RITSrsquos centrally funded resource forrunning complex and large computational scientificqueries across a large number of coresmdashdid notmatch the way that the data set in question wasstructured (a large number of small zipped XMLfiles)

The complexities associated with redeployingarchitectures designed to work with scientific data(massive yet very structured) to the processing ofhumanities data (not massive but more unstruc-tured) should not be understated and are a majorfinding of this project Relevant libraries (such as anefficient XML processor) were needed to be installedand optimized for the hardware Also the dataneeded to be transformed to a structure that the par-allel file system (Lustre) could address efficiently(that is fewer larger files) We found that the archi-tecture at UCL which was configured for effectivecompute of scientific data was inputoutput limitedfor our processing requirements rather than compu-tationally limited Understanding the needs of ouruser community has already fed into the procure-ment and development of HPC facilities at UCL toensure that the systemsmdashwhich are available to allresearchersmdashcan deal with the variety and type ofdata that digital humanists wish to analyse in future

Best practice recommendations for similar pro-jects emerged from this work the need to buildmultiple derived data sets (counts of books andwords per year words and pages per book etc) tonormalize results and maintain statistical validitythe necessity of documenting decisions taken whenprocessing data and metadata and the value ofhaving fixed definable data for researchers to ex-plain results in relation to (and in turn the risksassociated with iterating data sets) We also dis-covered that a core set of four or five queries gavemost of the humanities researchers the type of in-formation they required to take a subset of dataaway to process effectively themselves searches forall variants of a word searches that return keywordsin context traced over time NOT searches for aword or phrase that ignored another word orphrase searches for a word when in close proximityto a second word and searches based on imagemetadata It is the subset of the data set that mosthumanities scholars required and were happy to be

presented with for further analysis (with most re-searchers wishing to see their search term in contextpresented with the complete page of the text it wasfound within to allow informed understanding)

A main finding of this pilot was given mosthumanities researchers have a research problemthat can be facilitated by a standard set of queriesacross large-scale textual data that it would be moreefficient to train a focussed group of service pro-viders to be able to generate the results needed byresearchers than providing widespread training ofhumanities academics in this area HigherEducation already employs librarians to assist insearching and training for searching (informationliteracy) and providing this professional groupwith adjustable lsquorecipesrsquo for defined computationalqueries and background training on their use wouldsituate access to infrastructure in the resource towhich humanists already turn for assistancemdashtheirsubject librarianmdashand thereby normalize such com-putational work within the general humanitiesworkflow17 In turn research software engineerscould be invoked as collaborators for their expertisesuch as for developing more complex searchesbeyond the basic recipes rather than having torepeat the defined searches across data for differentresearchers which would allow limited resources tobe used efficiently and to build on existing frame-works of support from both the library and comput-ing services

Given issues in resourcing such facilities at everyUniversity it may be more efficient for multipleHigher Education Institutions to support a specialistservice perhaps under the umbrella of the likes ofJisc Historical Texts (httphistoricaltextsjiscacuk)or national or legal deposit libraries Expertise andapproaches if not the service itself could also befacilitated through the likes of Digital ResearchInfrastructure for the Arts and Humanities(DARIAH) (httpwwwdariaheu) and CommonLanguage Resources and Technology Infrastructure(CLARIN) (httpswwwclarineu) However localsupport for researchers wishing to utilize existingeScience technologies within the Higher Educationsector should be possible Such support enables Artsand Humanities researchers to develop ongoingmutually beneficial relationships with research

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 7 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

computing within their own institution This inturn can encourage other researchers to use theseresources (rather than them being only available as aspecialist service which users have to seek out)Research computing infrastructure across the uni-versity sector will not meet the needs of ArtsHumanities and Social Sciences researchers unlessacademics in these fields becomes active users ofthe systems and their requirements can be takeninto account going forward

6 Conclusion

We successfully mounted large-scale humanitiesdata on HPC University infrastructure in an inter-disciplinary project that required input from manyprofessionals to aid the humanities scholars intheir research tasks The collaborative approachwe undertook in this project is labour-intensiveand does not scale This should not however dis-courage the sector from taking this work forwardWe found that many research questions can beexpressed with similar computational queriesalbeit with parameters adjusted to suit We recom-mend therefore that Higher EducationInstitutions or HEI clusters looking to build cap-acity for enabling complex analysis of large-scaledigital collections by their non-computationallytrained humanities researchers should considerthe following activities

(1) Invest in research software engineer capacityto deploy and maintain openly licensed large-scale digital collections from across the GLAMsector to facilitate research in the arts huma-nities and social and historical sciences

(2) Invest in training library staff to run theseinitial queries in collaboration with huma-nities faculty to support work with subsetsof data that are produced and to documentand manage resulting code and derived data

Our pilot project demonstrates that there are atpresent too many technical hurdles for most indi-viduals in the arts and humanities to consider ana-lysing large-scale open data sets Those hurdles canbe removed with initial help in ingest and

deployment of the data and the provision of spe-cific structured training and support which willallow humanities researchers to get to a subset ofuseful data they can comfortably and more simplyprocess themselves without the need for extensivesupport While we together with our partners haveplans to continue expanding the range and depth ofresearch carried out on our chosen data set thisproject has signposted many of the barriers to en-courage greater uptake of lsquobig datarsquo research acrossthe Arts and Humanities These findings should beof use to researchers wishing to use comparableapproaches and to service providers in researchcomputing aiming to encourage the use of sharedcomputational facilities by the Arts and Humanitiescommunity

AcknowledgementsThis project was funded by a pilot grant from theJisc Research Data Spring (JISC 3548) between Apriland July 2015 See httpswwwjiscacukrdpro-jectsresearch-data-spring for further details Theauthors thank our colleagues in the British Libraryand UCL RITS (particularly Clare Gryce) for theirsupport Geoffrey Rockwell and Stefan Sinclair fordiscussions on the current limitations of text ana-lysis for individual humanities scholars and Scott BWeingart for his assistance

ReferencesAtkins D E Borgman C L Bindhoff N Ellisman

M Felman S Foster I and Heck A (2010)

lsquolsquoRCUK Review of e-Science 2009rsquorsquo Research Councils

UK httpswwwepsrcacuknewseventspubsrcuk-review-of-e-science-2009-building-a-uk-foundation-

for-the-transformative-enhancement-of-research-and-

innovation

Bates M J (1996) The Getty end-user online searching

project in the humanities report no 6 overview and

conclusions College and Research Libraries 57(6)514ndash23

Bedi S and Walde C (2016) Transforming roles

Canadian academic librarians embedded in faculty re-

search projects College and Research Libraries http

M Terras et al

8 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

crlacrlorgcontentearly20160322crl16-871

abstract

Bradigan P S and Mularski C A (1989) End-user

searching in a medical school curriculum an evaluated

modular approach Bulletin of the Medical Library

Association 77(4) 348ndash56

Brettle A Maden-Jenkins M Anderson L McNally

R Pratchett T Tancock J Thornton D and

Webb A (2011) Evaluating clinical librarian services

A systematic review Health Information and Libraries

Journal 28(1) 3ndash22

Burke J and Tumbleson B (2016) LMS embedded li-

brarianship and the educational role of librarians

Library Technology Reports 52(2) 5ndash9

Chadwick E (1842) Report on the sanitary condition of

the labouring population of Great Britain supplemen-

tary report on the results of special inquiry into the

practice of interment in towns (Vol 1) HM

Stationery Office

Donald D (1996) The Age of Caricature Satirical Prints

in the Reign of George III New Haven Published for the

Paul Mellon Centre for Studies in British Art by Yale

University Press

Farber M and Shoham S (2002) Users end-users

and end-user searchers of online information a his-

torical overview Online Information Review 26(2)

92ndash100

Feliu V and Frazer H (2012) Embedded librarians

teaching legal research as a lawyering skill Journal of

Legal Education 61(4) 540ndash59

Groen D Hetherington J Carver H B Nash R

W Bernabeu M O and Coveney P V (2013)

Analysing and modelling the performance of the

HemeLB lattice-Boltzmann simulation environ-

ment Journal of Computational Science 4(5) 412ndash

22

Hetherington J (2017) Question about data capacity

and use at UCL Response to Melissa Terras via email

27 January 2017

Hettrick S (2016) A not-so-brief history of Research

Software Engineers Software Sustainability Institute

httpswwwsoftwareacukblog2016-08-19-not-so-

brief-history-research-software-engineers

Huber M (2007) The Old Bailey Proceedings 1674-1834

Evaluating and annotating a corpus of 18th - and 19th-

century spoken English Studies in Variation Contacts and

Change in English 1 Annotating Variation and Change

httpwwwhelsinkifivariengseriesvolumes01huber

Hughes A (2009) Higher Education in a Web 20 WorldJISC httpwwwwebarchiveorgukwaybackarchive20140614042502httpwwwjiscacukpublicationsgeneralpublications2009heweb2aspx

Ikeda N R and Schwartz D G (1992) Impact ofend-user training on Pharmacy students a four-yearfollow-up study Bulletin of the Medical LibraryAssociation 80(2) 124ndash30

Lancaster F W Elzy C Zeter M J Metzler L andLow Y M (1994) Searching databases on CD-ROMcomparison of the results of end-user searching withresults from two modes of searching by skilled inter-mediaries RQ 33(3) 370ndash86

Leetaru K (2015) History as Big Data 500 Years of BookImages and Mapping Millions of Books Forbes Techhttpwwwforbescomsiteskalevleetaru20150916history-as-big-data-500-years-of-book-images-and-mapping-millions-of-books

Ludwig L Mixter J K and Emanuele M A (1988)User attitudes toward end-user literature searchingBulletin of the Medical Library Association 76(1)7ndash13

Lui A (2015) Data Collections and Datasets Curated byAlan Liu httpdhresourcesforprojectbuildingpbworkscomwpage69244469Data20Collections20and20Datasets

Mackenzie E (1817) A new and complete system ofmodern geography containing an accurate delineationof the world 2 Vols Printed and published byMackenzie and Dent Newcastle-upon-Tyne

Mahony S and Pierazzo E (2012) Teaching skills orteaching methodology In Hirsch B D (ed) DigitalHumanities Pedagogy Practices Principles and PoliticsOpen Book Publishers httpwwwopenbookpubl-isherscomproduct161digital-humanities-pedagogyndashpracticesndashprinciples-and-politics

Maidment B (2013) Comedy Caricature and the SocialOrder 1820-1850 Manchester Manchester UniversityPress

Markey K (2007a) Twenty-five years of end-user search-ing Part 1 research findings Journal of the AmericanSociety for Information Science and Technology 58(8)1071ndash81

Markey K (2007b) Twenty-five years of end-usersearching Part 2 future research directions Journal ofthe American Society for Information Science andTechnology 58(8) 1123ndash30

Murray T (2016)The forecast for special librariesJournal of Library Administration 56(2) 188ndash98

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 9 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

Rethlefsen M Farrell A Osterhaus Trzasko L and

Brigham T (2015) Librarian co-authors correlated

with higher quality reported search strategies in general

internal medicine systematic reviews Journal of Clinical

Epidemiology 68(6) 617ndash26

Rockwell G (2017) Question about maximum capacity

of Voyant tools Response to Melissa Terras via email 27

January 2017

Rockwell G and Sinclair S 2016 Hermeneutica

Computer-Assisted Interpretation in the Humanities

Cambridge MIT Press

Roper T (2015) The impact of the clinical librarian a

review Journal of EAHIL 11(4) 19ndash22

Schmidt B (2014) Shipping maps and how states see

Sapping Attention Blog httpsappingattentionblogspot

couk201403shipping-maps-and-how-states-seehtml

Sinclair S (2017) Question about maximum capacity of

Voyant tools Response to Melissa Terras via email 27

January 2017

Smith D A Cordell R and Dillon E M (2013)

Infectious texts modeling text reuse in nineteenth-

century newspapers In IEEE International Conference

on Big Data October 6-9 2013 Santa Clara

California IEEE pp 86ndash94 101109

BigData20136691675

Smith D Cordell R and Mullen A (2015)

Computational methods for uncovering reprinted

texts in Antebellum newspapers American Literary

History 27(3)

Starr S S and Renford B L (1987) Evaluation of a

program to teach health professionals to search

MEDLINE Bulletin of the Medical Library Association

75(3) 193ndash201

Stijnman A (2012) Engraving and Etching 1400-2000 A

History of the Development of Manual Intaglio

Printmaking Processes London Houten Netherlands

Archetype Publications in association with HES and

DE GRAAF Publishers

Straus S Eisinga A and Sackett D (2016) What

drove the evidence cart Bringing the library to the

bedside Journal of the Royal Society of Medicine

109(6) 241ndash7

Streipe T and Talley M (2013) Embedded librarian-

ship In Kroski E (ed) Law Librarianship in the Digital

Age Plymouth Scarecrow pp 13ndash30

Terras M (2009) Potentials and Problems in Applying High

Performance Computing for Research in the Arts and

Humanities Researching e-Science Analysis of Census

Holdings Digital Humanities Quarterly httpwwwdigi-

talhumanitiesorgdhqvol34000070000070html

Terras M (2015) Opening access to collections the

making and using of open digitised cultural content

Online Information Review 39(5) 733ndash52 httpwww

emeraldinsightcomdoifull101108OIR-06-2015-0193

Thomas J (2004) Pictorial Victorians The Inscription of

Values in Word and Image Athens Ohio University

Press

Voss A Asgari-Targhi M Procter R and

Fergusson D (2010) Adoption of e-Infrastructure

services configurations of practice Philosophical

Transactions Series A Mathematical Physical and

Engineering Sciences 368 4161ndash76 doi 101098

rsta20100162

Wall A J (1893) Asiatic cholera its history pathology

and modern treatment London H K Lewis

Wittek S Sinclair S and Milner M (2015) DREaM

Distant Reading Early Modernity Digital Humanities

2015 Global Digital Humanities Sydney Australia 29

Junendash3 July 2015 httpdh2015orgabstractsxml

WITTEK_Stephen_DREaM__Distant_Reading_Early_

ModerWITTEK_Stephen_DREaM__Distant_Reading_

Early_Modernityhtml

Wynne M (2013) The role of Clarin in digital trans-

formations in the humanities International Journal of

Humanities and Arts Computing 7(1-2) 89

Wynne M (2015) Big Data and Digital Transformations in

the Humanities are we there yet Textual Digital

Humanities and Social Sciences Data Aberdeen 21ndash22

September 2015 httpwwwslidesharenetmartinwynne

big-data-and-digital-transformations-in-the-humanities

Notes1 httpswwwbluksubjectsdigital-scholarship2 httpwwwuclacukdh3 httpwwwbartlettuclacukcasa4 httpwwwuclacukisdservicesresearch-it5 httpopenglamorg an initiative to promote free

and open access to digital cultural heritage data sets6 For example Mining the History of Medicine (http

nactemacukhom) or Language of the State of the

Union (httpwwwtheatlanticcompoliticsarchive

201501the-language-of-the-state-of-the-union

384575)7 For example CLARIN (httpclarineu) and

DARIAH (httpswwwdariaheu)

M Terras et al

10 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

8 The British Library has various digital data sets

including (but not limited to) 7 million pages of his-

toric newspapers 1 million out of copyright book il-

lustrations 100000s of scientific articles text from

over 60000 books 1000s of UK theses and various

digitized medieval manuscripts We chose here just

one of its large-scale data sets to work with in this

pilot phase For the terms under which the British

Library makes collections available see httpwww

blukaboutustermscopyright9 The books cover a wide range of subject areas includ-

ing philosophy history poetry and literature most

are in English For a full list of metadata for this book

collection see doi 1021250DB2110 httpswikircuclacukwikiLegion just one of the

HPC facilities available at UCL for researchers see

httpwwwuclacukisdservicesresearch-itresearch-

computing11 It is difficult to put a figure on what a manageable text

file size would be for a researcher if we take that as

being able to search through it on their own machine

without further technical assistance This is dependent

on access to processing power which varies consider-

ably between desktop and laptop computers and the

complexity of the search the researcher is intending to

carry out as well as the format and granularity of the

source text Most humanities researchers should now

be able to comfortably manipulate text files of around

100 MB if they have access to a modern machine the

amount of text those files will contain is dependent on

how complex the file formats are Programs and tools

available to assist in text analysis include Voyant Tools

(httpsvoyant-toolsorg) and R (httpswwwr-pro-

jectorg) The upper limit to using server-based

Voyant is 20 MB of text which should correlate to

around 1 million words depending on format

(Sinclair 2017) Voyant Server can be downloaded

and used locally with an upper limit of around

100 MB on a standard PCMac (Rockwell 2017)

Indexing engines such as Lucene (httplucene

apacheorg) can search larger amounts of text how-

ever visualization of results will require further pro-

cessing and can be complex For further exploration of

lsquocomputer-assisted interpretation in the humanitiesrsquo

(particularly using Voyant) see Rockwell and

Sinclair (2016) and an example of the approach of

generating manageable smaller corpora from a large-

scale Early Modern Sources can be found in Wittek et

al (2015)12 All code data visualizations and other outputs from

this pilot project are freely available at httpsgithub

comUCL-dataspring13 httpswwwuclacukdispeopleoliverdukewilliams14 httpswwwshefacukhistoryresearchstudentswil-

liam-finley15 For more on the UKrsquos eScience infrastructure see the

work of the eScience Institute httpwwwesiacuk

Plan-Europe is the Platform of National eScience

Centres in Europe (httpplan-europeeu) In the

USA the equivalent of eScience is known as

Cyberinfrastructure see the National Science

Foundationrsquos guides httpwwwnsfgovdivindex

jspdivfrac14ACI

16 httpccschemuclacuk17 This approach is similar to that employed in the 1980s

when computerized searching of specialist databases

was first available (Farber and Shoham 2002

Markey 2007a 2007b) At that time evaluations car-

ried out on programs to cascade train end users via

librarians reported increased user uptake and effi-

ciency as compared to direct end-user training by

database providers (see Starr and Renford 1987

Bradigan and Mularski 1989 Ikeda and Schwartz

1992 on the library-based training of medical staff to

use MEDLINE) Significantly studies showed that

while end users wanted to be able to search databases

themselves they also desired access to searches

mediated by librarians andor a team approach to

search utilizing the end userrsquos specific knowledge of

the language and jargon of their discipline and the

information professionalrsquos understanding of

BOOLEAN and other advanced search techniques

(Ludwig et al 1988 Lancaster et al 1994 Bates

1996) The move towards embedded librarians in

law practices (Feliu and Frazer 2012 Streipe and

Talley 2013 Murray 2016) and clinical librarians in

hospitals (Brettle et al 2011 Roper 2015 Straus et

al 2016) can be seen to be a modern iteration of a

team approach to search In Higher Education subject

librarians are well-positioned to form part of a research

team using computational methods andor provide

information literacy training that upskills humanists

in computational methods (Rethlefsen et al 2015

Bedi and Walde 2016 Burke and Tumbleson 2016)

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 11 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

Page 3: Edinburgh Research Explorer · Williams, O & Farquhar, A 2017, 'Enabling complex analysis of large-scale digital collections: Humanities research, high-performance computing, and

AbstractAlthough there has been a drive in the cultural heritage sector to provide large-scale open data sets for researchers we have not seen a commensurate rise inhumanities researchers undertaking complex analysis of these data sets for theirown research purposes This article reports on a pilot project at UniversityCollege London working in collaboration with the British Library to scopeout how best high-performance computing facilities can be used to facilitatethe needs of researchers in the humanities Using institutional data-processingframeworks routinely used to support scientific research we assisted four huma-nities researchers in analysing 60000 digitized books and we present two result-ing case studies here This research allowed us to identify infrastructural andprocedural barriers and make recommendations on resource allocation to bestsupport non-computational researchers in undertaking lsquobig datarsquo research Werecommend that research software engineer capacity can be most efficiently de-ployed in maintaining and supporting data sets while librarians can provide anessential service in running initial routine queries for humanities scholars Atpresent there are too many technical hurdles for most individuals in the huma-nities to consider analysing at scale these increasingly available open data setsand by building on existing frameworks of support from research computing andlibrary services we can best support humanities scholars in developing methodsand approaches to take advantage of these research opportunities

1 Introduction

How best can humanities researchers access andanalyse large-scale digital data sets available frominstitutions in the cultural and heritage sectorWhat barriers remain in place for those from thehumanities wishing to use high-performance com-puting (HPC) to provide insights into historicaldata sets using lsquobig-datarsquo analytical techniquesThis article describes a pilot project that workedin collaboration with non-computationally trainedhumanities researchers to identify and overcomebarriers to complex analysis of large-scale digitalcollections It used institutional university frame-works that routinely support the processing oflarge-scale data sets for research purposes in thesciences The project brought together humanitiesresearchers research software engineers (Hettrick2016) and information professionals from theBritish Library Digital Scholarship Department1

University College London (UCL) Centre forDigital Humanities2 UCL Centre for AdvancedSpatial Analysis3 and UCL Research IT Services(UCL RITS)4 to analyse an open-licenced large-scale data set from the British Library While

useful research results were generated undertakingthis project clarified the technical and proceduralbarriers that exist when humanities researchers at-tempt to utilize computational research infrastruc-tures in the pursuit of their own research questions

2 Overview

The drive in the gallery library archive andmuseum (GLAM) sector towards opening up collec-tions data5 as well as the growth in data publishedby publicly funded research projects means huma-nities researchers have a wealth of large-scale digitalcollections available to them (Lui 2015 Terras2015) Many of these data sets are released underopen licences that permit uninhibited use by anyonewith an Internet collection and modest storage cap-acity A few humanities researchers have exploitedthese resources and their interpretations makeclaims that change our understanding of culturalphenomena (Smith et al 2013 Schmidt 2014Smith et al 2015 Huber 2007 Leetaru 2015)Nevertheless there remain major barriers to thewidespread uptake of these data sets and related

Correspondence

Melissa Terras UCL

Department of Information

Studies Foster Court

University College London

Gower Street London

WC1E 6BT UK

E-mail

mterrasuclacuk

M Terras et al

2 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

computational approaches by humanities re-searchers which risks diminishing the relevance ofthe humanities in lsquobig datarsquo analysis (Wynne 2015)These barriers include

fragmentation of communities resources andtools

lack of interoperability complexity and incompleteness of heterogeneous

cultural heritage data sets (Terras 2009) and lack of technical skills lsquomainstream researchers

in the humanities and social sciences often donrsquotknow what the new possibilities arersquo (ibid) andseldom have the technical experience to experi-ment (Hughes 2009 Mahony and Pierazzo2012)

A common response to this lack of awarenessand computational skills is to build Web-basedinterfaces to data6 or federated services and infra-structures7 While these interfaces play a positiverole in introducing humanities researchers tolarge-scale digital collections they rarely fulfil thecomplex needs of humanities research which con-stantly questions received approaches and results orallow researchers to tailor analysis without beinglimited by shared assumptions and methods(Wynne 2013)

3 Method

We explored the challenges associated with deploy-ing and working with large-scale digital collectionssuitable for humanities research using a publicdomain digital collection provided by the BritishLibrary8 This circa 60000-book data set covers fic-tion and non-fiction publications from the 17th18th and 19th centuries ormdashseen as datamdash224 GB of compressed ALTO XML that includesboth content (captured using an Optical CharacterRecognition (OCR) process) and the location ofthat content on a page9 Using UCLrsquos centrallyfunded computing facilities10 we worked fromMarchndashJuly 2015 with UCL RITS and a cohort offour humanities researchers (from doctoral candi-dates to mid-career scholars) to ask queries thatcould not be satisfied by search- and discovery-

orientated graphical user interfaces Working in col-laboration we turned their research questions intocomputational queries explored ways in which thereturned data could be visualized and capturedtheir thoughts on the process through semi-struc-tured interviews

4 Results

We successfully ran queries across the data set thattracked linguistic change identified core phrasesplotted the placement of illustrations and mappedlocations mentioned within core texts The semi-structured interviews conducted with non-compu-tationally trained humanities researchers at variousstages during the collaborative work supported fourkey findings First that breaking down a researchquestion into a series of more defined computa-tional queries was time-consuming and challengingSecondly that the iterative nature of this researchmethodology puts pressure on the time taken toexecute queries and that long processing timeswere frustrating Thirdly that full comprehensionof the programming code was not necessary to pro-cess data and use their outputs in research thoughunderstanding the inputs outputs and effects ofparameters was required Fourthly that creatingderived data sets of a size manageable by desktopPCs11 opened up further investigation using estab-lished methods Indeed we found that buildingqueries that generate derived data sets from large-scale digital collections (small enough to be workedon locally with familiar tools) is an effective meansof empowering non-computationally trained huma-nities researchers to develop the skill sets required toundertake complex analysis of humanities data12

Our case studies deepen and add nuance to thesefindings Two of our case studies were interested inlooking at instances of particular words or phrasesin the corpus (for example lsquoprofessorrsquo) or particu-lar combinations of phrases within the corpus(lsquohigher educationrsquo) to identify a particular institu-tion and group of persons across time The require-ments from the researchers were to return thecomplete page of text that surrounded each ex-ample This was found to be technically quite

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 3 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

straightforward and resulted in a text file being de-livered to the Humanities Scholars which they couldthen lsquoclose readrsquo to analyse each instance of thesearch term within a given page of the book in thecorpus Analysis in this case entails finding instancesof the search term in question however there arefurther possibilities that can interrogate the data setfurther in procedurally and methodologically novelways We present here two more ambitious casestudies that allowed for further visualization andanalysis

41 Case Study 1 history of medicineDuke-Williams is a senior lecturer in DigitalInformation Studies in the Department ofInformation Studies at UCL13 and his researchinterests include the presentation of spatial dataand dissemination of demographic data and thepast present and future of demographic data

capture in the UK Visualization of these kinds ofdata can be used to explore issues around the spreadof diseases and the research questions were howdoes the occurrence of diseases in published litera-ture compare to known epidemics in the 19th cen-tury Can we see any correlation between theoccurrence of infectious diseases in society and ref-erence to these diseases in both fiction and non-fiction

Variations in the number of mentions of cholera(Fig 1 continuous black line) were compared torecorded epidemics (shaded bars on Fig 1)A sharp rise in mentions coincides with the firstcholera epidemic in the UK of 1831ndash32 a similarbut less pronounced rise is coincident with the1848ndash49 epidemic A more volatile pattern of men-tions is observed after this point subsequent spikesmay be associated with epidemics within andbeyond the UK or may be less directly related to

0

05

1

15

2

25

3

1790 1810 1830 1850 1870 1890 1910

Num

ber

of re

fere

nces

per

100

0 w

ords

Date of Publication

Disease References

year_cholera_epidemics Whooping Cholera Consumpon Measles

Fig 1 A search for mentions of various infectious diseases (cholera whooping cough consumption and measles)across the 60000-book data set We compared the profound spikes for cholera in the data set with known dataregarding epidemics in the UK (Chadwick 1842 Wall 1893) which appear as the bars on the graph showing arelationship between the first major UK outbreak of cholera and its appearance within the written record of thetime (in 1831ndash32) and again with the second UK epidemic (1848ndash49) Later outbreaks (1853ndash54 and 1863) do notsee this same correlation There are further pronounced spikes for mentions of cholera in the1870s and 1880s these arenot associated with UK epidemics but there were outbreaks in the USA and elsewhere Identifying the texts that refer tothese outbreaks allows us to look more closely at these clusters and to understand the relationship between publichealth epidemiology and the published historical record

M Terras et al

4 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

disease incidence Identifying the range and type oftexts (whether epidemiological reports or worksaimed at a wider audience) may help to informand understand the cultural response to disease

This work opens up possibilities for our under-standing of trends in both fiction and non-fictionand could be linked into further data sets (for ex-ample of digitized historical newspaper data)In the case of our pilot project it demonstratedthat we could graph and visualize searches basedon the corpus to present overviews that wereuseful to our researcher but only in conjunctionwith both our research software engineer and ourinformation visualization expert this service thenmdashas a result of the person hours requiredmdashdoes notscale in practice demonstrating both the potentialin the data set and the current limited opportunitieshistorians epidemiologists and historians of sci-ence have to generate such visualizations fromopen-licenced data sets

42 Case Study 2 the history of imagesFinley is a doctoral candidate on the British Libraryand University of Sheffield Collaborative PhDStudentship lsquoThe Printed Image 1750ndash1850 towardsa Digital History of Printed Book Illustrationrsquo14

Between 1750 and 1850 changes in printing tech-nology enabled several kinds of image to proliferateand for image and text to be brought together innovel and unexpected ways Existing printing tech-nologiesmdashsuch as woodcutsmdashcontinued alongsidenew printing technologies shaping the dissemin-ation reuse and meaning of the designs they con-veyed (Stijnman 2012 Maidment 2013) Tounderstand these changes scholars have so farsampled small hand-crafted collections of imagesan approach repeated in the fields of art and culturalhistory (Donald 1996 Thomas 2004) Yet digitalsources allow us to study these changes with a muchlarger sample to use visual content as well as meta-data to grapple with past phenomena at scale

Finleyrsquos research focuses on the digital imagesfrom the same 60000-book data set our projectuses The research addresses questions such asHow did changes in image techniques and the sizeof images map onto the different genres over timeWhat do quantitative findings reveal about the

changing meanings of images from one genre tothe next How do the findings made possibleusing digital humanities techniques and digitalsources compare to those using traditional methodsand small hand-crafted collections

To support these research questions we queriedthe book XML to extract the coordinates of theboundary boxes put around each area the OCRprocess defined as an image The resulting deriveddata lists the title author place of publication andreference number for each book For each of the 1million images in these books the derived data liststhe page number it appears on x-position of thetop left corner y-position of the top left corner itswidth its height and its overall size as a percentageof the page We then took two approaches to turnFinleyrsquos research questions into computationalqueries

First we used the data derived from the HPC togenerate a graph of the instances of images by theirsize as a percentage of the page over time (Fig 2)This enabled Finley to observe the dominance of fullpage and very small images (lt15 of the page)between the 1750s and 1810s after which timemdashdriven by novel deployment of woodcuts andlithographs in booksmdashthe range of figure sizes

Fig 2 A search for figures between 1750 and 1850plotted according to the size of each figure in relationto the size of the page

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 5 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

diversified Although the graph is not normalized bythe number of images in the data set for each yearand is therefore dominated by the greater volume ofbooks in the data set after 1800 it has proved auseful reference and a new way into the macro pat-terns of book illustration

Secondly we wrote a script that could be runlocally in R to create graphs based on image datafor single books Plotting the page number on thex-axis and figures as a percentage of the page on they-axis the script generated a visual representation ofthe size of illustrations in a book Finley selectedbooks for analysis to observe patterns in the use ofillustrations in books on history geology and top-ography (the subjects of his doctoral research) Here(Fig 3) we see this for the 1817 A new and completeSystem of Modern Geography a two-volume workpublished in Newcastle upon Tyne by Mackenzie(1817) The discrepancies in use of illustrations be-tween the two volumes took Finley back to thephysical books to assess how the placement andsize of images changed the reading experience be-tween volumes and to compare the findings withsimilar multi-volume works

Subsequent to the project Finley has continuedto use the project data and scripts in his researchFor example he has used the image location data toplot and compare the average position images inbooks This has underscored the value of generatingderived data that can be used locally by a researcheroutside the context of a HPC facility and a fundedproject

5 Infrastructuralrecommendations

From a technical perspective this pilot highlightedvarious sticking points when using infrastructuredeveloped predominantly for scientific researchThe combined data input and output volumeundertaken during our work (less than 300 GB) isonly moderately large by comparison to the scien-tific data sets UCL RITS usually encounters for al-though there are shared assumptions betweenresearch infrastructures (adoption of technicalstandards and the sharing of tools approachesand research outputs (Wynne 2015)) most of theUKrsquos university eScience15 infrastructure has beenconstructed specifically to run scientific and engin-eering simulations not for search and analysis of thetypes of heterogeneous data sets we see emanatingfrom cultural heritage institutions We had a largetextual input (224 GB) a simple calculation and asmall output summary of only a few KB By com-parison the typical engineering simulationaddresses moderately sized numerical input dataruns a long complicated calculation and producesa large output (multiple TBs) The average data sizeof project using the UCL data storage service is44 TB (Hetherington 2017) For example thework of the UCL Centre for ComputationalScience16 on brain blood flow simulations takes aninput file of around 1 GB and for a full productionsimulation recording a snapshot once every 200time steps produces 20 TB of output (Groen etal 2013) Poor uptake in the Arts andHumanities (Atkins et al 2010 Voss et al 2010)has meant that these computational systems havenot been optimized for Arts and Humanities work-loads The file system and network configuration of

0

25

50

75

100

7505002500

Imag

e si

ze (

of p

age)

Page number

Illustration Area by Page for Book ID 002322267

002322267_1 002322267_2

Fig 3 A search for figures in A new and complete Systemof Modern Geography two volumes (Mackenzie 1817)plotted by page (x-axis) percentage of page the eachfigure occupies (y-axis) and separated by volume

M Terras et al

6 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

LegionmdashUCL RITSrsquos centrally funded resource forrunning complex and large computational scientificqueries across a large number of coresmdashdid notmatch the way that the data set in question wasstructured (a large number of small zipped XMLfiles)

The complexities associated with redeployingarchitectures designed to work with scientific data(massive yet very structured) to the processing ofhumanities data (not massive but more unstruc-tured) should not be understated and are a majorfinding of this project Relevant libraries (such as anefficient XML processor) were needed to be installedand optimized for the hardware Also the dataneeded to be transformed to a structure that the par-allel file system (Lustre) could address efficiently(that is fewer larger files) We found that the archi-tecture at UCL which was configured for effectivecompute of scientific data was inputoutput limitedfor our processing requirements rather than compu-tationally limited Understanding the needs of ouruser community has already fed into the procure-ment and development of HPC facilities at UCL toensure that the systemsmdashwhich are available to allresearchersmdashcan deal with the variety and type ofdata that digital humanists wish to analyse in future

Best practice recommendations for similar pro-jects emerged from this work the need to buildmultiple derived data sets (counts of books andwords per year words and pages per book etc) tonormalize results and maintain statistical validitythe necessity of documenting decisions taken whenprocessing data and metadata and the value ofhaving fixed definable data for researchers to ex-plain results in relation to (and in turn the risksassociated with iterating data sets) We also dis-covered that a core set of four or five queries gavemost of the humanities researchers the type of in-formation they required to take a subset of dataaway to process effectively themselves searches forall variants of a word searches that return keywordsin context traced over time NOT searches for aword or phrase that ignored another word orphrase searches for a word when in close proximityto a second word and searches based on imagemetadata It is the subset of the data set that mosthumanities scholars required and were happy to be

presented with for further analysis (with most re-searchers wishing to see their search term in contextpresented with the complete page of the text it wasfound within to allow informed understanding)

A main finding of this pilot was given mosthumanities researchers have a research problemthat can be facilitated by a standard set of queriesacross large-scale textual data that it would be moreefficient to train a focussed group of service pro-viders to be able to generate the results needed byresearchers than providing widespread training ofhumanities academics in this area HigherEducation already employs librarians to assist insearching and training for searching (informationliteracy) and providing this professional groupwith adjustable lsquorecipesrsquo for defined computationalqueries and background training on their use wouldsituate access to infrastructure in the resource towhich humanists already turn for assistancemdashtheirsubject librarianmdashand thereby normalize such com-putational work within the general humanitiesworkflow17 In turn research software engineerscould be invoked as collaborators for their expertisesuch as for developing more complex searchesbeyond the basic recipes rather than having torepeat the defined searches across data for differentresearchers which would allow limited resources tobe used efficiently and to build on existing frame-works of support from both the library and comput-ing services

Given issues in resourcing such facilities at everyUniversity it may be more efficient for multipleHigher Education Institutions to support a specialistservice perhaps under the umbrella of the likes ofJisc Historical Texts (httphistoricaltextsjiscacuk)or national or legal deposit libraries Expertise andapproaches if not the service itself could also befacilitated through the likes of Digital ResearchInfrastructure for the Arts and Humanities(DARIAH) (httpwwwdariaheu) and CommonLanguage Resources and Technology Infrastructure(CLARIN) (httpswwwclarineu) However localsupport for researchers wishing to utilize existingeScience technologies within the Higher Educationsector should be possible Such support enables Artsand Humanities researchers to develop ongoingmutually beneficial relationships with research

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 7 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

computing within their own institution This inturn can encourage other researchers to use theseresources (rather than them being only available as aspecialist service which users have to seek out)Research computing infrastructure across the uni-versity sector will not meet the needs of ArtsHumanities and Social Sciences researchers unlessacademics in these fields becomes active users ofthe systems and their requirements can be takeninto account going forward

6 Conclusion

We successfully mounted large-scale humanitiesdata on HPC University infrastructure in an inter-disciplinary project that required input from manyprofessionals to aid the humanities scholars intheir research tasks The collaborative approachwe undertook in this project is labour-intensiveand does not scale This should not however dis-courage the sector from taking this work forwardWe found that many research questions can beexpressed with similar computational queriesalbeit with parameters adjusted to suit We recom-mend therefore that Higher EducationInstitutions or HEI clusters looking to build cap-acity for enabling complex analysis of large-scaledigital collections by their non-computationallytrained humanities researchers should considerthe following activities

(1) Invest in research software engineer capacityto deploy and maintain openly licensed large-scale digital collections from across the GLAMsector to facilitate research in the arts huma-nities and social and historical sciences

(2) Invest in training library staff to run theseinitial queries in collaboration with huma-nities faculty to support work with subsetsof data that are produced and to documentand manage resulting code and derived data

Our pilot project demonstrates that there are atpresent too many technical hurdles for most indi-viduals in the arts and humanities to consider ana-lysing large-scale open data sets Those hurdles canbe removed with initial help in ingest and

deployment of the data and the provision of spe-cific structured training and support which willallow humanities researchers to get to a subset ofuseful data they can comfortably and more simplyprocess themselves without the need for extensivesupport While we together with our partners haveplans to continue expanding the range and depth ofresearch carried out on our chosen data set thisproject has signposted many of the barriers to en-courage greater uptake of lsquobig datarsquo research acrossthe Arts and Humanities These findings should beof use to researchers wishing to use comparableapproaches and to service providers in researchcomputing aiming to encourage the use of sharedcomputational facilities by the Arts and Humanitiescommunity

AcknowledgementsThis project was funded by a pilot grant from theJisc Research Data Spring (JISC 3548) between Apriland July 2015 See httpswwwjiscacukrdpro-jectsresearch-data-spring for further details Theauthors thank our colleagues in the British Libraryand UCL RITS (particularly Clare Gryce) for theirsupport Geoffrey Rockwell and Stefan Sinclair fordiscussions on the current limitations of text ana-lysis for individual humanities scholars and Scott BWeingart for his assistance

ReferencesAtkins D E Borgman C L Bindhoff N Ellisman

M Felman S Foster I and Heck A (2010)

lsquolsquoRCUK Review of e-Science 2009rsquorsquo Research Councils

UK httpswwwepsrcacuknewseventspubsrcuk-review-of-e-science-2009-building-a-uk-foundation-

for-the-transformative-enhancement-of-research-and-

innovation

Bates M J (1996) The Getty end-user online searching

project in the humanities report no 6 overview and

conclusions College and Research Libraries 57(6)514ndash23

Bedi S and Walde C (2016) Transforming roles

Canadian academic librarians embedded in faculty re-

search projects College and Research Libraries http

M Terras et al

8 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

crlacrlorgcontentearly20160322crl16-871

abstract

Bradigan P S and Mularski C A (1989) End-user

searching in a medical school curriculum an evaluated

modular approach Bulletin of the Medical Library

Association 77(4) 348ndash56

Brettle A Maden-Jenkins M Anderson L McNally

R Pratchett T Tancock J Thornton D and

Webb A (2011) Evaluating clinical librarian services

A systematic review Health Information and Libraries

Journal 28(1) 3ndash22

Burke J and Tumbleson B (2016) LMS embedded li-

brarianship and the educational role of librarians

Library Technology Reports 52(2) 5ndash9

Chadwick E (1842) Report on the sanitary condition of

the labouring population of Great Britain supplemen-

tary report on the results of special inquiry into the

practice of interment in towns (Vol 1) HM

Stationery Office

Donald D (1996) The Age of Caricature Satirical Prints

in the Reign of George III New Haven Published for the

Paul Mellon Centre for Studies in British Art by Yale

University Press

Farber M and Shoham S (2002) Users end-users

and end-user searchers of online information a his-

torical overview Online Information Review 26(2)

92ndash100

Feliu V and Frazer H (2012) Embedded librarians

teaching legal research as a lawyering skill Journal of

Legal Education 61(4) 540ndash59

Groen D Hetherington J Carver H B Nash R

W Bernabeu M O and Coveney P V (2013)

Analysing and modelling the performance of the

HemeLB lattice-Boltzmann simulation environ-

ment Journal of Computational Science 4(5) 412ndash

22

Hetherington J (2017) Question about data capacity

and use at UCL Response to Melissa Terras via email

27 January 2017

Hettrick S (2016) A not-so-brief history of Research

Software Engineers Software Sustainability Institute

httpswwwsoftwareacukblog2016-08-19-not-so-

brief-history-research-software-engineers

Huber M (2007) The Old Bailey Proceedings 1674-1834

Evaluating and annotating a corpus of 18th - and 19th-

century spoken English Studies in Variation Contacts and

Change in English 1 Annotating Variation and Change

httpwwwhelsinkifivariengseriesvolumes01huber

Hughes A (2009) Higher Education in a Web 20 WorldJISC httpwwwwebarchiveorgukwaybackarchive20140614042502httpwwwjiscacukpublicationsgeneralpublications2009heweb2aspx

Ikeda N R and Schwartz D G (1992) Impact ofend-user training on Pharmacy students a four-yearfollow-up study Bulletin of the Medical LibraryAssociation 80(2) 124ndash30

Lancaster F W Elzy C Zeter M J Metzler L andLow Y M (1994) Searching databases on CD-ROMcomparison of the results of end-user searching withresults from two modes of searching by skilled inter-mediaries RQ 33(3) 370ndash86

Leetaru K (2015) History as Big Data 500 Years of BookImages and Mapping Millions of Books Forbes Techhttpwwwforbescomsiteskalevleetaru20150916history-as-big-data-500-years-of-book-images-and-mapping-millions-of-books

Ludwig L Mixter J K and Emanuele M A (1988)User attitudes toward end-user literature searchingBulletin of the Medical Library Association 76(1)7ndash13

Lui A (2015) Data Collections and Datasets Curated byAlan Liu httpdhresourcesforprojectbuildingpbworkscomwpage69244469Data20Collections20and20Datasets

Mackenzie E (1817) A new and complete system ofmodern geography containing an accurate delineationof the world 2 Vols Printed and published byMackenzie and Dent Newcastle-upon-Tyne

Mahony S and Pierazzo E (2012) Teaching skills orteaching methodology In Hirsch B D (ed) DigitalHumanities Pedagogy Practices Principles and PoliticsOpen Book Publishers httpwwwopenbookpubl-isherscomproduct161digital-humanities-pedagogyndashpracticesndashprinciples-and-politics

Maidment B (2013) Comedy Caricature and the SocialOrder 1820-1850 Manchester Manchester UniversityPress

Markey K (2007a) Twenty-five years of end-user search-ing Part 1 research findings Journal of the AmericanSociety for Information Science and Technology 58(8)1071ndash81

Markey K (2007b) Twenty-five years of end-usersearching Part 2 future research directions Journal ofthe American Society for Information Science andTechnology 58(8) 1123ndash30

Murray T (2016)The forecast for special librariesJournal of Library Administration 56(2) 188ndash98

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 9 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

Rethlefsen M Farrell A Osterhaus Trzasko L and

Brigham T (2015) Librarian co-authors correlated

with higher quality reported search strategies in general

internal medicine systematic reviews Journal of Clinical

Epidemiology 68(6) 617ndash26

Rockwell G (2017) Question about maximum capacity

of Voyant tools Response to Melissa Terras via email 27

January 2017

Rockwell G and Sinclair S 2016 Hermeneutica

Computer-Assisted Interpretation in the Humanities

Cambridge MIT Press

Roper T (2015) The impact of the clinical librarian a

review Journal of EAHIL 11(4) 19ndash22

Schmidt B (2014) Shipping maps and how states see

Sapping Attention Blog httpsappingattentionblogspot

couk201403shipping-maps-and-how-states-seehtml

Sinclair S (2017) Question about maximum capacity of

Voyant tools Response to Melissa Terras via email 27

January 2017

Smith D A Cordell R and Dillon E M (2013)

Infectious texts modeling text reuse in nineteenth-

century newspapers In IEEE International Conference

on Big Data October 6-9 2013 Santa Clara

California IEEE pp 86ndash94 101109

BigData20136691675

Smith D Cordell R and Mullen A (2015)

Computational methods for uncovering reprinted

texts in Antebellum newspapers American Literary

History 27(3)

Starr S S and Renford B L (1987) Evaluation of a

program to teach health professionals to search

MEDLINE Bulletin of the Medical Library Association

75(3) 193ndash201

Stijnman A (2012) Engraving and Etching 1400-2000 A

History of the Development of Manual Intaglio

Printmaking Processes London Houten Netherlands

Archetype Publications in association with HES and

DE GRAAF Publishers

Straus S Eisinga A and Sackett D (2016) What

drove the evidence cart Bringing the library to the

bedside Journal of the Royal Society of Medicine

109(6) 241ndash7

Streipe T and Talley M (2013) Embedded librarian-

ship In Kroski E (ed) Law Librarianship in the Digital

Age Plymouth Scarecrow pp 13ndash30

Terras M (2009) Potentials and Problems in Applying High

Performance Computing for Research in the Arts and

Humanities Researching e-Science Analysis of Census

Holdings Digital Humanities Quarterly httpwwwdigi-

talhumanitiesorgdhqvol34000070000070html

Terras M (2015) Opening access to collections the

making and using of open digitised cultural content

Online Information Review 39(5) 733ndash52 httpwww

emeraldinsightcomdoifull101108OIR-06-2015-0193

Thomas J (2004) Pictorial Victorians The Inscription of

Values in Word and Image Athens Ohio University

Press

Voss A Asgari-Targhi M Procter R and

Fergusson D (2010) Adoption of e-Infrastructure

services configurations of practice Philosophical

Transactions Series A Mathematical Physical and

Engineering Sciences 368 4161ndash76 doi 101098

rsta20100162

Wall A J (1893) Asiatic cholera its history pathology

and modern treatment London H K Lewis

Wittek S Sinclair S and Milner M (2015) DREaM

Distant Reading Early Modernity Digital Humanities

2015 Global Digital Humanities Sydney Australia 29

Junendash3 July 2015 httpdh2015orgabstractsxml

WITTEK_Stephen_DREaM__Distant_Reading_Early_

ModerWITTEK_Stephen_DREaM__Distant_Reading_

Early_Modernityhtml

Wynne M (2013) The role of Clarin in digital trans-

formations in the humanities International Journal of

Humanities and Arts Computing 7(1-2) 89

Wynne M (2015) Big Data and Digital Transformations in

the Humanities are we there yet Textual Digital

Humanities and Social Sciences Data Aberdeen 21ndash22

September 2015 httpwwwslidesharenetmartinwynne

big-data-and-digital-transformations-in-the-humanities

Notes1 httpswwwbluksubjectsdigital-scholarship2 httpwwwuclacukdh3 httpwwwbartlettuclacukcasa4 httpwwwuclacukisdservicesresearch-it5 httpopenglamorg an initiative to promote free

and open access to digital cultural heritage data sets6 For example Mining the History of Medicine (http

nactemacukhom) or Language of the State of the

Union (httpwwwtheatlanticcompoliticsarchive

201501the-language-of-the-state-of-the-union

384575)7 For example CLARIN (httpclarineu) and

DARIAH (httpswwwdariaheu)

M Terras et al

10 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

8 The British Library has various digital data sets

including (but not limited to) 7 million pages of his-

toric newspapers 1 million out of copyright book il-

lustrations 100000s of scientific articles text from

over 60000 books 1000s of UK theses and various

digitized medieval manuscripts We chose here just

one of its large-scale data sets to work with in this

pilot phase For the terms under which the British

Library makes collections available see httpwww

blukaboutustermscopyright9 The books cover a wide range of subject areas includ-

ing philosophy history poetry and literature most

are in English For a full list of metadata for this book

collection see doi 1021250DB2110 httpswikircuclacukwikiLegion just one of the

HPC facilities available at UCL for researchers see

httpwwwuclacukisdservicesresearch-itresearch-

computing11 It is difficult to put a figure on what a manageable text

file size would be for a researcher if we take that as

being able to search through it on their own machine

without further technical assistance This is dependent

on access to processing power which varies consider-

ably between desktop and laptop computers and the

complexity of the search the researcher is intending to

carry out as well as the format and granularity of the

source text Most humanities researchers should now

be able to comfortably manipulate text files of around

100 MB if they have access to a modern machine the

amount of text those files will contain is dependent on

how complex the file formats are Programs and tools

available to assist in text analysis include Voyant Tools

(httpsvoyant-toolsorg) and R (httpswwwr-pro-

jectorg) The upper limit to using server-based

Voyant is 20 MB of text which should correlate to

around 1 million words depending on format

(Sinclair 2017) Voyant Server can be downloaded

and used locally with an upper limit of around

100 MB on a standard PCMac (Rockwell 2017)

Indexing engines such as Lucene (httplucene

apacheorg) can search larger amounts of text how-

ever visualization of results will require further pro-

cessing and can be complex For further exploration of

lsquocomputer-assisted interpretation in the humanitiesrsquo

(particularly using Voyant) see Rockwell and

Sinclair (2016) and an example of the approach of

generating manageable smaller corpora from a large-

scale Early Modern Sources can be found in Wittek et

al (2015)12 All code data visualizations and other outputs from

this pilot project are freely available at httpsgithub

comUCL-dataspring13 httpswwwuclacukdispeopleoliverdukewilliams14 httpswwwshefacukhistoryresearchstudentswil-

liam-finley15 For more on the UKrsquos eScience infrastructure see the

work of the eScience Institute httpwwwesiacuk

Plan-Europe is the Platform of National eScience

Centres in Europe (httpplan-europeeu) In the

USA the equivalent of eScience is known as

Cyberinfrastructure see the National Science

Foundationrsquos guides httpwwwnsfgovdivindex

jspdivfrac14ACI

16 httpccschemuclacuk17 This approach is similar to that employed in the 1980s

when computerized searching of specialist databases

was first available (Farber and Shoham 2002

Markey 2007a 2007b) At that time evaluations car-

ried out on programs to cascade train end users via

librarians reported increased user uptake and effi-

ciency as compared to direct end-user training by

database providers (see Starr and Renford 1987

Bradigan and Mularski 1989 Ikeda and Schwartz

1992 on the library-based training of medical staff to

use MEDLINE) Significantly studies showed that

while end users wanted to be able to search databases

themselves they also desired access to searches

mediated by librarians andor a team approach to

search utilizing the end userrsquos specific knowledge of

the language and jargon of their discipline and the

information professionalrsquos understanding of

BOOLEAN and other advanced search techniques

(Ludwig et al 1988 Lancaster et al 1994 Bates

1996) The move towards embedded librarians in

law practices (Feliu and Frazer 2012 Streipe and

Talley 2013 Murray 2016) and clinical librarians in

hospitals (Brettle et al 2011 Roper 2015 Straus et

al 2016) can be seen to be a modern iteration of a

team approach to search In Higher Education subject

librarians are well-positioned to form part of a research

team using computational methods andor provide

information literacy training that upskills humanists

in computational methods (Rethlefsen et al 2015

Bedi and Walde 2016 Burke and Tumbleson 2016)

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 11 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

Page 4: Edinburgh Research Explorer · Williams, O & Farquhar, A 2017, 'Enabling complex analysis of large-scale digital collections: Humanities research, high-performance computing, and

computational approaches by humanities re-searchers which risks diminishing the relevance ofthe humanities in lsquobig datarsquo analysis (Wynne 2015)These barriers include

fragmentation of communities resources andtools

lack of interoperability complexity and incompleteness of heterogeneous

cultural heritage data sets (Terras 2009) and lack of technical skills lsquomainstream researchers

in the humanities and social sciences often donrsquotknow what the new possibilities arersquo (ibid) andseldom have the technical experience to experi-ment (Hughes 2009 Mahony and Pierazzo2012)

A common response to this lack of awarenessand computational skills is to build Web-basedinterfaces to data6 or federated services and infra-structures7 While these interfaces play a positiverole in introducing humanities researchers tolarge-scale digital collections they rarely fulfil thecomplex needs of humanities research which con-stantly questions received approaches and results orallow researchers to tailor analysis without beinglimited by shared assumptions and methods(Wynne 2013)

3 Method

We explored the challenges associated with deploy-ing and working with large-scale digital collectionssuitable for humanities research using a publicdomain digital collection provided by the BritishLibrary8 This circa 60000-book data set covers fic-tion and non-fiction publications from the 17th18th and 19th centuries ormdashseen as datamdash224 GB of compressed ALTO XML that includesboth content (captured using an Optical CharacterRecognition (OCR) process) and the location ofthat content on a page9 Using UCLrsquos centrallyfunded computing facilities10 we worked fromMarchndashJuly 2015 with UCL RITS and a cohort offour humanities researchers (from doctoral candi-dates to mid-career scholars) to ask queries thatcould not be satisfied by search- and discovery-

orientated graphical user interfaces Working in col-laboration we turned their research questions intocomputational queries explored ways in which thereturned data could be visualized and capturedtheir thoughts on the process through semi-struc-tured interviews

4 Results

We successfully ran queries across the data set thattracked linguistic change identified core phrasesplotted the placement of illustrations and mappedlocations mentioned within core texts The semi-structured interviews conducted with non-compu-tationally trained humanities researchers at variousstages during the collaborative work supported fourkey findings First that breaking down a researchquestion into a series of more defined computa-tional queries was time-consuming and challengingSecondly that the iterative nature of this researchmethodology puts pressure on the time taken toexecute queries and that long processing timeswere frustrating Thirdly that full comprehensionof the programming code was not necessary to pro-cess data and use their outputs in research thoughunderstanding the inputs outputs and effects ofparameters was required Fourthly that creatingderived data sets of a size manageable by desktopPCs11 opened up further investigation using estab-lished methods Indeed we found that buildingqueries that generate derived data sets from large-scale digital collections (small enough to be workedon locally with familiar tools) is an effective meansof empowering non-computationally trained huma-nities researchers to develop the skill sets required toundertake complex analysis of humanities data12

Our case studies deepen and add nuance to thesefindings Two of our case studies were interested inlooking at instances of particular words or phrasesin the corpus (for example lsquoprofessorrsquo) or particu-lar combinations of phrases within the corpus(lsquohigher educationrsquo) to identify a particular institu-tion and group of persons across time The require-ments from the researchers were to return thecomplete page of text that surrounded each ex-ample This was found to be technically quite

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 3 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

straightforward and resulted in a text file being de-livered to the Humanities Scholars which they couldthen lsquoclose readrsquo to analyse each instance of thesearch term within a given page of the book in thecorpus Analysis in this case entails finding instancesof the search term in question however there arefurther possibilities that can interrogate the data setfurther in procedurally and methodologically novelways We present here two more ambitious casestudies that allowed for further visualization andanalysis

41 Case Study 1 history of medicineDuke-Williams is a senior lecturer in DigitalInformation Studies in the Department ofInformation Studies at UCL13 and his researchinterests include the presentation of spatial dataand dissemination of demographic data and thepast present and future of demographic data

capture in the UK Visualization of these kinds ofdata can be used to explore issues around the spreadof diseases and the research questions were howdoes the occurrence of diseases in published litera-ture compare to known epidemics in the 19th cen-tury Can we see any correlation between theoccurrence of infectious diseases in society and ref-erence to these diseases in both fiction and non-fiction

Variations in the number of mentions of cholera(Fig 1 continuous black line) were compared torecorded epidemics (shaded bars on Fig 1)A sharp rise in mentions coincides with the firstcholera epidemic in the UK of 1831ndash32 a similarbut less pronounced rise is coincident with the1848ndash49 epidemic A more volatile pattern of men-tions is observed after this point subsequent spikesmay be associated with epidemics within andbeyond the UK or may be less directly related to

0

05

1

15

2

25

3

1790 1810 1830 1850 1870 1890 1910

Num

ber

of re

fere

nces

per

100

0 w

ords

Date of Publication

Disease References

year_cholera_epidemics Whooping Cholera Consumpon Measles

Fig 1 A search for mentions of various infectious diseases (cholera whooping cough consumption and measles)across the 60000-book data set We compared the profound spikes for cholera in the data set with known dataregarding epidemics in the UK (Chadwick 1842 Wall 1893) which appear as the bars on the graph showing arelationship between the first major UK outbreak of cholera and its appearance within the written record of thetime (in 1831ndash32) and again with the second UK epidemic (1848ndash49) Later outbreaks (1853ndash54 and 1863) do notsee this same correlation There are further pronounced spikes for mentions of cholera in the1870s and 1880s these arenot associated with UK epidemics but there were outbreaks in the USA and elsewhere Identifying the texts that refer tothese outbreaks allows us to look more closely at these clusters and to understand the relationship between publichealth epidemiology and the published historical record

M Terras et al

4 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

disease incidence Identifying the range and type oftexts (whether epidemiological reports or worksaimed at a wider audience) may help to informand understand the cultural response to disease

This work opens up possibilities for our under-standing of trends in both fiction and non-fictionand could be linked into further data sets (for ex-ample of digitized historical newspaper data)In the case of our pilot project it demonstratedthat we could graph and visualize searches basedon the corpus to present overviews that wereuseful to our researcher but only in conjunctionwith both our research software engineer and ourinformation visualization expert this service thenmdashas a result of the person hours requiredmdashdoes notscale in practice demonstrating both the potentialin the data set and the current limited opportunitieshistorians epidemiologists and historians of sci-ence have to generate such visualizations fromopen-licenced data sets

42 Case Study 2 the history of imagesFinley is a doctoral candidate on the British Libraryand University of Sheffield Collaborative PhDStudentship lsquoThe Printed Image 1750ndash1850 towardsa Digital History of Printed Book Illustrationrsquo14

Between 1750 and 1850 changes in printing tech-nology enabled several kinds of image to proliferateand for image and text to be brought together innovel and unexpected ways Existing printing tech-nologiesmdashsuch as woodcutsmdashcontinued alongsidenew printing technologies shaping the dissemin-ation reuse and meaning of the designs they con-veyed (Stijnman 2012 Maidment 2013) Tounderstand these changes scholars have so farsampled small hand-crafted collections of imagesan approach repeated in the fields of art and culturalhistory (Donald 1996 Thomas 2004) Yet digitalsources allow us to study these changes with a muchlarger sample to use visual content as well as meta-data to grapple with past phenomena at scale

Finleyrsquos research focuses on the digital imagesfrom the same 60000-book data set our projectuses The research addresses questions such asHow did changes in image techniques and the sizeof images map onto the different genres over timeWhat do quantitative findings reveal about the

changing meanings of images from one genre tothe next How do the findings made possibleusing digital humanities techniques and digitalsources compare to those using traditional methodsand small hand-crafted collections

To support these research questions we queriedthe book XML to extract the coordinates of theboundary boxes put around each area the OCRprocess defined as an image The resulting deriveddata lists the title author place of publication andreference number for each book For each of the 1million images in these books the derived data liststhe page number it appears on x-position of thetop left corner y-position of the top left corner itswidth its height and its overall size as a percentageof the page We then took two approaches to turnFinleyrsquos research questions into computationalqueries

First we used the data derived from the HPC togenerate a graph of the instances of images by theirsize as a percentage of the page over time (Fig 2)This enabled Finley to observe the dominance of fullpage and very small images (lt15 of the page)between the 1750s and 1810s after which timemdashdriven by novel deployment of woodcuts andlithographs in booksmdashthe range of figure sizes

Fig 2 A search for figures between 1750 and 1850plotted according to the size of each figure in relationto the size of the page

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 5 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

diversified Although the graph is not normalized bythe number of images in the data set for each yearand is therefore dominated by the greater volume ofbooks in the data set after 1800 it has proved auseful reference and a new way into the macro pat-terns of book illustration

Secondly we wrote a script that could be runlocally in R to create graphs based on image datafor single books Plotting the page number on thex-axis and figures as a percentage of the page on they-axis the script generated a visual representation ofthe size of illustrations in a book Finley selectedbooks for analysis to observe patterns in the use ofillustrations in books on history geology and top-ography (the subjects of his doctoral research) Here(Fig 3) we see this for the 1817 A new and completeSystem of Modern Geography a two-volume workpublished in Newcastle upon Tyne by Mackenzie(1817) The discrepancies in use of illustrations be-tween the two volumes took Finley back to thephysical books to assess how the placement andsize of images changed the reading experience be-tween volumes and to compare the findings withsimilar multi-volume works

Subsequent to the project Finley has continuedto use the project data and scripts in his researchFor example he has used the image location data toplot and compare the average position images inbooks This has underscored the value of generatingderived data that can be used locally by a researcheroutside the context of a HPC facility and a fundedproject

5 Infrastructuralrecommendations

From a technical perspective this pilot highlightedvarious sticking points when using infrastructuredeveloped predominantly for scientific researchThe combined data input and output volumeundertaken during our work (less than 300 GB) isonly moderately large by comparison to the scien-tific data sets UCL RITS usually encounters for al-though there are shared assumptions betweenresearch infrastructures (adoption of technicalstandards and the sharing of tools approachesand research outputs (Wynne 2015)) most of theUKrsquos university eScience15 infrastructure has beenconstructed specifically to run scientific and engin-eering simulations not for search and analysis of thetypes of heterogeneous data sets we see emanatingfrom cultural heritage institutions We had a largetextual input (224 GB) a simple calculation and asmall output summary of only a few KB By com-parison the typical engineering simulationaddresses moderately sized numerical input dataruns a long complicated calculation and producesa large output (multiple TBs) The average data sizeof project using the UCL data storage service is44 TB (Hetherington 2017) For example thework of the UCL Centre for ComputationalScience16 on brain blood flow simulations takes aninput file of around 1 GB and for a full productionsimulation recording a snapshot once every 200time steps produces 20 TB of output (Groen etal 2013) Poor uptake in the Arts andHumanities (Atkins et al 2010 Voss et al 2010)has meant that these computational systems havenot been optimized for Arts and Humanities work-loads The file system and network configuration of

0

25

50

75

100

7505002500

Imag

e si

ze (

of p

age)

Page number

Illustration Area by Page for Book ID 002322267

002322267_1 002322267_2

Fig 3 A search for figures in A new and complete Systemof Modern Geography two volumes (Mackenzie 1817)plotted by page (x-axis) percentage of page the eachfigure occupies (y-axis) and separated by volume

M Terras et al

6 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

LegionmdashUCL RITSrsquos centrally funded resource forrunning complex and large computational scientificqueries across a large number of coresmdashdid notmatch the way that the data set in question wasstructured (a large number of small zipped XMLfiles)

The complexities associated with redeployingarchitectures designed to work with scientific data(massive yet very structured) to the processing ofhumanities data (not massive but more unstruc-tured) should not be understated and are a majorfinding of this project Relevant libraries (such as anefficient XML processor) were needed to be installedand optimized for the hardware Also the dataneeded to be transformed to a structure that the par-allel file system (Lustre) could address efficiently(that is fewer larger files) We found that the archi-tecture at UCL which was configured for effectivecompute of scientific data was inputoutput limitedfor our processing requirements rather than compu-tationally limited Understanding the needs of ouruser community has already fed into the procure-ment and development of HPC facilities at UCL toensure that the systemsmdashwhich are available to allresearchersmdashcan deal with the variety and type ofdata that digital humanists wish to analyse in future

Best practice recommendations for similar pro-jects emerged from this work the need to buildmultiple derived data sets (counts of books andwords per year words and pages per book etc) tonormalize results and maintain statistical validitythe necessity of documenting decisions taken whenprocessing data and metadata and the value ofhaving fixed definable data for researchers to ex-plain results in relation to (and in turn the risksassociated with iterating data sets) We also dis-covered that a core set of four or five queries gavemost of the humanities researchers the type of in-formation they required to take a subset of dataaway to process effectively themselves searches forall variants of a word searches that return keywordsin context traced over time NOT searches for aword or phrase that ignored another word orphrase searches for a word when in close proximityto a second word and searches based on imagemetadata It is the subset of the data set that mosthumanities scholars required and were happy to be

presented with for further analysis (with most re-searchers wishing to see their search term in contextpresented with the complete page of the text it wasfound within to allow informed understanding)

A main finding of this pilot was given mosthumanities researchers have a research problemthat can be facilitated by a standard set of queriesacross large-scale textual data that it would be moreefficient to train a focussed group of service pro-viders to be able to generate the results needed byresearchers than providing widespread training ofhumanities academics in this area HigherEducation already employs librarians to assist insearching and training for searching (informationliteracy) and providing this professional groupwith adjustable lsquorecipesrsquo for defined computationalqueries and background training on their use wouldsituate access to infrastructure in the resource towhich humanists already turn for assistancemdashtheirsubject librarianmdashand thereby normalize such com-putational work within the general humanitiesworkflow17 In turn research software engineerscould be invoked as collaborators for their expertisesuch as for developing more complex searchesbeyond the basic recipes rather than having torepeat the defined searches across data for differentresearchers which would allow limited resources tobe used efficiently and to build on existing frame-works of support from both the library and comput-ing services

Given issues in resourcing such facilities at everyUniversity it may be more efficient for multipleHigher Education Institutions to support a specialistservice perhaps under the umbrella of the likes ofJisc Historical Texts (httphistoricaltextsjiscacuk)or national or legal deposit libraries Expertise andapproaches if not the service itself could also befacilitated through the likes of Digital ResearchInfrastructure for the Arts and Humanities(DARIAH) (httpwwwdariaheu) and CommonLanguage Resources and Technology Infrastructure(CLARIN) (httpswwwclarineu) However localsupport for researchers wishing to utilize existingeScience technologies within the Higher Educationsector should be possible Such support enables Artsand Humanities researchers to develop ongoingmutually beneficial relationships with research

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 7 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

computing within their own institution This inturn can encourage other researchers to use theseresources (rather than them being only available as aspecialist service which users have to seek out)Research computing infrastructure across the uni-versity sector will not meet the needs of ArtsHumanities and Social Sciences researchers unlessacademics in these fields becomes active users ofthe systems and their requirements can be takeninto account going forward

6 Conclusion

We successfully mounted large-scale humanitiesdata on HPC University infrastructure in an inter-disciplinary project that required input from manyprofessionals to aid the humanities scholars intheir research tasks The collaborative approachwe undertook in this project is labour-intensiveand does not scale This should not however dis-courage the sector from taking this work forwardWe found that many research questions can beexpressed with similar computational queriesalbeit with parameters adjusted to suit We recom-mend therefore that Higher EducationInstitutions or HEI clusters looking to build cap-acity for enabling complex analysis of large-scaledigital collections by their non-computationallytrained humanities researchers should considerthe following activities

(1) Invest in research software engineer capacityto deploy and maintain openly licensed large-scale digital collections from across the GLAMsector to facilitate research in the arts huma-nities and social and historical sciences

(2) Invest in training library staff to run theseinitial queries in collaboration with huma-nities faculty to support work with subsetsof data that are produced and to documentand manage resulting code and derived data

Our pilot project demonstrates that there are atpresent too many technical hurdles for most indi-viduals in the arts and humanities to consider ana-lysing large-scale open data sets Those hurdles canbe removed with initial help in ingest and

deployment of the data and the provision of spe-cific structured training and support which willallow humanities researchers to get to a subset ofuseful data they can comfortably and more simplyprocess themselves without the need for extensivesupport While we together with our partners haveplans to continue expanding the range and depth ofresearch carried out on our chosen data set thisproject has signposted many of the barriers to en-courage greater uptake of lsquobig datarsquo research acrossthe Arts and Humanities These findings should beof use to researchers wishing to use comparableapproaches and to service providers in researchcomputing aiming to encourage the use of sharedcomputational facilities by the Arts and Humanitiescommunity

AcknowledgementsThis project was funded by a pilot grant from theJisc Research Data Spring (JISC 3548) between Apriland July 2015 See httpswwwjiscacukrdpro-jectsresearch-data-spring for further details Theauthors thank our colleagues in the British Libraryand UCL RITS (particularly Clare Gryce) for theirsupport Geoffrey Rockwell and Stefan Sinclair fordiscussions on the current limitations of text ana-lysis for individual humanities scholars and Scott BWeingart for his assistance

ReferencesAtkins D E Borgman C L Bindhoff N Ellisman

M Felman S Foster I and Heck A (2010)

lsquolsquoRCUK Review of e-Science 2009rsquorsquo Research Councils

UK httpswwwepsrcacuknewseventspubsrcuk-review-of-e-science-2009-building-a-uk-foundation-

for-the-transformative-enhancement-of-research-and-

innovation

Bates M J (1996) The Getty end-user online searching

project in the humanities report no 6 overview and

conclusions College and Research Libraries 57(6)514ndash23

Bedi S and Walde C (2016) Transforming roles

Canadian academic librarians embedded in faculty re-

search projects College and Research Libraries http

M Terras et al

8 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

crlacrlorgcontentearly20160322crl16-871

abstract

Bradigan P S and Mularski C A (1989) End-user

searching in a medical school curriculum an evaluated

modular approach Bulletin of the Medical Library

Association 77(4) 348ndash56

Brettle A Maden-Jenkins M Anderson L McNally

R Pratchett T Tancock J Thornton D and

Webb A (2011) Evaluating clinical librarian services

A systematic review Health Information and Libraries

Journal 28(1) 3ndash22

Burke J and Tumbleson B (2016) LMS embedded li-

brarianship and the educational role of librarians

Library Technology Reports 52(2) 5ndash9

Chadwick E (1842) Report on the sanitary condition of

the labouring population of Great Britain supplemen-

tary report on the results of special inquiry into the

practice of interment in towns (Vol 1) HM

Stationery Office

Donald D (1996) The Age of Caricature Satirical Prints

in the Reign of George III New Haven Published for the

Paul Mellon Centre for Studies in British Art by Yale

University Press

Farber M and Shoham S (2002) Users end-users

and end-user searchers of online information a his-

torical overview Online Information Review 26(2)

92ndash100

Feliu V and Frazer H (2012) Embedded librarians

teaching legal research as a lawyering skill Journal of

Legal Education 61(4) 540ndash59

Groen D Hetherington J Carver H B Nash R

W Bernabeu M O and Coveney P V (2013)

Analysing and modelling the performance of the

HemeLB lattice-Boltzmann simulation environ-

ment Journal of Computational Science 4(5) 412ndash

22

Hetherington J (2017) Question about data capacity

and use at UCL Response to Melissa Terras via email

27 January 2017

Hettrick S (2016) A not-so-brief history of Research

Software Engineers Software Sustainability Institute

httpswwwsoftwareacukblog2016-08-19-not-so-

brief-history-research-software-engineers

Huber M (2007) The Old Bailey Proceedings 1674-1834

Evaluating and annotating a corpus of 18th - and 19th-

century spoken English Studies in Variation Contacts and

Change in English 1 Annotating Variation and Change

httpwwwhelsinkifivariengseriesvolumes01huber

Hughes A (2009) Higher Education in a Web 20 WorldJISC httpwwwwebarchiveorgukwaybackarchive20140614042502httpwwwjiscacukpublicationsgeneralpublications2009heweb2aspx

Ikeda N R and Schwartz D G (1992) Impact ofend-user training on Pharmacy students a four-yearfollow-up study Bulletin of the Medical LibraryAssociation 80(2) 124ndash30

Lancaster F W Elzy C Zeter M J Metzler L andLow Y M (1994) Searching databases on CD-ROMcomparison of the results of end-user searching withresults from two modes of searching by skilled inter-mediaries RQ 33(3) 370ndash86

Leetaru K (2015) History as Big Data 500 Years of BookImages and Mapping Millions of Books Forbes Techhttpwwwforbescomsiteskalevleetaru20150916history-as-big-data-500-years-of-book-images-and-mapping-millions-of-books

Ludwig L Mixter J K and Emanuele M A (1988)User attitudes toward end-user literature searchingBulletin of the Medical Library Association 76(1)7ndash13

Lui A (2015) Data Collections and Datasets Curated byAlan Liu httpdhresourcesforprojectbuildingpbworkscomwpage69244469Data20Collections20and20Datasets

Mackenzie E (1817) A new and complete system ofmodern geography containing an accurate delineationof the world 2 Vols Printed and published byMackenzie and Dent Newcastle-upon-Tyne

Mahony S and Pierazzo E (2012) Teaching skills orteaching methodology In Hirsch B D (ed) DigitalHumanities Pedagogy Practices Principles and PoliticsOpen Book Publishers httpwwwopenbookpubl-isherscomproduct161digital-humanities-pedagogyndashpracticesndashprinciples-and-politics

Maidment B (2013) Comedy Caricature and the SocialOrder 1820-1850 Manchester Manchester UniversityPress

Markey K (2007a) Twenty-five years of end-user search-ing Part 1 research findings Journal of the AmericanSociety for Information Science and Technology 58(8)1071ndash81

Markey K (2007b) Twenty-five years of end-usersearching Part 2 future research directions Journal ofthe American Society for Information Science andTechnology 58(8) 1123ndash30

Murray T (2016)The forecast for special librariesJournal of Library Administration 56(2) 188ndash98

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 9 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

Rethlefsen M Farrell A Osterhaus Trzasko L and

Brigham T (2015) Librarian co-authors correlated

with higher quality reported search strategies in general

internal medicine systematic reviews Journal of Clinical

Epidemiology 68(6) 617ndash26

Rockwell G (2017) Question about maximum capacity

of Voyant tools Response to Melissa Terras via email 27

January 2017

Rockwell G and Sinclair S 2016 Hermeneutica

Computer-Assisted Interpretation in the Humanities

Cambridge MIT Press

Roper T (2015) The impact of the clinical librarian a

review Journal of EAHIL 11(4) 19ndash22

Schmidt B (2014) Shipping maps and how states see

Sapping Attention Blog httpsappingattentionblogspot

couk201403shipping-maps-and-how-states-seehtml

Sinclair S (2017) Question about maximum capacity of

Voyant tools Response to Melissa Terras via email 27

January 2017

Smith D A Cordell R and Dillon E M (2013)

Infectious texts modeling text reuse in nineteenth-

century newspapers In IEEE International Conference

on Big Data October 6-9 2013 Santa Clara

California IEEE pp 86ndash94 101109

BigData20136691675

Smith D Cordell R and Mullen A (2015)

Computational methods for uncovering reprinted

texts in Antebellum newspapers American Literary

History 27(3)

Starr S S and Renford B L (1987) Evaluation of a

program to teach health professionals to search

MEDLINE Bulletin of the Medical Library Association

75(3) 193ndash201

Stijnman A (2012) Engraving and Etching 1400-2000 A

History of the Development of Manual Intaglio

Printmaking Processes London Houten Netherlands

Archetype Publications in association with HES and

DE GRAAF Publishers

Straus S Eisinga A and Sackett D (2016) What

drove the evidence cart Bringing the library to the

bedside Journal of the Royal Society of Medicine

109(6) 241ndash7

Streipe T and Talley M (2013) Embedded librarian-

ship In Kroski E (ed) Law Librarianship in the Digital

Age Plymouth Scarecrow pp 13ndash30

Terras M (2009) Potentials and Problems in Applying High

Performance Computing for Research in the Arts and

Humanities Researching e-Science Analysis of Census

Holdings Digital Humanities Quarterly httpwwwdigi-

talhumanitiesorgdhqvol34000070000070html

Terras M (2015) Opening access to collections the

making and using of open digitised cultural content

Online Information Review 39(5) 733ndash52 httpwww

emeraldinsightcomdoifull101108OIR-06-2015-0193

Thomas J (2004) Pictorial Victorians The Inscription of

Values in Word and Image Athens Ohio University

Press

Voss A Asgari-Targhi M Procter R and

Fergusson D (2010) Adoption of e-Infrastructure

services configurations of practice Philosophical

Transactions Series A Mathematical Physical and

Engineering Sciences 368 4161ndash76 doi 101098

rsta20100162

Wall A J (1893) Asiatic cholera its history pathology

and modern treatment London H K Lewis

Wittek S Sinclair S and Milner M (2015) DREaM

Distant Reading Early Modernity Digital Humanities

2015 Global Digital Humanities Sydney Australia 29

Junendash3 July 2015 httpdh2015orgabstractsxml

WITTEK_Stephen_DREaM__Distant_Reading_Early_

ModerWITTEK_Stephen_DREaM__Distant_Reading_

Early_Modernityhtml

Wynne M (2013) The role of Clarin in digital trans-

formations in the humanities International Journal of

Humanities and Arts Computing 7(1-2) 89

Wynne M (2015) Big Data and Digital Transformations in

the Humanities are we there yet Textual Digital

Humanities and Social Sciences Data Aberdeen 21ndash22

September 2015 httpwwwslidesharenetmartinwynne

big-data-and-digital-transformations-in-the-humanities

Notes1 httpswwwbluksubjectsdigital-scholarship2 httpwwwuclacukdh3 httpwwwbartlettuclacukcasa4 httpwwwuclacukisdservicesresearch-it5 httpopenglamorg an initiative to promote free

and open access to digital cultural heritage data sets6 For example Mining the History of Medicine (http

nactemacukhom) or Language of the State of the

Union (httpwwwtheatlanticcompoliticsarchive

201501the-language-of-the-state-of-the-union

384575)7 For example CLARIN (httpclarineu) and

DARIAH (httpswwwdariaheu)

M Terras et al

10 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

8 The British Library has various digital data sets

including (but not limited to) 7 million pages of his-

toric newspapers 1 million out of copyright book il-

lustrations 100000s of scientific articles text from

over 60000 books 1000s of UK theses and various

digitized medieval manuscripts We chose here just

one of its large-scale data sets to work with in this

pilot phase For the terms under which the British

Library makes collections available see httpwww

blukaboutustermscopyright9 The books cover a wide range of subject areas includ-

ing philosophy history poetry and literature most

are in English For a full list of metadata for this book

collection see doi 1021250DB2110 httpswikircuclacukwikiLegion just one of the

HPC facilities available at UCL for researchers see

httpwwwuclacukisdservicesresearch-itresearch-

computing11 It is difficult to put a figure on what a manageable text

file size would be for a researcher if we take that as

being able to search through it on their own machine

without further technical assistance This is dependent

on access to processing power which varies consider-

ably between desktop and laptop computers and the

complexity of the search the researcher is intending to

carry out as well as the format and granularity of the

source text Most humanities researchers should now

be able to comfortably manipulate text files of around

100 MB if they have access to a modern machine the

amount of text those files will contain is dependent on

how complex the file formats are Programs and tools

available to assist in text analysis include Voyant Tools

(httpsvoyant-toolsorg) and R (httpswwwr-pro-

jectorg) The upper limit to using server-based

Voyant is 20 MB of text which should correlate to

around 1 million words depending on format

(Sinclair 2017) Voyant Server can be downloaded

and used locally with an upper limit of around

100 MB on a standard PCMac (Rockwell 2017)

Indexing engines such as Lucene (httplucene

apacheorg) can search larger amounts of text how-

ever visualization of results will require further pro-

cessing and can be complex For further exploration of

lsquocomputer-assisted interpretation in the humanitiesrsquo

(particularly using Voyant) see Rockwell and

Sinclair (2016) and an example of the approach of

generating manageable smaller corpora from a large-

scale Early Modern Sources can be found in Wittek et

al (2015)12 All code data visualizations and other outputs from

this pilot project are freely available at httpsgithub

comUCL-dataspring13 httpswwwuclacukdispeopleoliverdukewilliams14 httpswwwshefacukhistoryresearchstudentswil-

liam-finley15 For more on the UKrsquos eScience infrastructure see the

work of the eScience Institute httpwwwesiacuk

Plan-Europe is the Platform of National eScience

Centres in Europe (httpplan-europeeu) In the

USA the equivalent of eScience is known as

Cyberinfrastructure see the National Science

Foundationrsquos guides httpwwwnsfgovdivindex

jspdivfrac14ACI

16 httpccschemuclacuk17 This approach is similar to that employed in the 1980s

when computerized searching of specialist databases

was first available (Farber and Shoham 2002

Markey 2007a 2007b) At that time evaluations car-

ried out on programs to cascade train end users via

librarians reported increased user uptake and effi-

ciency as compared to direct end-user training by

database providers (see Starr and Renford 1987

Bradigan and Mularski 1989 Ikeda and Schwartz

1992 on the library-based training of medical staff to

use MEDLINE) Significantly studies showed that

while end users wanted to be able to search databases

themselves they also desired access to searches

mediated by librarians andor a team approach to

search utilizing the end userrsquos specific knowledge of

the language and jargon of their discipline and the

information professionalrsquos understanding of

BOOLEAN and other advanced search techniques

(Ludwig et al 1988 Lancaster et al 1994 Bates

1996) The move towards embedded librarians in

law practices (Feliu and Frazer 2012 Streipe and

Talley 2013 Murray 2016) and clinical librarians in

hospitals (Brettle et al 2011 Roper 2015 Straus et

al 2016) can be seen to be a modern iteration of a

team approach to search In Higher Education subject

librarians are well-positioned to form part of a research

team using computational methods andor provide

information literacy training that upskills humanists

in computational methods (Rethlefsen et al 2015

Bedi and Walde 2016 Burke and Tumbleson 2016)

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 11 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

Page 5: Edinburgh Research Explorer · Williams, O & Farquhar, A 2017, 'Enabling complex analysis of large-scale digital collections: Humanities research, high-performance computing, and

straightforward and resulted in a text file being de-livered to the Humanities Scholars which they couldthen lsquoclose readrsquo to analyse each instance of thesearch term within a given page of the book in thecorpus Analysis in this case entails finding instancesof the search term in question however there arefurther possibilities that can interrogate the data setfurther in procedurally and methodologically novelways We present here two more ambitious casestudies that allowed for further visualization andanalysis

41 Case Study 1 history of medicineDuke-Williams is a senior lecturer in DigitalInformation Studies in the Department ofInformation Studies at UCL13 and his researchinterests include the presentation of spatial dataand dissemination of demographic data and thepast present and future of demographic data

capture in the UK Visualization of these kinds ofdata can be used to explore issues around the spreadof diseases and the research questions were howdoes the occurrence of diseases in published litera-ture compare to known epidemics in the 19th cen-tury Can we see any correlation between theoccurrence of infectious diseases in society and ref-erence to these diseases in both fiction and non-fiction

Variations in the number of mentions of cholera(Fig 1 continuous black line) were compared torecorded epidemics (shaded bars on Fig 1)A sharp rise in mentions coincides with the firstcholera epidemic in the UK of 1831ndash32 a similarbut less pronounced rise is coincident with the1848ndash49 epidemic A more volatile pattern of men-tions is observed after this point subsequent spikesmay be associated with epidemics within andbeyond the UK or may be less directly related to

0

05

1

15

2

25

3

1790 1810 1830 1850 1870 1890 1910

Num

ber

of re

fere

nces

per

100

0 w

ords

Date of Publication

Disease References

year_cholera_epidemics Whooping Cholera Consumpon Measles

Fig 1 A search for mentions of various infectious diseases (cholera whooping cough consumption and measles)across the 60000-book data set We compared the profound spikes for cholera in the data set with known dataregarding epidemics in the UK (Chadwick 1842 Wall 1893) which appear as the bars on the graph showing arelationship between the first major UK outbreak of cholera and its appearance within the written record of thetime (in 1831ndash32) and again with the second UK epidemic (1848ndash49) Later outbreaks (1853ndash54 and 1863) do notsee this same correlation There are further pronounced spikes for mentions of cholera in the1870s and 1880s these arenot associated with UK epidemics but there were outbreaks in the USA and elsewhere Identifying the texts that refer tothese outbreaks allows us to look more closely at these clusters and to understand the relationship between publichealth epidemiology and the published historical record

M Terras et al

4 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

disease incidence Identifying the range and type oftexts (whether epidemiological reports or worksaimed at a wider audience) may help to informand understand the cultural response to disease

This work opens up possibilities for our under-standing of trends in both fiction and non-fictionand could be linked into further data sets (for ex-ample of digitized historical newspaper data)In the case of our pilot project it demonstratedthat we could graph and visualize searches basedon the corpus to present overviews that wereuseful to our researcher but only in conjunctionwith both our research software engineer and ourinformation visualization expert this service thenmdashas a result of the person hours requiredmdashdoes notscale in practice demonstrating both the potentialin the data set and the current limited opportunitieshistorians epidemiologists and historians of sci-ence have to generate such visualizations fromopen-licenced data sets

42 Case Study 2 the history of imagesFinley is a doctoral candidate on the British Libraryand University of Sheffield Collaborative PhDStudentship lsquoThe Printed Image 1750ndash1850 towardsa Digital History of Printed Book Illustrationrsquo14

Between 1750 and 1850 changes in printing tech-nology enabled several kinds of image to proliferateand for image and text to be brought together innovel and unexpected ways Existing printing tech-nologiesmdashsuch as woodcutsmdashcontinued alongsidenew printing technologies shaping the dissemin-ation reuse and meaning of the designs they con-veyed (Stijnman 2012 Maidment 2013) Tounderstand these changes scholars have so farsampled small hand-crafted collections of imagesan approach repeated in the fields of art and culturalhistory (Donald 1996 Thomas 2004) Yet digitalsources allow us to study these changes with a muchlarger sample to use visual content as well as meta-data to grapple with past phenomena at scale

Finleyrsquos research focuses on the digital imagesfrom the same 60000-book data set our projectuses The research addresses questions such asHow did changes in image techniques and the sizeof images map onto the different genres over timeWhat do quantitative findings reveal about the

changing meanings of images from one genre tothe next How do the findings made possibleusing digital humanities techniques and digitalsources compare to those using traditional methodsand small hand-crafted collections

To support these research questions we queriedthe book XML to extract the coordinates of theboundary boxes put around each area the OCRprocess defined as an image The resulting deriveddata lists the title author place of publication andreference number for each book For each of the 1million images in these books the derived data liststhe page number it appears on x-position of thetop left corner y-position of the top left corner itswidth its height and its overall size as a percentageof the page We then took two approaches to turnFinleyrsquos research questions into computationalqueries

First we used the data derived from the HPC togenerate a graph of the instances of images by theirsize as a percentage of the page over time (Fig 2)This enabled Finley to observe the dominance of fullpage and very small images (lt15 of the page)between the 1750s and 1810s after which timemdashdriven by novel deployment of woodcuts andlithographs in booksmdashthe range of figure sizes

Fig 2 A search for figures between 1750 and 1850plotted according to the size of each figure in relationto the size of the page

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 5 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

diversified Although the graph is not normalized bythe number of images in the data set for each yearand is therefore dominated by the greater volume ofbooks in the data set after 1800 it has proved auseful reference and a new way into the macro pat-terns of book illustration

Secondly we wrote a script that could be runlocally in R to create graphs based on image datafor single books Plotting the page number on thex-axis and figures as a percentage of the page on they-axis the script generated a visual representation ofthe size of illustrations in a book Finley selectedbooks for analysis to observe patterns in the use ofillustrations in books on history geology and top-ography (the subjects of his doctoral research) Here(Fig 3) we see this for the 1817 A new and completeSystem of Modern Geography a two-volume workpublished in Newcastle upon Tyne by Mackenzie(1817) The discrepancies in use of illustrations be-tween the two volumes took Finley back to thephysical books to assess how the placement andsize of images changed the reading experience be-tween volumes and to compare the findings withsimilar multi-volume works

Subsequent to the project Finley has continuedto use the project data and scripts in his researchFor example he has used the image location data toplot and compare the average position images inbooks This has underscored the value of generatingderived data that can be used locally by a researcheroutside the context of a HPC facility and a fundedproject

5 Infrastructuralrecommendations

From a technical perspective this pilot highlightedvarious sticking points when using infrastructuredeveloped predominantly for scientific researchThe combined data input and output volumeundertaken during our work (less than 300 GB) isonly moderately large by comparison to the scien-tific data sets UCL RITS usually encounters for al-though there are shared assumptions betweenresearch infrastructures (adoption of technicalstandards and the sharing of tools approachesand research outputs (Wynne 2015)) most of theUKrsquos university eScience15 infrastructure has beenconstructed specifically to run scientific and engin-eering simulations not for search and analysis of thetypes of heterogeneous data sets we see emanatingfrom cultural heritage institutions We had a largetextual input (224 GB) a simple calculation and asmall output summary of only a few KB By com-parison the typical engineering simulationaddresses moderately sized numerical input dataruns a long complicated calculation and producesa large output (multiple TBs) The average data sizeof project using the UCL data storage service is44 TB (Hetherington 2017) For example thework of the UCL Centre for ComputationalScience16 on brain blood flow simulations takes aninput file of around 1 GB and for a full productionsimulation recording a snapshot once every 200time steps produces 20 TB of output (Groen etal 2013) Poor uptake in the Arts andHumanities (Atkins et al 2010 Voss et al 2010)has meant that these computational systems havenot been optimized for Arts and Humanities work-loads The file system and network configuration of

0

25

50

75

100

7505002500

Imag

e si

ze (

of p

age)

Page number

Illustration Area by Page for Book ID 002322267

002322267_1 002322267_2

Fig 3 A search for figures in A new and complete Systemof Modern Geography two volumes (Mackenzie 1817)plotted by page (x-axis) percentage of page the eachfigure occupies (y-axis) and separated by volume

M Terras et al

6 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

LegionmdashUCL RITSrsquos centrally funded resource forrunning complex and large computational scientificqueries across a large number of coresmdashdid notmatch the way that the data set in question wasstructured (a large number of small zipped XMLfiles)

The complexities associated with redeployingarchitectures designed to work with scientific data(massive yet very structured) to the processing ofhumanities data (not massive but more unstruc-tured) should not be understated and are a majorfinding of this project Relevant libraries (such as anefficient XML processor) were needed to be installedand optimized for the hardware Also the dataneeded to be transformed to a structure that the par-allel file system (Lustre) could address efficiently(that is fewer larger files) We found that the archi-tecture at UCL which was configured for effectivecompute of scientific data was inputoutput limitedfor our processing requirements rather than compu-tationally limited Understanding the needs of ouruser community has already fed into the procure-ment and development of HPC facilities at UCL toensure that the systemsmdashwhich are available to allresearchersmdashcan deal with the variety and type ofdata that digital humanists wish to analyse in future

Best practice recommendations for similar pro-jects emerged from this work the need to buildmultiple derived data sets (counts of books andwords per year words and pages per book etc) tonormalize results and maintain statistical validitythe necessity of documenting decisions taken whenprocessing data and metadata and the value ofhaving fixed definable data for researchers to ex-plain results in relation to (and in turn the risksassociated with iterating data sets) We also dis-covered that a core set of four or five queries gavemost of the humanities researchers the type of in-formation they required to take a subset of dataaway to process effectively themselves searches forall variants of a word searches that return keywordsin context traced over time NOT searches for aword or phrase that ignored another word orphrase searches for a word when in close proximityto a second word and searches based on imagemetadata It is the subset of the data set that mosthumanities scholars required and were happy to be

presented with for further analysis (with most re-searchers wishing to see their search term in contextpresented with the complete page of the text it wasfound within to allow informed understanding)

A main finding of this pilot was given mosthumanities researchers have a research problemthat can be facilitated by a standard set of queriesacross large-scale textual data that it would be moreefficient to train a focussed group of service pro-viders to be able to generate the results needed byresearchers than providing widespread training ofhumanities academics in this area HigherEducation already employs librarians to assist insearching and training for searching (informationliteracy) and providing this professional groupwith adjustable lsquorecipesrsquo for defined computationalqueries and background training on their use wouldsituate access to infrastructure in the resource towhich humanists already turn for assistancemdashtheirsubject librarianmdashand thereby normalize such com-putational work within the general humanitiesworkflow17 In turn research software engineerscould be invoked as collaborators for their expertisesuch as for developing more complex searchesbeyond the basic recipes rather than having torepeat the defined searches across data for differentresearchers which would allow limited resources tobe used efficiently and to build on existing frame-works of support from both the library and comput-ing services

Given issues in resourcing such facilities at everyUniversity it may be more efficient for multipleHigher Education Institutions to support a specialistservice perhaps under the umbrella of the likes ofJisc Historical Texts (httphistoricaltextsjiscacuk)or national or legal deposit libraries Expertise andapproaches if not the service itself could also befacilitated through the likes of Digital ResearchInfrastructure for the Arts and Humanities(DARIAH) (httpwwwdariaheu) and CommonLanguage Resources and Technology Infrastructure(CLARIN) (httpswwwclarineu) However localsupport for researchers wishing to utilize existingeScience technologies within the Higher Educationsector should be possible Such support enables Artsand Humanities researchers to develop ongoingmutually beneficial relationships with research

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 7 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

computing within their own institution This inturn can encourage other researchers to use theseresources (rather than them being only available as aspecialist service which users have to seek out)Research computing infrastructure across the uni-versity sector will not meet the needs of ArtsHumanities and Social Sciences researchers unlessacademics in these fields becomes active users ofthe systems and their requirements can be takeninto account going forward

6 Conclusion

We successfully mounted large-scale humanitiesdata on HPC University infrastructure in an inter-disciplinary project that required input from manyprofessionals to aid the humanities scholars intheir research tasks The collaborative approachwe undertook in this project is labour-intensiveand does not scale This should not however dis-courage the sector from taking this work forwardWe found that many research questions can beexpressed with similar computational queriesalbeit with parameters adjusted to suit We recom-mend therefore that Higher EducationInstitutions or HEI clusters looking to build cap-acity for enabling complex analysis of large-scaledigital collections by their non-computationallytrained humanities researchers should considerthe following activities

(1) Invest in research software engineer capacityto deploy and maintain openly licensed large-scale digital collections from across the GLAMsector to facilitate research in the arts huma-nities and social and historical sciences

(2) Invest in training library staff to run theseinitial queries in collaboration with huma-nities faculty to support work with subsetsof data that are produced and to documentand manage resulting code and derived data

Our pilot project demonstrates that there are atpresent too many technical hurdles for most indi-viduals in the arts and humanities to consider ana-lysing large-scale open data sets Those hurdles canbe removed with initial help in ingest and

deployment of the data and the provision of spe-cific structured training and support which willallow humanities researchers to get to a subset ofuseful data they can comfortably and more simplyprocess themselves without the need for extensivesupport While we together with our partners haveplans to continue expanding the range and depth ofresearch carried out on our chosen data set thisproject has signposted many of the barriers to en-courage greater uptake of lsquobig datarsquo research acrossthe Arts and Humanities These findings should beof use to researchers wishing to use comparableapproaches and to service providers in researchcomputing aiming to encourage the use of sharedcomputational facilities by the Arts and Humanitiescommunity

AcknowledgementsThis project was funded by a pilot grant from theJisc Research Data Spring (JISC 3548) between Apriland July 2015 See httpswwwjiscacukrdpro-jectsresearch-data-spring for further details Theauthors thank our colleagues in the British Libraryand UCL RITS (particularly Clare Gryce) for theirsupport Geoffrey Rockwell and Stefan Sinclair fordiscussions on the current limitations of text ana-lysis for individual humanities scholars and Scott BWeingart for his assistance

ReferencesAtkins D E Borgman C L Bindhoff N Ellisman

M Felman S Foster I and Heck A (2010)

lsquolsquoRCUK Review of e-Science 2009rsquorsquo Research Councils

UK httpswwwepsrcacuknewseventspubsrcuk-review-of-e-science-2009-building-a-uk-foundation-

for-the-transformative-enhancement-of-research-and-

innovation

Bates M J (1996) The Getty end-user online searching

project in the humanities report no 6 overview and

conclusions College and Research Libraries 57(6)514ndash23

Bedi S and Walde C (2016) Transforming roles

Canadian academic librarians embedded in faculty re-

search projects College and Research Libraries http

M Terras et al

8 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

crlacrlorgcontentearly20160322crl16-871

abstract

Bradigan P S and Mularski C A (1989) End-user

searching in a medical school curriculum an evaluated

modular approach Bulletin of the Medical Library

Association 77(4) 348ndash56

Brettle A Maden-Jenkins M Anderson L McNally

R Pratchett T Tancock J Thornton D and

Webb A (2011) Evaluating clinical librarian services

A systematic review Health Information and Libraries

Journal 28(1) 3ndash22

Burke J and Tumbleson B (2016) LMS embedded li-

brarianship and the educational role of librarians

Library Technology Reports 52(2) 5ndash9

Chadwick E (1842) Report on the sanitary condition of

the labouring population of Great Britain supplemen-

tary report on the results of special inquiry into the

practice of interment in towns (Vol 1) HM

Stationery Office

Donald D (1996) The Age of Caricature Satirical Prints

in the Reign of George III New Haven Published for the

Paul Mellon Centre for Studies in British Art by Yale

University Press

Farber M and Shoham S (2002) Users end-users

and end-user searchers of online information a his-

torical overview Online Information Review 26(2)

92ndash100

Feliu V and Frazer H (2012) Embedded librarians

teaching legal research as a lawyering skill Journal of

Legal Education 61(4) 540ndash59

Groen D Hetherington J Carver H B Nash R

W Bernabeu M O and Coveney P V (2013)

Analysing and modelling the performance of the

HemeLB lattice-Boltzmann simulation environ-

ment Journal of Computational Science 4(5) 412ndash

22

Hetherington J (2017) Question about data capacity

and use at UCL Response to Melissa Terras via email

27 January 2017

Hettrick S (2016) A not-so-brief history of Research

Software Engineers Software Sustainability Institute

httpswwwsoftwareacukblog2016-08-19-not-so-

brief-history-research-software-engineers

Huber M (2007) The Old Bailey Proceedings 1674-1834

Evaluating and annotating a corpus of 18th - and 19th-

century spoken English Studies in Variation Contacts and

Change in English 1 Annotating Variation and Change

httpwwwhelsinkifivariengseriesvolumes01huber

Hughes A (2009) Higher Education in a Web 20 WorldJISC httpwwwwebarchiveorgukwaybackarchive20140614042502httpwwwjiscacukpublicationsgeneralpublications2009heweb2aspx

Ikeda N R and Schwartz D G (1992) Impact ofend-user training on Pharmacy students a four-yearfollow-up study Bulletin of the Medical LibraryAssociation 80(2) 124ndash30

Lancaster F W Elzy C Zeter M J Metzler L andLow Y M (1994) Searching databases on CD-ROMcomparison of the results of end-user searching withresults from two modes of searching by skilled inter-mediaries RQ 33(3) 370ndash86

Leetaru K (2015) History as Big Data 500 Years of BookImages and Mapping Millions of Books Forbes Techhttpwwwforbescomsiteskalevleetaru20150916history-as-big-data-500-years-of-book-images-and-mapping-millions-of-books

Ludwig L Mixter J K and Emanuele M A (1988)User attitudes toward end-user literature searchingBulletin of the Medical Library Association 76(1)7ndash13

Lui A (2015) Data Collections and Datasets Curated byAlan Liu httpdhresourcesforprojectbuildingpbworkscomwpage69244469Data20Collections20and20Datasets

Mackenzie E (1817) A new and complete system ofmodern geography containing an accurate delineationof the world 2 Vols Printed and published byMackenzie and Dent Newcastle-upon-Tyne

Mahony S and Pierazzo E (2012) Teaching skills orteaching methodology In Hirsch B D (ed) DigitalHumanities Pedagogy Practices Principles and PoliticsOpen Book Publishers httpwwwopenbookpubl-isherscomproduct161digital-humanities-pedagogyndashpracticesndashprinciples-and-politics

Maidment B (2013) Comedy Caricature and the SocialOrder 1820-1850 Manchester Manchester UniversityPress

Markey K (2007a) Twenty-five years of end-user search-ing Part 1 research findings Journal of the AmericanSociety for Information Science and Technology 58(8)1071ndash81

Markey K (2007b) Twenty-five years of end-usersearching Part 2 future research directions Journal ofthe American Society for Information Science andTechnology 58(8) 1123ndash30

Murray T (2016)The forecast for special librariesJournal of Library Administration 56(2) 188ndash98

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 9 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

Rethlefsen M Farrell A Osterhaus Trzasko L and

Brigham T (2015) Librarian co-authors correlated

with higher quality reported search strategies in general

internal medicine systematic reviews Journal of Clinical

Epidemiology 68(6) 617ndash26

Rockwell G (2017) Question about maximum capacity

of Voyant tools Response to Melissa Terras via email 27

January 2017

Rockwell G and Sinclair S 2016 Hermeneutica

Computer-Assisted Interpretation in the Humanities

Cambridge MIT Press

Roper T (2015) The impact of the clinical librarian a

review Journal of EAHIL 11(4) 19ndash22

Schmidt B (2014) Shipping maps and how states see

Sapping Attention Blog httpsappingattentionblogspot

couk201403shipping-maps-and-how-states-seehtml

Sinclair S (2017) Question about maximum capacity of

Voyant tools Response to Melissa Terras via email 27

January 2017

Smith D A Cordell R and Dillon E M (2013)

Infectious texts modeling text reuse in nineteenth-

century newspapers In IEEE International Conference

on Big Data October 6-9 2013 Santa Clara

California IEEE pp 86ndash94 101109

BigData20136691675

Smith D Cordell R and Mullen A (2015)

Computational methods for uncovering reprinted

texts in Antebellum newspapers American Literary

History 27(3)

Starr S S and Renford B L (1987) Evaluation of a

program to teach health professionals to search

MEDLINE Bulletin of the Medical Library Association

75(3) 193ndash201

Stijnman A (2012) Engraving and Etching 1400-2000 A

History of the Development of Manual Intaglio

Printmaking Processes London Houten Netherlands

Archetype Publications in association with HES and

DE GRAAF Publishers

Straus S Eisinga A and Sackett D (2016) What

drove the evidence cart Bringing the library to the

bedside Journal of the Royal Society of Medicine

109(6) 241ndash7

Streipe T and Talley M (2013) Embedded librarian-

ship In Kroski E (ed) Law Librarianship in the Digital

Age Plymouth Scarecrow pp 13ndash30

Terras M (2009) Potentials and Problems in Applying High

Performance Computing for Research in the Arts and

Humanities Researching e-Science Analysis of Census

Holdings Digital Humanities Quarterly httpwwwdigi-

talhumanitiesorgdhqvol34000070000070html

Terras M (2015) Opening access to collections the

making and using of open digitised cultural content

Online Information Review 39(5) 733ndash52 httpwww

emeraldinsightcomdoifull101108OIR-06-2015-0193

Thomas J (2004) Pictorial Victorians The Inscription of

Values in Word and Image Athens Ohio University

Press

Voss A Asgari-Targhi M Procter R and

Fergusson D (2010) Adoption of e-Infrastructure

services configurations of practice Philosophical

Transactions Series A Mathematical Physical and

Engineering Sciences 368 4161ndash76 doi 101098

rsta20100162

Wall A J (1893) Asiatic cholera its history pathology

and modern treatment London H K Lewis

Wittek S Sinclair S and Milner M (2015) DREaM

Distant Reading Early Modernity Digital Humanities

2015 Global Digital Humanities Sydney Australia 29

Junendash3 July 2015 httpdh2015orgabstractsxml

WITTEK_Stephen_DREaM__Distant_Reading_Early_

ModerWITTEK_Stephen_DREaM__Distant_Reading_

Early_Modernityhtml

Wynne M (2013) The role of Clarin in digital trans-

formations in the humanities International Journal of

Humanities and Arts Computing 7(1-2) 89

Wynne M (2015) Big Data and Digital Transformations in

the Humanities are we there yet Textual Digital

Humanities and Social Sciences Data Aberdeen 21ndash22

September 2015 httpwwwslidesharenetmartinwynne

big-data-and-digital-transformations-in-the-humanities

Notes1 httpswwwbluksubjectsdigital-scholarship2 httpwwwuclacukdh3 httpwwwbartlettuclacukcasa4 httpwwwuclacukisdservicesresearch-it5 httpopenglamorg an initiative to promote free

and open access to digital cultural heritage data sets6 For example Mining the History of Medicine (http

nactemacukhom) or Language of the State of the

Union (httpwwwtheatlanticcompoliticsarchive

201501the-language-of-the-state-of-the-union

384575)7 For example CLARIN (httpclarineu) and

DARIAH (httpswwwdariaheu)

M Terras et al

10 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

8 The British Library has various digital data sets

including (but not limited to) 7 million pages of his-

toric newspapers 1 million out of copyright book il-

lustrations 100000s of scientific articles text from

over 60000 books 1000s of UK theses and various

digitized medieval manuscripts We chose here just

one of its large-scale data sets to work with in this

pilot phase For the terms under which the British

Library makes collections available see httpwww

blukaboutustermscopyright9 The books cover a wide range of subject areas includ-

ing philosophy history poetry and literature most

are in English For a full list of metadata for this book

collection see doi 1021250DB2110 httpswikircuclacukwikiLegion just one of the

HPC facilities available at UCL for researchers see

httpwwwuclacukisdservicesresearch-itresearch-

computing11 It is difficult to put a figure on what a manageable text

file size would be for a researcher if we take that as

being able to search through it on their own machine

without further technical assistance This is dependent

on access to processing power which varies consider-

ably between desktop and laptop computers and the

complexity of the search the researcher is intending to

carry out as well as the format and granularity of the

source text Most humanities researchers should now

be able to comfortably manipulate text files of around

100 MB if they have access to a modern machine the

amount of text those files will contain is dependent on

how complex the file formats are Programs and tools

available to assist in text analysis include Voyant Tools

(httpsvoyant-toolsorg) and R (httpswwwr-pro-

jectorg) The upper limit to using server-based

Voyant is 20 MB of text which should correlate to

around 1 million words depending on format

(Sinclair 2017) Voyant Server can be downloaded

and used locally with an upper limit of around

100 MB on a standard PCMac (Rockwell 2017)

Indexing engines such as Lucene (httplucene

apacheorg) can search larger amounts of text how-

ever visualization of results will require further pro-

cessing and can be complex For further exploration of

lsquocomputer-assisted interpretation in the humanitiesrsquo

(particularly using Voyant) see Rockwell and

Sinclair (2016) and an example of the approach of

generating manageable smaller corpora from a large-

scale Early Modern Sources can be found in Wittek et

al (2015)12 All code data visualizations and other outputs from

this pilot project are freely available at httpsgithub

comUCL-dataspring13 httpswwwuclacukdispeopleoliverdukewilliams14 httpswwwshefacukhistoryresearchstudentswil-

liam-finley15 For more on the UKrsquos eScience infrastructure see the

work of the eScience Institute httpwwwesiacuk

Plan-Europe is the Platform of National eScience

Centres in Europe (httpplan-europeeu) In the

USA the equivalent of eScience is known as

Cyberinfrastructure see the National Science

Foundationrsquos guides httpwwwnsfgovdivindex

jspdivfrac14ACI

16 httpccschemuclacuk17 This approach is similar to that employed in the 1980s

when computerized searching of specialist databases

was first available (Farber and Shoham 2002

Markey 2007a 2007b) At that time evaluations car-

ried out on programs to cascade train end users via

librarians reported increased user uptake and effi-

ciency as compared to direct end-user training by

database providers (see Starr and Renford 1987

Bradigan and Mularski 1989 Ikeda and Schwartz

1992 on the library-based training of medical staff to

use MEDLINE) Significantly studies showed that

while end users wanted to be able to search databases

themselves they also desired access to searches

mediated by librarians andor a team approach to

search utilizing the end userrsquos specific knowledge of

the language and jargon of their discipline and the

information professionalrsquos understanding of

BOOLEAN and other advanced search techniques

(Ludwig et al 1988 Lancaster et al 1994 Bates

1996) The move towards embedded librarians in

law practices (Feliu and Frazer 2012 Streipe and

Talley 2013 Murray 2016) and clinical librarians in

hospitals (Brettle et al 2011 Roper 2015 Straus et

al 2016) can be seen to be a modern iteration of a

team approach to search In Higher Education subject

librarians are well-positioned to form part of a research

team using computational methods andor provide

information literacy training that upskills humanists

in computational methods (Rethlefsen et al 2015

Bedi and Walde 2016 Burke and Tumbleson 2016)

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 11 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

Page 6: Edinburgh Research Explorer · Williams, O & Farquhar, A 2017, 'Enabling complex analysis of large-scale digital collections: Humanities research, high-performance computing, and

disease incidence Identifying the range and type oftexts (whether epidemiological reports or worksaimed at a wider audience) may help to informand understand the cultural response to disease

This work opens up possibilities for our under-standing of trends in both fiction and non-fictionand could be linked into further data sets (for ex-ample of digitized historical newspaper data)In the case of our pilot project it demonstratedthat we could graph and visualize searches basedon the corpus to present overviews that wereuseful to our researcher but only in conjunctionwith both our research software engineer and ourinformation visualization expert this service thenmdashas a result of the person hours requiredmdashdoes notscale in practice demonstrating both the potentialin the data set and the current limited opportunitieshistorians epidemiologists and historians of sci-ence have to generate such visualizations fromopen-licenced data sets

42 Case Study 2 the history of imagesFinley is a doctoral candidate on the British Libraryand University of Sheffield Collaborative PhDStudentship lsquoThe Printed Image 1750ndash1850 towardsa Digital History of Printed Book Illustrationrsquo14

Between 1750 and 1850 changes in printing tech-nology enabled several kinds of image to proliferateand for image and text to be brought together innovel and unexpected ways Existing printing tech-nologiesmdashsuch as woodcutsmdashcontinued alongsidenew printing technologies shaping the dissemin-ation reuse and meaning of the designs they con-veyed (Stijnman 2012 Maidment 2013) Tounderstand these changes scholars have so farsampled small hand-crafted collections of imagesan approach repeated in the fields of art and culturalhistory (Donald 1996 Thomas 2004) Yet digitalsources allow us to study these changes with a muchlarger sample to use visual content as well as meta-data to grapple with past phenomena at scale

Finleyrsquos research focuses on the digital imagesfrom the same 60000-book data set our projectuses The research addresses questions such asHow did changes in image techniques and the sizeof images map onto the different genres over timeWhat do quantitative findings reveal about the

changing meanings of images from one genre tothe next How do the findings made possibleusing digital humanities techniques and digitalsources compare to those using traditional methodsand small hand-crafted collections

To support these research questions we queriedthe book XML to extract the coordinates of theboundary boxes put around each area the OCRprocess defined as an image The resulting deriveddata lists the title author place of publication andreference number for each book For each of the 1million images in these books the derived data liststhe page number it appears on x-position of thetop left corner y-position of the top left corner itswidth its height and its overall size as a percentageof the page We then took two approaches to turnFinleyrsquos research questions into computationalqueries

First we used the data derived from the HPC togenerate a graph of the instances of images by theirsize as a percentage of the page over time (Fig 2)This enabled Finley to observe the dominance of fullpage and very small images (lt15 of the page)between the 1750s and 1810s after which timemdashdriven by novel deployment of woodcuts andlithographs in booksmdashthe range of figure sizes

Fig 2 A search for figures between 1750 and 1850plotted according to the size of each figure in relationto the size of the page

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 5 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

diversified Although the graph is not normalized bythe number of images in the data set for each yearand is therefore dominated by the greater volume ofbooks in the data set after 1800 it has proved auseful reference and a new way into the macro pat-terns of book illustration

Secondly we wrote a script that could be runlocally in R to create graphs based on image datafor single books Plotting the page number on thex-axis and figures as a percentage of the page on they-axis the script generated a visual representation ofthe size of illustrations in a book Finley selectedbooks for analysis to observe patterns in the use ofillustrations in books on history geology and top-ography (the subjects of his doctoral research) Here(Fig 3) we see this for the 1817 A new and completeSystem of Modern Geography a two-volume workpublished in Newcastle upon Tyne by Mackenzie(1817) The discrepancies in use of illustrations be-tween the two volumes took Finley back to thephysical books to assess how the placement andsize of images changed the reading experience be-tween volumes and to compare the findings withsimilar multi-volume works

Subsequent to the project Finley has continuedto use the project data and scripts in his researchFor example he has used the image location data toplot and compare the average position images inbooks This has underscored the value of generatingderived data that can be used locally by a researcheroutside the context of a HPC facility and a fundedproject

5 Infrastructuralrecommendations

From a technical perspective this pilot highlightedvarious sticking points when using infrastructuredeveloped predominantly for scientific researchThe combined data input and output volumeundertaken during our work (less than 300 GB) isonly moderately large by comparison to the scien-tific data sets UCL RITS usually encounters for al-though there are shared assumptions betweenresearch infrastructures (adoption of technicalstandards and the sharing of tools approachesand research outputs (Wynne 2015)) most of theUKrsquos university eScience15 infrastructure has beenconstructed specifically to run scientific and engin-eering simulations not for search and analysis of thetypes of heterogeneous data sets we see emanatingfrom cultural heritage institutions We had a largetextual input (224 GB) a simple calculation and asmall output summary of only a few KB By com-parison the typical engineering simulationaddresses moderately sized numerical input dataruns a long complicated calculation and producesa large output (multiple TBs) The average data sizeof project using the UCL data storage service is44 TB (Hetherington 2017) For example thework of the UCL Centre for ComputationalScience16 on brain blood flow simulations takes aninput file of around 1 GB and for a full productionsimulation recording a snapshot once every 200time steps produces 20 TB of output (Groen etal 2013) Poor uptake in the Arts andHumanities (Atkins et al 2010 Voss et al 2010)has meant that these computational systems havenot been optimized for Arts and Humanities work-loads The file system and network configuration of

0

25

50

75

100

7505002500

Imag

e si

ze (

of p

age)

Page number

Illustration Area by Page for Book ID 002322267

002322267_1 002322267_2

Fig 3 A search for figures in A new and complete Systemof Modern Geography two volumes (Mackenzie 1817)plotted by page (x-axis) percentage of page the eachfigure occupies (y-axis) and separated by volume

M Terras et al

6 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

LegionmdashUCL RITSrsquos centrally funded resource forrunning complex and large computational scientificqueries across a large number of coresmdashdid notmatch the way that the data set in question wasstructured (a large number of small zipped XMLfiles)

The complexities associated with redeployingarchitectures designed to work with scientific data(massive yet very structured) to the processing ofhumanities data (not massive but more unstruc-tured) should not be understated and are a majorfinding of this project Relevant libraries (such as anefficient XML processor) were needed to be installedand optimized for the hardware Also the dataneeded to be transformed to a structure that the par-allel file system (Lustre) could address efficiently(that is fewer larger files) We found that the archi-tecture at UCL which was configured for effectivecompute of scientific data was inputoutput limitedfor our processing requirements rather than compu-tationally limited Understanding the needs of ouruser community has already fed into the procure-ment and development of HPC facilities at UCL toensure that the systemsmdashwhich are available to allresearchersmdashcan deal with the variety and type ofdata that digital humanists wish to analyse in future

Best practice recommendations for similar pro-jects emerged from this work the need to buildmultiple derived data sets (counts of books andwords per year words and pages per book etc) tonormalize results and maintain statistical validitythe necessity of documenting decisions taken whenprocessing data and metadata and the value ofhaving fixed definable data for researchers to ex-plain results in relation to (and in turn the risksassociated with iterating data sets) We also dis-covered that a core set of four or five queries gavemost of the humanities researchers the type of in-formation they required to take a subset of dataaway to process effectively themselves searches forall variants of a word searches that return keywordsin context traced over time NOT searches for aword or phrase that ignored another word orphrase searches for a word when in close proximityto a second word and searches based on imagemetadata It is the subset of the data set that mosthumanities scholars required and were happy to be

presented with for further analysis (with most re-searchers wishing to see their search term in contextpresented with the complete page of the text it wasfound within to allow informed understanding)

A main finding of this pilot was given mosthumanities researchers have a research problemthat can be facilitated by a standard set of queriesacross large-scale textual data that it would be moreefficient to train a focussed group of service pro-viders to be able to generate the results needed byresearchers than providing widespread training ofhumanities academics in this area HigherEducation already employs librarians to assist insearching and training for searching (informationliteracy) and providing this professional groupwith adjustable lsquorecipesrsquo for defined computationalqueries and background training on their use wouldsituate access to infrastructure in the resource towhich humanists already turn for assistancemdashtheirsubject librarianmdashand thereby normalize such com-putational work within the general humanitiesworkflow17 In turn research software engineerscould be invoked as collaborators for their expertisesuch as for developing more complex searchesbeyond the basic recipes rather than having torepeat the defined searches across data for differentresearchers which would allow limited resources tobe used efficiently and to build on existing frame-works of support from both the library and comput-ing services

Given issues in resourcing such facilities at everyUniversity it may be more efficient for multipleHigher Education Institutions to support a specialistservice perhaps under the umbrella of the likes ofJisc Historical Texts (httphistoricaltextsjiscacuk)or national or legal deposit libraries Expertise andapproaches if not the service itself could also befacilitated through the likes of Digital ResearchInfrastructure for the Arts and Humanities(DARIAH) (httpwwwdariaheu) and CommonLanguage Resources and Technology Infrastructure(CLARIN) (httpswwwclarineu) However localsupport for researchers wishing to utilize existingeScience technologies within the Higher Educationsector should be possible Such support enables Artsand Humanities researchers to develop ongoingmutually beneficial relationships with research

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 7 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

computing within their own institution This inturn can encourage other researchers to use theseresources (rather than them being only available as aspecialist service which users have to seek out)Research computing infrastructure across the uni-versity sector will not meet the needs of ArtsHumanities and Social Sciences researchers unlessacademics in these fields becomes active users ofthe systems and their requirements can be takeninto account going forward

6 Conclusion

We successfully mounted large-scale humanitiesdata on HPC University infrastructure in an inter-disciplinary project that required input from manyprofessionals to aid the humanities scholars intheir research tasks The collaborative approachwe undertook in this project is labour-intensiveand does not scale This should not however dis-courage the sector from taking this work forwardWe found that many research questions can beexpressed with similar computational queriesalbeit with parameters adjusted to suit We recom-mend therefore that Higher EducationInstitutions or HEI clusters looking to build cap-acity for enabling complex analysis of large-scaledigital collections by their non-computationallytrained humanities researchers should considerthe following activities

(1) Invest in research software engineer capacityto deploy and maintain openly licensed large-scale digital collections from across the GLAMsector to facilitate research in the arts huma-nities and social and historical sciences

(2) Invest in training library staff to run theseinitial queries in collaboration with huma-nities faculty to support work with subsetsof data that are produced and to documentand manage resulting code and derived data

Our pilot project demonstrates that there are atpresent too many technical hurdles for most indi-viduals in the arts and humanities to consider ana-lysing large-scale open data sets Those hurdles canbe removed with initial help in ingest and

deployment of the data and the provision of spe-cific structured training and support which willallow humanities researchers to get to a subset ofuseful data they can comfortably and more simplyprocess themselves without the need for extensivesupport While we together with our partners haveplans to continue expanding the range and depth ofresearch carried out on our chosen data set thisproject has signposted many of the barriers to en-courage greater uptake of lsquobig datarsquo research acrossthe Arts and Humanities These findings should beof use to researchers wishing to use comparableapproaches and to service providers in researchcomputing aiming to encourage the use of sharedcomputational facilities by the Arts and Humanitiescommunity

AcknowledgementsThis project was funded by a pilot grant from theJisc Research Data Spring (JISC 3548) between Apriland July 2015 See httpswwwjiscacukrdpro-jectsresearch-data-spring for further details Theauthors thank our colleagues in the British Libraryand UCL RITS (particularly Clare Gryce) for theirsupport Geoffrey Rockwell and Stefan Sinclair fordiscussions on the current limitations of text ana-lysis for individual humanities scholars and Scott BWeingart for his assistance

ReferencesAtkins D E Borgman C L Bindhoff N Ellisman

M Felman S Foster I and Heck A (2010)

lsquolsquoRCUK Review of e-Science 2009rsquorsquo Research Councils

UK httpswwwepsrcacuknewseventspubsrcuk-review-of-e-science-2009-building-a-uk-foundation-

for-the-transformative-enhancement-of-research-and-

innovation

Bates M J (1996) The Getty end-user online searching

project in the humanities report no 6 overview and

conclusions College and Research Libraries 57(6)514ndash23

Bedi S and Walde C (2016) Transforming roles

Canadian academic librarians embedded in faculty re-

search projects College and Research Libraries http

M Terras et al

8 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

crlacrlorgcontentearly20160322crl16-871

abstract

Bradigan P S and Mularski C A (1989) End-user

searching in a medical school curriculum an evaluated

modular approach Bulletin of the Medical Library

Association 77(4) 348ndash56

Brettle A Maden-Jenkins M Anderson L McNally

R Pratchett T Tancock J Thornton D and

Webb A (2011) Evaluating clinical librarian services

A systematic review Health Information and Libraries

Journal 28(1) 3ndash22

Burke J and Tumbleson B (2016) LMS embedded li-

brarianship and the educational role of librarians

Library Technology Reports 52(2) 5ndash9

Chadwick E (1842) Report on the sanitary condition of

the labouring population of Great Britain supplemen-

tary report on the results of special inquiry into the

practice of interment in towns (Vol 1) HM

Stationery Office

Donald D (1996) The Age of Caricature Satirical Prints

in the Reign of George III New Haven Published for the

Paul Mellon Centre for Studies in British Art by Yale

University Press

Farber M and Shoham S (2002) Users end-users

and end-user searchers of online information a his-

torical overview Online Information Review 26(2)

92ndash100

Feliu V and Frazer H (2012) Embedded librarians

teaching legal research as a lawyering skill Journal of

Legal Education 61(4) 540ndash59

Groen D Hetherington J Carver H B Nash R

W Bernabeu M O and Coveney P V (2013)

Analysing and modelling the performance of the

HemeLB lattice-Boltzmann simulation environ-

ment Journal of Computational Science 4(5) 412ndash

22

Hetherington J (2017) Question about data capacity

and use at UCL Response to Melissa Terras via email

27 January 2017

Hettrick S (2016) A not-so-brief history of Research

Software Engineers Software Sustainability Institute

httpswwwsoftwareacukblog2016-08-19-not-so-

brief-history-research-software-engineers

Huber M (2007) The Old Bailey Proceedings 1674-1834

Evaluating and annotating a corpus of 18th - and 19th-

century spoken English Studies in Variation Contacts and

Change in English 1 Annotating Variation and Change

httpwwwhelsinkifivariengseriesvolumes01huber

Hughes A (2009) Higher Education in a Web 20 WorldJISC httpwwwwebarchiveorgukwaybackarchive20140614042502httpwwwjiscacukpublicationsgeneralpublications2009heweb2aspx

Ikeda N R and Schwartz D G (1992) Impact ofend-user training on Pharmacy students a four-yearfollow-up study Bulletin of the Medical LibraryAssociation 80(2) 124ndash30

Lancaster F W Elzy C Zeter M J Metzler L andLow Y M (1994) Searching databases on CD-ROMcomparison of the results of end-user searching withresults from two modes of searching by skilled inter-mediaries RQ 33(3) 370ndash86

Leetaru K (2015) History as Big Data 500 Years of BookImages and Mapping Millions of Books Forbes Techhttpwwwforbescomsiteskalevleetaru20150916history-as-big-data-500-years-of-book-images-and-mapping-millions-of-books

Ludwig L Mixter J K and Emanuele M A (1988)User attitudes toward end-user literature searchingBulletin of the Medical Library Association 76(1)7ndash13

Lui A (2015) Data Collections and Datasets Curated byAlan Liu httpdhresourcesforprojectbuildingpbworkscomwpage69244469Data20Collections20and20Datasets

Mackenzie E (1817) A new and complete system ofmodern geography containing an accurate delineationof the world 2 Vols Printed and published byMackenzie and Dent Newcastle-upon-Tyne

Mahony S and Pierazzo E (2012) Teaching skills orteaching methodology In Hirsch B D (ed) DigitalHumanities Pedagogy Practices Principles and PoliticsOpen Book Publishers httpwwwopenbookpubl-isherscomproduct161digital-humanities-pedagogyndashpracticesndashprinciples-and-politics

Maidment B (2013) Comedy Caricature and the SocialOrder 1820-1850 Manchester Manchester UniversityPress

Markey K (2007a) Twenty-five years of end-user search-ing Part 1 research findings Journal of the AmericanSociety for Information Science and Technology 58(8)1071ndash81

Markey K (2007b) Twenty-five years of end-usersearching Part 2 future research directions Journal ofthe American Society for Information Science andTechnology 58(8) 1123ndash30

Murray T (2016)The forecast for special librariesJournal of Library Administration 56(2) 188ndash98

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 9 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

Rethlefsen M Farrell A Osterhaus Trzasko L and

Brigham T (2015) Librarian co-authors correlated

with higher quality reported search strategies in general

internal medicine systematic reviews Journal of Clinical

Epidemiology 68(6) 617ndash26

Rockwell G (2017) Question about maximum capacity

of Voyant tools Response to Melissa Terras via email 27

January 2017

Rockwell G and Sinclair S 2016 Hermeneutica

Computer-Assisted Interpretation in the Humanities

Cambridge MIT Press

Roper T (2015) The impact of the clinical librarian a

review Journal of EAHIL 11(4) 19ndash22

Schmidt B (2014) Shipping maps and how states see

Sapping Attention Blog httpsappingattentionblogspot

couk201403shipping-maps-and-how-states-seehtml

Sinclair S (2017) Question about maximum capacity of

Voyant tools Response to Melissa Terras via email 27

January 2017

Smith D A Cordell R and Dillon E M (2013)

Infectious texts modeling text reuse in nineteenth-

century newspapers In IEEE International Conference

on Big Data October 6-9 2013 Santa Clara

California IEEE pp 86ndash94 101109

BigData20136691675

Smith D Cordell R and Mullen A (2015)

Computational methods for uncovering reprinted

texts in Antebellum newspapers American Literary

History 27(3)

Starr S S and Renford B L (1987) Evaluation of a

program to teach health professionals to search

MEDLINE Bulletin of the Medical Library Association

75(3) 193ndash201

Stijnman A (2012) Engraving and Etching 1400-2000 A

History of the Development of Manual Intaglio

Printmaking Processes London Houten Netherlands

Archetype Publications in association with HES and

DE GRAAF Publishers

Straus S Eisinga A and Sackett D (2016) What

drove the evidence cart Bringing the library to the

bedside Journal of the Royal Society of Medicine

109(6) 241ndash7

Streipe T and Talley M (2013) Embedded librarian-

ship In Kroski E (ed) Law Librarianship in the Digital

Age Plymouth Scarecrow pp 13ndash30

Terras M (2009) Potentials and Problems in Applying High

Performance Computing for Research in the Arts and

Humanities Researching e-Science Analysis of Census

Holdings Digital Humanities Quarterly httpwwwdigi-

talhumanitiesorgdhqvol34000070000070html

Terras M (2015) Opening access to collections the

making and using of open digitised cultural content

Online Information Review 39(5) 733ndash52 httpwww

emeraldinsightcomdoifull101108OIR-06-2015-0193

Thomas J (2004) Pictorial Victorians The Inscription of

Values in Word and Image Athens Ohio University

Press

Voss A Asgari-Targhi M Procter R and

Fergusson D (2010) Adoption of e-Infrastructure

services configurations of practice Philosophical

Transactions Series A Mathematical Physical and

Engineering Sciences 368 4161ndash76 doi 101098

rsta20100162

Wall A J (1893) Asiatic cholera its history pathology

and modern treatment London H K Lewis

Wittek S Sinclair S and Milner M (2015) DREaM

Distant Reading Early Modernity Digital Humanities

2015 Global Digital Humanities Sydney Australia 29

Junendash3 July 2015 httpdh2015orgabstractsxml

WITTEK_Stephen_DREaM__Distant_Reading_Early_

ModerWITTEK_Stephen_DREaM__Distant_Reading_

Early_Modernityhtml

Wynne M (2013) The role of Clarin in digital trans-

formations in the humanities International Journal of

Humanities and Arts Computing 7(1-2) 89

Wynne M (2015) Big Data and Digital Transformations in

the Humanities are we there yet Textual Digital

Humanities and Social Sciences Data Aberdeen 21ndash22

September 2015 httpwwwslidesharenetmartinwynne

big-data-and-digital-transformations-in-the-humanities

Notes1 httpswwwbluksubjectsdigital-scholarship2 httpwwwuclacukdh3 httpwwwbartlettuclacukcasa4 httpwwwuclacukisdservicesresearch-it5 httpopenglamorg an initiative to promote free

and open access to digital cultural heritage data sets6 For example Mining the History of Medicine (http

nactemacukhom) or Language of the State of the

Union (httpwwwtheatlanticcompoliticsarchive

201501the-language-of-the-state-of-the-union

384575)7 For example CLARIN (httpclarineu) and

DARIAH (httpswwwdariaheu)

M Terras et al

10 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

8 The British Library has various digital data sets

including (but not limited to) 7 million pages of his-

toric newspapers 1 million out of copyright book il-

lustrations 100000s of scientific articles text from

over 60000 books 1000s of UK theses and various

digitized medieval manuscripts We chose here just

one of its large-scale data sets to work with in this

pilot phase For the terms under which the British

Library makes collections available see httpwww

blukaboutustermscopyright9 The books cover a wide range of subject areas includ-

ing philosophy history poetry and literature most

are in English For a full list of metadata for this book

collection see doi 1021250DB2110 httpswikircuclacukwikiLegion just one of the

HPC facilities available at UCL for researchers see

httpwwwuclacukisdservicesresearch-itresearch-

computing11 It is difficult to put a figure on what a manageable text

file size would be for a researcher if we take that as

being able to search through it on their own machine

without further technical assistance This is dependent

on access to processing power which varies consider-

ably between desktop and laptop computers and the

complexity of the search the researcher is intending to

carry out as well as the format and granularity of the

source text Most humanities researchers should now

be able to comfortably manipulate text files of around

100 MB if they have access to a modern machine the

amount of text those files will contain is dependent on

how complex the file formats are Programs and tools

available to assist in text analysis include Voyant Tools

(httpsvoyant-toolsorg) and R (httpswwwr-pro-

jectorg) The upper limit to using server-based

Voyant is 20 MB of text which should correlate to

around 1 million words depending on format

(Sinclair 2017) Voyant Server can be downloaded

and used locally with an upper limit of around

100 MB on a standard PCMac (Rockwell 2017)

Indexing engines such as Lucene (httplucene

apacheorg) can search larger amounts of text how-

ever visualization of results will require further pro-

cessing and can be complex For further exploration of

lsquocomputer-assisted interpretation in the humanitiesrsquo

(particularly using Voyant) see Rockwell and

Sinclair (2016) and an example of the approach of

generating manageable smaller corpora from a large-

scale Early Modern Sources can be found in Wittek et

al (2015)12 All code data visualizations and other outputs from

this pilot project are freely available at httpsgithub

comUCL-dataspring13 httpswwwuclacukdispeopleoliverdukewilliams14 httpswwwshefacukhistoryresearchstudentswil-

liam-finley15 For more on the UKrsquos eScience infrastructure see the

work of the eScience Institute httpwwwesiacuk

Plan-Europe is the Platform of National eScience

Centres in Europe (httpplan-europeeu) In the

USA the equivalent of eScience is known as

Cyberinfrastructure see the National Science

Foundationrsquos guides httpwwwnsfgovdivindex

jspdivfrac14ACI

16 httpccschemuclacuk17 This approach is similar to that employed in the 1980s

when computerized searching of specialist databases

was first available (Farber and Shoham 2002

Markey 2007a 2007b) At that time evaluations car-

ried out on programs to cascade train end users via

librarians reported increased user uptake and effi-

ciency as compared to direct end-user training by

database providers (see Starr and Renford 1987

Bradigan and Mularski 1989 Ikeda and Schwartz

1992 on the library-based training of medical staff to

use MEDLINE) Significantly studies showed that

while end users wanted to be able to search databases

themselves they also desired access to searches

mediated by librarians andor a team approach to

search utilizing the end userrsquos specific knowledge of

the language and jargon of their discipline and the

information professionalrsquos understanding of

BOOLEAN and other advanced search techniques

(Ludwig et al 1988 Lancaster et al 1994 Bates

1996) The move towards embedded librarians in

law practices (Feliu and Frazer 2012 Streipe and

Talley 2013 Murray 2016) and clinical librarians in

hospitals (Brettle et al 2011 Roper 2015 Straus et

al 2016) can be seen to be a modern iteration of a

team approach to search In Higher Education subject

librarians are well-positioned to form part of a research

team using computational methods andor provide

information literacy training that upskills humanists

in computational methods (Rethlefsen et al 2015

Bedi and Walde 2016 Burke and Tumbleson 2016)

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 11 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

Page 7: Edinburgh Research Explorer · Williams, O & Farquhar, A 2017, 'Enabling complex analysis of large-scale digital collections: Humanities research, high-performance computing, and

diversified Although the graph is not normalized bythe number of images in the data set for each yearand is therefore dominated by the greater volume ofbooks in the data set after 1800 it has proved auseful reference and a new way into the macro pat-terns of book illustration

Secondly we wrote a script that could be runlocally in R to create graphs based on image datafor single books Plotting the page number on thex-axis and figures as a percentage of the page on they-axis the script generated a visual representation ofthe size of illustrations in a book Finley selectedbooks for analysis to observe patterns in the use ofillustrations in books on history geology and top-ography (the subjects of his doctoral research) Here(Fig 3) we see this for the 1817 A new and completeSystem of Modern Geography a two-volume workpublished in Newcastle upon Tyne by Mackenzie(1817) The discrepancies in use of illustrations be-tween the two volumes took Finley back to thephysical books to assess how the placement andsize of images changed the reading experience be-tween volumes and to compare the findings withsimilar multi-volume works

Subsequent to the project Finley has continuedto use the project data and scripts in his researchFor example he has used the image location data toplot and compare the average position images inbooks This has underscored the value of generatingderived data that can be used locally by a researcheroutside the context of a HPC facility and a fundedproject

5 Infrastructuralrecommendations

From a technical perspective this pilot highlightedvarious sticking points when using infrastructuredeveloped predominantly for scientific researchThe combined data input and output volumeundertaken during our work (less than 300 GB) isonly moderately large by comparison to the scien-tific data sets UCL RITS usually encounters for al-though there are shared assumptions betweenresearch infrastructures (adoption of technicalstandards and the sharing of tools approachesand research outputs (Wynne 2015)) most of theUKrsquos university eScience15 infrastructure has beenconstructed specifically to run scientific and engin-eering simulations not for search and analysis of thetypes of heterogeneous data sets we see emanatingfrom cultural heritage institutions We had a largetextual input (224 GB) a simple calculation and asmall output summary of only a few KB By com-parison the typical engineering simulationaddresses moderately sized numerical input dataruns a long complicated calculation and producesa large output (multiple TBs) The average data sizeof project using the UCL data storage service is44 TB (Hetherington 2017) For example thework of the UCL Centre for ComputationalScience16 on brain blood flow simulations takes aninput file of around 1 GB and for a full productionsimulation recording a snapshot once every 200time steps produces 20 TB of output (Groen etal 2013) Poor uptake in the Arts andHumanities (Atkins et al 2010 Voss et al 2010)has meant that these computational systems havenot been optimized for Arts and Humanities work-loads The file system and network configuration of

0

25

50

75

100

7505002500

Imag

e si

ze (

of p

age)

Page number

Illustration Area by Page for Book ID 002322267

002322267_1 002322267_2

Fig 3 A search for figures in A new and complete Systemof Modern Geography two volumes (Mackenzie 1817)plotted by page (x-axis) percentage of page the eachfigure occupies (y-axis) and separated by volume

M Terras et al

6 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

LegionmdashUCL RITSrsquos centrally funded resource forrunning complex and large computational scientificqueries across a large number of coresmdashdid notmatch the way that the data set in question wasstructured (a large number of small zipped XMLfiles)

The complexities associated with redeployingarchitectures designed to work with scientific data(massive yet very structured) to the processing ofhumanities data (not massive but more unstruc-tured) should not be understated and are a majorfinding of this project Relevant libraries (such as anefficient XML processor) were needed to be installedand optimized for the hardware Also the dataneeded to be transformed to a structure that the par-allel file system (Lustre) could address efficiently(that is fewer larger files) We found that the archi-tecture at UCL which was configured for effectivecompute of scientific data was inputoutput limitedfor our processing requirements rather than compu-tationally limited Understanding the needs of ouruser community has already fed into the procure-ment and development of HPC facilities at UCL toensure that the systemsmdashwhich are available to allresearchersmdashcan deal with the variety and type ofdata that digital humanists wish to analyse in future

Best practice recommendations for similar pro-jects emerged from this work the need to buildmultiple derived data sets (counts of books andwords per year words and pages per book etc) tonormalize results and maintain statistical validitythe necessity of documenting decisions taken whenprocessing data and metadata and the value ofhaving fixed definable data for researchers to ex-plain results in relation to (and in turn the risksassociated with iterating data sets) We also dis-covered that a core set of four or five queries gavemost of the humanities researchers the type of in-formation they required to take a subset of dataaway to process effectively themselves searches forall variants of a word searches that return keywordsin context traced over time NOT searches for aword or phrase that ignored another word orphrase searches for a word when in close proximityto a second word and searches based on imagemetadata It is the subset of the data set that mosthumanities scholars required and were happy to be

presented with for further analysis (with most re-searchers wishing to see their search term in contextpresented with the complete page of the text it wasfound within to allow informed understanding)

A main finding of this pilot was given mosthumanities researchers have a research problemthat can be facilitated by a standard set of queriesacross large-scale textual data that it would be moreefficient to train a focussed group of service pro-viders to be able to generate the results needed byresearchers than providing widespread training ofhumanities academics in this area HigherEducation already employs librarians to assist insearching and training for searching (informationliteracy) and providing this professional groupwith adjustable lsquorecipesrsquo for defined computationalqueries and background training on their use wouldsituate access to infrastructure in the resource towhich humanists already turn for assistancemdashtheirsubject librarianmdashand thereby normalize such com-putational work within the general humanitiesworkflow17 In turn research software engineerscould be invoked as collaborators for their expertisesuch as for developing more complex searchesbeyond the basic recipes rather than having torepeat the defined searches across data for differentresearchers which would allow limited resources tobe used efficiently and to build on existing frame-works of support from both the library and comput-ing services

Given issues in resourcing such facilities at everyUniversity it may be more efficient for multipleHigher Education Institutions to support a specialistservice perhaps under the umbrella of the likes ofJisc Historical Texts (httphistoricaltextsjiscacuk)or national or legal deposit libraries Expertise andapproaches if not the service itself could also befacilitated through the likes of Digital ResearchInfrastructure for the Arts and Humanities(DARIAH) (httpwwwdariaheu) and CommonLanguage Resources and Technology Infrastructure(CLARIN) (httpswwwclarineu) However localsupport for researchers wishing to utilize existingeScience technologies within the Higher Educationsector should be possible Such support enables Artsand Humanities researchers to develop ongoingmutually beneficial relationships with research

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 7 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

computing within their own institution This inturn can encourage other researchers to use theseresources (rather than them being only available as aspecialist service which users have to seek out)Research computing infrastructure across the uni-versity sector will not meet the needs of ArtsHumanities and Social Sciences researchers unlessacademics in these fields becomes active users ofthe systems and their requirements can be takeninto account going forward

6 Conclusion

We successfully mounted large-scale humanitiesdata on HPC University infrastructure in an inter-disciplinary project that required input from manyprofessionals to aid the humanities scholars intheir research tasks The collaborative approachwe undertook in this project is labour-intensiveand does not scale This should not however dis-courage the sector from taking this work forwardWe found that many research questions can beexpressed with similar computational queriesalbeit with parameters adjusted to suit We recom-mend therefore that Higher EducationInstitutions or HEI clusters looking to build cap-acity for enabling complex analysis of large-scaledigital collections by their non-computationallytrained humanities researchers should considerthe following activities

(1) Invest in research software engineer capacityto deploy and maintain openly licensed large-scale digital collections from across the GLAMsector to facilitate research in the arts huma-nities and social and historical sciences

(2) Invest in training library staff to run theseinitial queries in collaboration with huma-nities faculty to support work with subsetsof data that are produced and to documentand manage resulting code and derived data

Our pilot project demonstrates that there are atpresent too many technical hurdles for most indi-viduals in the arts and humanities to consider ana-lysing large-scale open data sets Those hurdles canbe removed with initial help in ingest and

deployment of the data and the provision of spe-cific structured training and support which willallow humanities researchers to get to a subset ofuseful data they can comfortably and more simplyprocess themselves without the need for extensivesupport While we together with our partners haveplans to continue expanding the range and depth ofresearch carried out on our chosen data set thisproject has signposted many of the barriers to en-courage greater uptake of lsquobig datarsquo research acrossthe Arts and Humanities These findings should beof use to researchers wishing to use comparableapproaches and to service providers in researchcomputing aiming to encourage the use of sharedcomputational facilities by the Arts and Humanitiescommunity

AcknowledgementsThis project was funded by a pilot grant from theJisc Research Data Spring (JISC 3548) between Apriland July 2015 See httpswwwjiscacukrdpro-jectsresearch-data-spring for further details Theauthors thank our colleagues in the British Libraryand UCL RITS (particularly Clare Gryce) for theirsupport Geoffrey Rockwell and Stefan Sinclair fordiscussions on the current limitations of text ana-lysis for individual humanities scholars and Scott BWeingart for his assistance

ReferencesAtkins D E Borgman C L Bindhoff N Ellisman

M Felman S Foster I and Heck A (2010)

lsquolsquoRCUK Review of e-Science 2009rsquorsquo Research Councils

UK httpswwwepsrcacuknewseventspubsrcuk-review-of-e-science-2009-building-a-uk-foundation-

for-the-transformative-enhancement-of-research-and-

innovation

Bates M J (1996) The Getty end-user online searching

project in the humanities report no 6 overview and

conclusions College and Research Libraries 57(6)514ndash23

Bedi S and Walde C (2016) Transforming roles

Canadian academic librarians embedded in faculty re-

search projects College and Research Libraries http

M Terras et al

8 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

crlacrlorgcontentearly20160322crl16-871

abstract

Bradigan P S and Mularski C A (1989) End-user

searching in a medical school curriculum an evaluated

modular approach Bulletin of the Medical Library

Association 77(4) 348ndash56

Brettle A Maden-Jenkins M Anderson L McNally

R Pratchett T Tancock J Thornton D and

Webb A (2011) Evaluating clinical librarian services

A systematic review Health Information and Libraries

Journal 28(1) 3ndash22

Burke J and Tumbleson B (2016) LMS embedded li-

brarianship and the educational role of librarians

Library Technology Reports 52(2) 5ndash9

Chadwick E (1842) Report on the sanitary condition of

the labouring population of Great Britain supplemen-

tary report on the results of special inquiry into the

practice of interment in towns (Vol 1) HM

Stationery Office

Donald D (1996) The Age of Caricature Satirical Prints

in the Reign of George III New Haven Published for the

Paul Mellon Centre for Studies in British Art by Yale

University Press

Farber M and Shoham S (2002) Users end-users

and end-user searchers of online information a his-

torical overview Online Information Review 26(2)

92ndash100

Feliu V and Frazer H (2012) Embedded librarians

teaching legal research as a lawyering skill Journal of

Legal Education 61(4) 540ndash59

Groen D Hetherington J Carver H B Nash R

W Bernabeu M O and Coveney P V (2013)

Analysing and modelling the performance of the

HemeLB lattice-Boltzmann simulation environ-

ment Journal of Computational Science 4(5) 412ndash

22

Hetherington J (2017) Question about data capacity

and use at UCL Response to Melissa Terras via email

27 January 2017

Hettrick S (2016) A not-so-brief history of Research

Software Engineers Software Sustainability Institute

httpswwwsoftwareacukblog2016-08-19-not-so-

brief-history-research-software-engineers

Huber M (2007) The Old Bailey Proceedings 1674-1834

Evaluating and annotating a corpus of 18th - and 19th-

century spoken English Studies in Variation Contacts and

Change in English 1 Annotating Variation and Change

httpwwwhelsinkifivariengseriesvolumes01huber

Hughes A (2009) Higher Education in a Web 20 WorldJISC httpwwwwebarchiveorgukwaybackarchive20140614042502httpwwwjiscacukpublicationsgeneralpublications2009heweb2aspx

Ikeda N R and Schwartz D G (1992) Impact ofend-user training on Pharmacy students a four-yearfollow-up study Bulletin of the Medical LibraryAssociation 80(2) 124ndash30

Lancaster F W Elzy C Zeter M J Metzler L andLow Y M (1994) Searching databases on CD-ROMcomparison of the results of end-user searching withresults from two modes of searching by skilled inter-mediaries RQ 33(3) 370ndash86

Leetaru K (2015) History as Big Data 500 Years of BookImages and Mapping Millions of Books Forbes Techhttpwwwforbescomsiteskalevleetaru20150916history-as-big-data-500-years-of-book-images-and-mapping-millions-of-books

Ludwig L Mixter J K and Emanuele M A (1988)User attitudes toward end-user literature searchingBulletin of the Medical Library Association 76(1)7ndash13

Lui A (2015) Data Collections and Datasets Curated byAlan Liu httpdhresourcesforprojectbuildingpbworkscomwpage69244469Data20Collections20and20Datasets

Mackenzie E (1817) A new and complete system ofmodern geography containing an accurate delineationof the world 2 Vols Printed and published byMackenzie and Dent Newcastle-upon-Tyne

Mahony S and Pierazzo E (2012) Teaching skills orteaching methodology In Hirsch B D (ed) DigitalHumanities Pedagogy Practices Principles and PoliticsOpen Book Publishers httpwwwopenbookpubl-isherscomproduct161digital-humanities-pedagogyndashpracticesndashprinciples-and-politics

Maidment B (2013) Comedy Caricature and the SocialOrder 1820-1850 Manchester Manchester UniversityPress

Markey K (2007a) Twenty-five years of end-user search-ing Part 1 research findings Journal of the AmericanSociety for Information Science and Technology 58(8)1071ndash81

Markey K (2007b) Twenty-five years of end-usersearching Part 2 future research directions Journal ofthe American Society for Information Science andTechnology 58(8) 1123ndash30

Murray T (2016)The forecast for special librariesJournal of Library Administration 56(2) 188ndash98

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 9 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

Rethlefsen M Farrell A Osterhaus Trzasko L and

Brigham T (2015) Librarian co-authors correlated

with higher quality reported search strategies in general

internal medicine systematic reviews Journal of Clinical

Epidemiology 68(6) 617ndash26

Rockwell G (2017) Question about maximum capacity

of Voyant tools Response to Melissa Terras via email 27

January 2017

Rockwell G and Sinclair S 2016 Hermeneutica

Computer-Assisted Interpretation in the Humanities

Cambridge MIT Press

Roper T (2015) The impact of the clinical librarian a

review Journal of EAHIL 11(4) 19ndash22

Schmidt B (2014) Shipping maps and how states see

Sapping Attention Blog httpsappingattentionblogspot

couk201403shipping-maps-and-how-states-seehtml

Sinclair S (2017) Question about maximum capacity of

Voyant tools Response to Melissa Terras via email 27

January 2017

Smith D A Cordell R and Dillon E M (2013)

Infectious texts modeling text reuse in nineteenth-

century newspapers In IEEE International Conference

on Big Data October 6-9 2013 Santa Clara

California IEEE pp 86ndash94 101109

BigData20136691675

Smith D Cordell R and Mullen A (2015)

Computational methods for uncovering reprinted

texts in Antebellum newspapers American Literary

History 27(3)

Starr S S and Renford B L (1987) Evaluation of a

program to teach health professionals to search

MEDLINE Bulletin of the Medical Library Association

75(3) 193ndash201

Stijnman A (2012) Engraving and Etching 1400-2000 A

History of the Development of Manual Intaglio

Printmaking Processes London Houten Netherlands

Archetype Publications in association with HES and

DE GRAAF Publishers

Straus S Eisinga A and Sackett D (2016) What

drove the evidence cart Bringing the library to the

bedside Journal of the Royal Society of Medicine

109(6) 241ndash7

Streipe T and Talley M (2013) Embedded librarian-

ship In Kroski E (ed) Law Librarianship in the Digital

Age Plymouth Scarecrow pp 13ndash30

Terras M (2009) Potentials and Problems in Applying High

Performance Computing for Research in the Arts and

Humanities Researching e-Science Analysis of Census

Holdings Digital Humanities Quarterly httpwwwdigi-

talhumanitiesorgdhqvol34000070000070html

Terras M (2015) Opening access to collections the

making and using of open digitised cultural content

Online Information Review 39(5) 733ndash52 httpwww

emeraldinsightcomdoifull101108OIR-06-2015-0193

Thomas J (2004) Pictorial Victorians The Inscription of

Values in Word and Image Athens Ohio University

Press

Voss A Asgari-Targhi M Procter R and

Fergusson D (2010) Adoption of e-Infrastructure

services configurations of practice Philosophical

Transactions Series A Mathematical Physical and

Engineering Sciences 368 4161ndash76 doi 101098

rsta20100162

Wall A J (1893) Asiatic cholera its history pathology

and modern treatment London H K Lewis

Wittek S Sinclair S and Milner M (2015) DREaM

Distant Reading Early Modernity Digital Humanities

2015 Global Digital Humanities Sydney Australia 29

Junendash3 July 2015 httpdh2015orgabstractsxml

WITTEK_Stephen_DREaM__Distant_Reading_Early_

ModerWITTEK_Stephen_DREaM__Distant_Reading_

Early_Modernityhtml

Wynne M (2013) The role of Clarin in digital trans-

formations in the humanities International Journal of

Humanities and Arts Computing 7(1-2) 89

Wynne M (2015) Big Data and Digital Transformations in

the Humanities are we there yet Textual Digital

Humanities and Social Sciences Data Aberdeen 21ndash22

September 2015 httpwwwslidesharenetmartinwynne

big-data-and-digital-transformations-in-the-humanities

Notes1 httpswwwbluksubjectsdigital-scholarship2 httpwwwuclacukdh3 httpwwwbartlettuclacukcasa4 httpwwwuclacukisdservicesresearch-it5 httpopenglamorg an initiative to promote free

and open access to digital cultural heritage data sets6 For example Mining the History of Medicine (http

nactemacukhom) or Language of the State of the

Union (httpwwwtheatlanticcompoliticsarchive

201501the-language-of-the-state-of-the-union

384575)7 For example CLARIN (httpclarineu) and

DARIAH (httpswwwdariaheu)

M Terras et al

10 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

8 The British Library has various digital data sets

including (but not limited to) 7 million pages of his-

toric newspapers 1 million out of copyright book il-

lustrations 100000s of scientific articles text from

over 60000 books 1000s of UK theses and various

digitized medieval manuscripts We chose here just

one of its large-scale data sets to work with in this

pilot phase For the terms under which the British

Library makes collections available see httpwww

blukaboutustermscopyright9 The books cover a wide range of subject areas includ-

ing philosophy history poetry and literature most

are in English For a full list of metadata for this book

collection see doi 1021250DB2110 httpswikircuclacukwikiLegion just one of the

HPC facilities available at UCL for researchers see

httpwwwuclacukisdservicesresearch-itresearch-

computing11 It is difficult to put a figure on what a manageable text

file size would be for a researcher if we take that as

being able to search through it on their own machine

without further technical assistance This is dependent

on access to processing power which varies consider-

ably between desktop and laptop computers and the

complexity of the search the researcher is intending to

carry out as well as the format and granularity of the

source text Most humanities researchers should now

be able to comfortably manipulate text files of around

100 MB if they have access to a modern machine the

amount of text those files will contain is dependent on

how complex the file formats are Programs and tools

available to assist in text analysis include Voyant Tools

(httpsvoyant-toolsorg) and R (httpswwwr-pro-

jectorg) The upper limit to using server-based

Voyant is 20 MB of text which should correlate to

around 1 million words depending on format

(Sinclair 2017) Voyant Server can be downloaded

and used locally with an upper limit of around

100 MB on a standard PCMac (Rockwell 2017)

Indexing engines such as Lucene (httplucene

apacheorg) can search larger amounts of text how-

ever visualization of results will require further pro-

cessing and can be complex For further exploration of

lsquocomputer-assisted interpretation in the humanitiesrsquo

(particularly using Voyant) see Rockwell and

Sinclair (2016) and an example of the approach of

generating manageable smaller corpora from a large-

scale Early Modern Sources can be found in Wittek et

al (2015)12 All code data visualizations and other outputs from

this pilot project are freely available at httpsgithub

comUCL-dataspring13 httpswwwuclacukdispeopleoliverdukewilliams14 httpswwwshefacukhistoryresearchstudentswil-

liam-finley15 For more on the UKrsquos eScience infrastructure see the

work of the eScience Institute httpwwwesiacuk

Plan-Europe is the Platform of National eScience

Centres in Europe (httpplan-europeeu) In the

USA the equivalent of eScience is known as

Cyberinfrastructure see the National Science

Foundationrsquos guides httpwwwnsfgovdivindex

jspdivfrac14ACI

16 httpccschemuclacuk17 This approach is similar to that employed in the 1980s

when computerized searching of specialist databases

was first available (Farber and Shoham 2002

Markey 2007a 2007b) At that time evaluations car-

ried out on programs to cascade train end users via

librarians reported increased user uptake and effi-

ciency as compared to direct end-user training by

database providers (see Starr and Renford 1987

Bradigan and Mularski 1989 Ikeda and Schwartz

1992 on the library-based training of medical staff to

use MEDLINE) Significantly studies showed that

while end users wanted to be able to search databases

themselves they also desired access to searches

mediated by librarians andor a team approach to

search utilizing the end userrsquos specific knowledge of

the language and jargon of their discipline and the

information professionalrsquos understanding of

BOOLEAN and other advanced search techniques

(Ludwig et al 1988 Lancaster et al 1994 Bates

1996) The move towards embedded librarians in

law practices (Feliu and Frazer 2012 Streipe and

Talley 2013 Murray 2016) and clinical librarians in

hospitals (Brettle et al 2011 Roper 2015 Straus et

al 2016) can be seen to be a modern iteration of a

team approach to search In Higher Education subject

librarians are well-positioned to form part of a research

team using computational methods andor provide

information literacy training that upskills humanists

in computational methods (Rethlefsen et al 2015

Bedi and Walde 2016 Burke and Tumbleson 2016)

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 11 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

Page 8: Edinburgh Research Explorer · Williams, O & Farquhar, A 2017, 'Enabling complex analysis of large-scale digital collections: Humanities research, high-performance computing, and

LegionmdashUCL RITSrsquos centrally funded resource forrunning complex and large computational scientificqueries across a large number of coresmdashdid notmatch the way that the data set in question wasstructured (a large number of small zipped XMLfiles)

The complexities associated with redeployingarchitectures designed to work with scientific data(massive yet very structured) to the processing ofhumanities data (not massive but more unstruc-tured) should not be understated and are a majorfinding of this project Relevant libraries (such as anefficient XML processor) were needed to be installedand optimized for the hardware Also the dataneeded to be transformed to a structure that the par-allel file system (Lustre) could address efficiently(that is fewer larger files) We found that the archi-tecture at UCL which was configured for effectivecompute of scientific data was inputoutput limitedfor our processing requirements rather than compu-tationally limited Understanding the needs of ouruser community has already fed into the procure-ment and development of HPC facilities at UCL toensure that the systemsmdashwhich are available to allresearchersmdashcan deal with the variety and type ofdata that digital humanists wish to analyse in future

Best practice recommendations for similar pro-jects emerged from this work the need to buildmultiple derived data sets (counts of books andwords per year words and pages per book etc) tonormalize results and maintain statistical validitythe necessity of documenting decisions taken whenprocessing data and metadata and the value ofhaving fixed definable data for researchers to ex-plain results in relation to (and in turn the risksassociated with iterating data sets) We also dis-covered that a core set of four or five queries gavemost of the humanities researchers the type of in-formation they required to take a subset of dataaway to process effectively themselves searches forall variants of a word searches that return keywordsin context traced over time NOT searches for aword or phrase that ignored another word orphrase searches for a word when in close proximityto a second word and searches based on imagemetadata It is the subset of the data set that mosthumanities scholars required and were happy to be

presented with for further analysis (with most re-searchers wishing to see their search term in contextpresented with the complete page of the text it wasfound within to allow informed understanding)

A main finding of this pilot was given mosthumanities researchers have a research problemthat can be facilitated by a standard set of queriesacross large-scale textual data that it would be moreefficient to train a focussed group of service pro-viders to be able to generate the results needed byresearchers than providing widespread training ofhumanities academics in this area HigherEducation already employs librarians to assist insearching and training for searching (informationliteracy) and providing this professional groupwith adjustable lsquorecipesrsquo for defined computationalqueries and background training on their use wouldsituate access to infrastructure in the resource towhich humanists already turn for assistancemdashtheirsubject librarianmdashand thereby normalize such com-putational work within the general humanitiesworkflow17 In turn research software engineerscould be invoked as collaborators for their expertisesuch as for developing more complex searchesbeyond the basic recipes rather than having torepeat the defined searches across data for differentresearchers which would allow limited resources tobe used efficiently and to build on existing frame-works of support from both the library and comput-ing services

Given issues in resourcing such facilities at everyUniversity it may be more efficient for multipleHigher Education Institutions to support a specialistservice perhaps under the umbrella of the likes ofJisc Historical Texts (httphistoricaltextsjiscacuk)or national or legal deposit libraries Expertise andapproaches if not the service itself could also befacilitated through the likes of Digital ResearchInfrastructure for the Arts and Humanities(DARIAH) (httpwwwdariaheu) and CommonLanguage Resources and Technology Infrastructure(CLARIN) (httpswwwclarineu) However localsupport for researchers wishing to utilize existingeScience technologies within the Higher Educationsector should be possible Such support enables Artsand Humanities researchers to develop ongoingmutually beneficial relationships with research

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 7 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

computing within their own institution This inturn can encourage other researchers to use theseresources (rather than them being only available as aspecialist service which users have to seek out)Research computing infrastructure across the uni-versity sector will not meet the needs of ArtsHumanities and Social Sciences researchers unlessacademics in these fields becomes active users ofthe systems and their requirements can be takeninto account going forward

6 Conclusion

We successfully mounted large-scale humanitiesdata on HPC University infrastructure in an inter-disciplinary project that required input from manyprofessionals to aid the humanities scholars intheir research tasks The collaborative approachwe undertook in this project is labour-intensiveand does not scale This should not however dis-courage the sector from taking this work forwardWe found that many research questions can beexpressed with similar computational queriesalbeit with parameters adjusted to suit We recom-mend therefore that Higher EducationInstitutions or HEI clusters looking to build cap-acity for enabling complex analysis of large-scaledigital collections by their non-computationallytrained humanities researchers should considerthe following activities

(1) Invest in research software engineer capacityto deploy and maintain openly licensed large-scale digital collections from across the GLAMsector to facilitate research in the arts huma-nities and social and historical sciences

(2) Invest in training library staff to run theseinitial queries in collaboration with huma-nities faculty to support work with subsetsof data that are produced and to documentand manage resulting code and derived data

Our pilot project demonstrates that there are atpresent too many technical hurdles for most indi-viduals in the arts and humanities to consider ana-lysing large-scale open data sets Those hurdles canbe removed with initial help in ingest and

deployment of the data and the provision of spe-cific structured training and support which willallow humanities researchers to get to a subset ofuseful data they can comfortably and more simplyprocess themselves without the need for extensivesupport While we together with our partners haveplans to continue expanding the range and depth ofresearch carried out on our chosen data set thisproject has signposted many of the barriers to en-courage greater uptake of lsquobig datarsquo research acrossthe Arts and Humanities These findings should beof use to researchers wishing to use comparableapproaches and to service providers in researchcomputing aiming to encourage the use of sharedcomputational facilities by the Arts and Humanitiescommunity

AcknowledgementsThis project was funded by a pilot grant from theJisc Research Data Spring (JISC 3548) between Apriland July 2015 See httpswwwjiscacukrdpro-jectsresearch-data-spring for further details Theauthors thank our colleagues in the British Libraryand UCL RITS (particularly Clare Gryce) for theirsupport Geoffrey Rockwell and Stefan Sinclair fordiscussions on the current limitations of text ana-lysis for individual humanities scholars and Scott BWeingart for his assistance

ReferencesAtkins D E Borgman C L Bindhoff N Ellisman

M Felman S Foster I and Heck A (2010)

lsquolsquoRCUK Review of e-Science 2009rsquorsquo Research Councils

UK httpswwwepsrcacuknewseventspubsrcuk-review-of-e-science-2009-building-a-uk-foundation-

for-the-transformative-enhancement-of-research-and-

innovation

Bates M J (1996) The Getty end-user online searching

project in the humanities report no 6 overview and

conclusions College and Research Libraries 57(6)514ndash23

Bedi S and Walde C (2016) Transforming roles

Canadian academic librarians embedded in faculty re-

search projects College and Research Libraries http

M Terras et al

8 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

crlacrlorgcontentearly20160322crl16-871

abstract

Bradigan P S and Mularski C A (1989) End-user

searching in a medical school curriculum an evaluated

modular approach Bulletin of the Medical Library

Association 77(4) 348ndash56

Brettle A Maden-Jenkins M Anderson L McNally

R Pratchett T Tancock J Thornton D and

Webb A (2011) Evaluating clinical librarian services

A systematic review Health Information and Libraries

Journal 28(1) 3ndash22

Burke J and Tumbleson B (2016) LMS embedded li-

brarianship and the educational role of librarians

Library Technology Reports 52(2) 5ndash9

Chadwick E (1842) Report on the sanitary condition of

the labouring population of Great Britain supplemen-

tary report on the results of special inquiry into the

practice of interment in towns (Vol 1) HM

Stationery Office

Donald D (1996) The Age of Caricature Satirical Prints

in the Reign of George III New Haven Published for the

Paul Mellon Centre for Studies in British Art by Yale

University Press

Farber M and Shoham S (2002) Users end-users

and end-user searchers of online information a his-

torical overview Online Information Review 26(2)

92ndash100

Feliu V and Frazer H (2012) Embedded librarians

teaching legal research as a lawyering skill Journal of

Legal Education 61(4) 540ndash59

Groen D Hetherington J Carver H B Nash R

W Bernabeu M O and Coveney P V (2013)

Analysing and modelling the performance of the

HemeLB lattice-Boltzmann simulation environ-

ment Journal of Computational Science 4(5) 412ndash

22

Hetherington J (2017) Question about data capacity

and use at UCL Response to Melissa Terras via email

27 January 2017

Hettrick S (2016) A not-so-brief history of Research

Software Engineers Software Sustainability Institute

httpswwwsoftwareacukblog2016-08-19-not-so-

brief-history-research-software-engineers

Huber M (2007) The Old Bailey Proceedings 1674-1834

Evaluating and annotating a corpus of 18th - and 19th-

century spoken English Studies in Variation Contacts and

Change in English 1 Annotating Variation and Change

httpwwwhelsinkifivariengseriesvolumes01huber

Hughes A (2009) Higher Education in a Web 20 WorldJISC httpwwwwebarchiveorgukwaybackarchive20140614042502httpwwwjiscacukpublicationsgeneralpublications2009heweb2aspx

Ikeda N R and Schwartz D G (1992) Impact ofend-user training on Pharmacy students a four-yearfollow-up study Bulletin of the Medical LibraryAssociation 80(2) 124ndash30

Lancaster F W Elzy C Zeter M J Metzler L andLow Y M (1994) Searching databases on CD-ROMcomparison of the results of end-user searching withresults from two modes of searching by skilled inter-mediaries RQ 33(3) 370ndash86

Leetaru K (2015) History as Big Data 500 Years of BookImages and Mapping Millions of Books Forbes Techhttpwwwforbescomsiteskalevleetaru20150916history-as-big-data-500-years-of-book-images-and-mapping-millions-of-books

Ludwig L Mixter J K and Emanuele M A (1988)User attitudes toward end-user literature searchingBulletin of the Medical Library Association 76(1)7ndash13

Lui A (2015) Data Collections and Datasets Curated byAlan Liu httpdhresourcesforprojectbuildingpbworkscomwpage69244469Data20Collections20and20Datasets

Mackenzie E (1817) A new and complete system ofmodern geography containing an accurate delineationof the world 2 Vols Printed and published byMackenzie and Dent Newcastle-upon-Tyne

Mahony S and Pierazzo E (2012) Teaching skills orteaching methodology In Hirsch B D (ed) DigitalHumanities Pedagogy Practices Principles and PoliticsOpen Book Publishers httpwwwopenbookpubl-isherscomproduct161digital-humanities-pedagogyndashpracticesndashprinciples-and-politics

Maidment B (2013) Comedy Caricature and the SocialOrder 1820-1850 Manchester Manchester UniversityPress

Markey K (2007a) Twenty-five years of end-user search-ing Part 1 research findings Journal of the AmericanSociety for Information Science and Technology 58(8)1071ndash81

Markey K (2007b) Twenty-five years of end-usersearching Part 2 future research directions Journal ofthe American Society for Information Science andTechnology 58(8) 1123ndash30

Murray T (2016)The forecast for special librariesJournal of Library Administration 56(2) 188ndash98

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 9 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

Rethlefsen M Farrell A Osterhaus Trzasko L and

Brigham T (2015) Librarian co-authors correlated

with higher quality reported search strategies in general

internal medicine systematic reviews Journal of Clinical

Epidemiology 68(6) 617ndash26

Rockwell G (2017) Question about maximum capacity

of Voyant tools Response to Melissa Terras via email 27

January 2017

Rockwell G and Sinclair S 2016 Hermeneutica

Computer-Assisted Interpretation in the Humanities

Cambridge MIT Press

Roper T (2015) The impact of the clinical librarian a

review Journal of EAHIL 11(4) 19ndash22

Schmidt B (2014) Shipping maps and how states see

Sapping Attention Blog httpsappingattentionblogspot

couk201403shipping-maps-and-how-states-seehtml

Sinclair S (2017) Question about maximum capacity of

Voyant tools Response to Melissa Terras via email 27

January 2017

Smith D A Cordell R and Dillon E M (2013)

Infectious texts modeling text reuse in nineteenth-

century newspapers In IEEE International Conference

on Big Data October 6-9 2013 Santa Clara

California IEEE pp 86ndash94 101109

BigData20136691675

Smith D Cordell R and Mullen A (2015)

Computational methods for uncovering reprinted

texts in Antebellum newspapers American Literary

History 27(3)

Starr S S and Renford B L (1987) Evaluation of a

program to teach health professionals to search

MEDLINE Bulletin of the Medical Library Association

75(3) 193ndash201

Stijnman A (2012) Engraving and Etching 1400-2000 A

History of the Development of Manual Intaglio

Printmaking Processes London Houten Netherlands

Archetype Publications in association with HES and

DE GRAAF Publishers

Straus S Eisinga A and Sackett D (2016) What

drove the evidence cart Bringing the library to the

bedside Journal of the Royal Society of Medicine

109(6) 241ndash7

Streipe T and Talley M (2013) Embedded librarian-

ship In Kroski E (ed) Law Librarianship in the Digital

Age Plymouth Scarecrow pp 13ndash30

Terras M (2009) Potentials and Problems in Applying High

Performance Computing for Research in the Arts and

Humanities Researching e-Science Analysis of Census

Holdings Digital Humanities Quarterly httpwwwdigi-

talhumanitiesorgdhqvol34000070000070html

Terras M (2015) Opening access to collections the

making and using of open digitised cultural content

Online Information Review 39(5) 733ndash52 httpwww

emeraldinsightcomdoifull101108OIR-06-2015-0193

Thomas J (2004) Pictorial Victorians The Inscription of

Values in Word and Image Athens Ohio University

Press

Voss A Asgari-Targhi M Procter R and

Fergusson D (2010) Adoption of e-Infrastructure

services configurations of practice Philosophical

Transactions Series A Mathematical Physical and

Engineering Sciences 368 4161ndash76 doi 101098

rsta20100162

Wall A J (1893) Asiatic cholera its history pathology

and modern treatment London H K Lewis

Wittek S Sinclair S and Milner M (2015) DREaM

Distant Reading Early Modernity Digital Humanities

2015 Global Digital Humanities Sydney Australia 29

Junendash3 July 2015 httpdh2015orgabstractsxml

WITTEK_Stephen_DREaM__Distant_Reading_Early_

ModerWITTEK_Stephen_DREaM__Distant_Reading_

Early_Modernityhtml

Wynne M (2013) The role of Clarin in digital trans-

formations in the humanities International Journal of

Humanities and Arts Computing 7(1-2) 89

Wynne M (2015) Big Data and Digital Transformations in

the Humanities are we there yet Textual Digital

Humanities and Social Sciences Data Aberdeen 21ndash22

September 2015 httpwwwslidesharenetmartinwynne

big-data-and-digital-transformations-in-the-humanities

Notes1 httpswwwbluksubjectsdigital-scholarship2 httpwwwuclacukdh3 httpwwwbartlettuclacukcasa4 httpwwwuclacukisdservicesresearch-it5 httpopenglamorg an initiative to promote free

and open access to digital cultural heritage data sets6 For example Mining the History of Medicine (http

nactemacukhom) or Language of the State of the

Union (httpwwwtheatlanticcompoliticsarchive

201501the-language-of-the-state-of-the-union

384575)7 For example CLARIN (httpclarineu) and

DARIAH (httpswwwdariaheu)

M Terras et al

10 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

8 The British Library has various digital data sets

including (but not limited to) 7 million pages of his-

toric newspapers 1 million out of copyright book il-

lustrations 100000s of scientific articles text from

over 60000 books 1000s of UK theses and various

digitized medieval manuscripts We chose here just

one of its large-scale data sets to work with in this

pilot phase For the terms under which the British

Library makes collections available see httpwww

blukaboutustermscopyright9 The books cover a wide range of subject areas includ-

ing philosophy history poetry and literature most

are in English For a full list of metadata for this book

collection see doi 1021250DB2110 httpswikircuclacukwikiLegion just one of the

HPC facilities available at UCL for researchers see

httpwwwuclacukisdservicesresearch-itresearch-

computing11 It is difficult to put a figure on what a manageable text

file size would be for a researcher if we take that as

being able to search through it on their own machine

without further technical assistance This is dependent

on access to processing power which varies consider-

ably between desktop and laptop computers and the

complexity of the search the researcher is intending to

carry out as well as the format and granularity of the

source text Most humanities researchers should now

be able to comfortably manipulate text files of around

100 MB if they have access to a modern machine the

amount of text those files will contain is dependent on

how complex the file formats are Programs and tools

available to assist in text analysis include Voyant Tools

(httpsvoyant-toolsorg) and R (httpswwwr-pro-

jectorg) The upper limit to using server-based

Voyant is 20 MB of text which should correlate to

around 1 million words depending on format

(Sinclair 2017) Voyant Server can be downloaded

and used locally with an upper limit of around

100 MB on a standard PCMac (Rockwell 2017)

Indexing engines such as Lucene (httplucene

apacheorg) can search larger amounts of text how-

ever visualization of results will require further pro-

cessing and can be complex For further exploration of

lsquocomputer-assisted interpretation in the humanitiesrsquo

(particularly using Voyant) see Rockwell and

Sinclair (2016) and an example of the approach of

generating manageable smaller corpora from a large-

scale Early Modern Sources can be found in Wittek et

al (2015)12 All code data visualizations and other outputs from

this pilot project are freely available at httpsgithub

comUCL-dataspring13 httpswwwuclacukdispeopleoliverdukewilliams14 httpswwwshefacukhistoryresearchstudentswil-

liam-finley15 For more on the UKrsquos eScience infrastructure see the

work of the eScience Institute httpwwwesiacuk

Plan-Europe is the Platform of National eScience

Centres in Europe (httpplan-europeeu) In the

USA the equivalent of eScience is known as

Cyberinfrastructure see the National Science

Foundationrsquos guides httpwwwnsfgovdivindex

jspdivfrac14ACI

16 httpccschemuclacuk17 This approach is similar to that employed in the 1980s

when computerized searching of specialist databases

was first available (Farber and Shoham 2002

Markey 2007a 2007b) At that time evaluations car-

ried out on programs to cascade train end users via

librarians reported increased user uptake and effi-

ciency as compared to direct end-user training by

database providers (see Starr and Renford 1987

Bradigan and Mularski 1989 Ikeda and Schwartz

1992 on the library-based training of medical staff to

use MEDLINE) Significantly studies showed that

while end users wanted to be able to search databases

themselves they also desired access to searches

mediated by librarians andor a team approach to

search utilizing the end userrsquos specific knowledge of

the language and jargon of their discipline and the

information professionalrsquos understanding of

BOOLEAN and other advanced search techniques

(Ludwig et al 1988 Lancaster et al 1994 Bates

1996) The move towards embedded librarians in

law practices (Feliu and Frazer 2012 Streipe and

Talley 2013 Murray 2016) and clinical librarians in

hospitals (Brettle et al 2011 Roper 2015 Straus et

al 2016) can be seen to be a modern iteration of a

team approach to search In Higher Education subject

librarians are well-positioned to form part of a research

team using computational methods andor provide

information literacy training that upskills humanists

in computational methods (Rethlefsen et al 2015

Bedi and Walde 2016 Burke and Tumbleson 2016)

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 11 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

Page 9: Edinburgh Research Explorer · Williams, O & Farquhar, A 2017, 'Enabling complex analysis of large-scale digital collections: Humanities research, high-performance computing, and

computing within their own institution This inturn can encourage other researchers to use theseresources (rather than them being only available as aspecialist service which users have to seek out)Research computing infrastructure across the uni-versity sector will not meet the needs of ArtsHumanities and Social Sciences researchers unlessacademics in these fields becomes active users ofthe systems and their requirements can be takeninto account going forward

6 Conclusion

We successfully mounted large-scale humanitiesdata on HPC University infrastructure in an inter-disciplinary project that required input from manyprofessionals to aid the humanities scholars intheir research tasks The collaborative approachwe undertook in this project is labour-intensiveand does not scale This should not however dis-courage the sector from taking this work forwardWe found that many research questions can beexpressed with similar computational queriesalbeit with parameters adjusted to suit We recom-mend therefore that Higher EducationInstitutions or HEI clusters looking to build cap-acity for enabling complex analysis of large-scaledigital collections by their non-computationallytrained humanities researchers should considerthe following activities

(1) Invest in research software engineer capacityto deploy and maintain openly licensed large-scale digital collections from across the GLAMsector to facilitate research in the arts huma-nities and social and historical sciences

(2) Invest in training library staff to run theseinitial queries in collaboration with huma-nities faculty to support work with subsetsof data that are produced and to documentand manage resulting code and derived data

Our pilot project demonstrates that there are atpresent too many technical hurdles for most indi-viduals in the arts and humanities to consider ana-lysing large-scale open data sets Those hurdles canbe removed with initial help in ingest and

deployment of the data and the provision of spe-cific structured training and support which willallow humanities researchers to get to a subset ofuseful data they can comfortably and more simplyprocess themselves without the need for extensivesupport While we together with our partners haveplans to continue expanding the range and depth ofresearch carried out on our chosen data set thisproject has signposted many of the barriers to en-courage greater uptake of lsquobig datarsquo research acrossthe Arts and Humanities These findings should beof use to researchers wishing to use comparableapproaches and to service providers in researchcomputing aiming to encourage the use of sharedcomputational facilities by the Arts and Humanitiescommunity

AcknowledgementsThis project was funded by a pilot grant from theJisc Research Data Spring (JISC 3548) between Apriland July 2015 See httpswwwjiscacukrdpro-jectsresearch-data-spring for further details Theauthors thank our colleagues in the British Libraryand UCL RITS (particularly Clare Gryce) for theirsupport Geoffrey Rockwell and Stefan Sinclair fordiscussions on the current limitations of text ana-lysis for individual humanities scholars and Scott BWeingart for his assistance

ReferencesAtkins D E Borgman C L Bindhoff N Ellisman

M Felman S Foster I and Heck A (2010)

lsquolsquoRCUK Review of e-Science 2009rsquorsquo Research Councils

UK httpswwwepsrcacuknewseventspubsrcuk-review-of-e-science-2009-building-a-uk-foundation-

for-the-transformative-enhancement-of-research-and-

innovation

Bates M J (1996) The Getty end-user online searching

project in the humanities report no 6 overview and

conclusions College and Research Libraries 57(6)514ndash23

Bedi S and Walde C (2016) Transforming roles

Canadian academic librarians embedded in faculty re-

search projects College and Research Libraries http

M Terras et al

8 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

crlacrlorgcontentearly20160322crl16-871

abstract

Bradigan P S and Mularski C A (1989) End-user

searching in a medical school curriculum an evaluated

modular approach Bulletin of the Medical Library

Association 77(4) 348ndash56

Brettle A Maden-Jenkins M Anderson L McNally

R Pratchett T Tancock J Thornton D and

Webb A (2011) Evaluating clinical librarian services

A systematic review Health Information and Libraries

Journal 28(1) 3ndash22

Burke J and Tumbleson B (2016) LMS embedded li-

brarianship and the educational role of librarians

Library Technology Reports 52(2) 5ndash9

Chadwick E (1842) Report on the sanitary condition of

the labouring population of Great Britain supplemen-

tary report on the results of special inquiry into the

practice of interment in towns (Vol 1) HM

Stationery Office

Donald D (1996) The Age of Caricature Satirical Prints

in the Reign of George III New Haven Published for the

Paul Mellon Centre for Studies in British Art by Yale

University Press

Farber M and Shoham S (2002) Users end-users

and end-user searchers of online information a his-

torical overview Online Information Review 26(2)

92ndash100

Feliu V and Frazer H (2012) Embedded librarians

teaching legal research as a lawyering skill Journal of

Legal Education 61(4) 540ndash59

Groen D Hetherington J Carver H B Nash R

W Bernabeu M O and Coveney P V (2013)

Analysing and modelling the performance of the

HemeLB lattice-Boltzmann simulation environ-

ment Journal of Computational Science 4(5) 412ndash

22

Hetherington J (2017) Question about data capacity

and use at UCL Response to Melissa Terras via email

27 January 2017

Hettrick S (2016) A not-so-brief history of Research

Software Engineers Software Sustainability Institute

httpswwwsoftwareacukblog2016-08-19-not-so-

brief-history-research-software-engineers

Huber M (2007) The Old Bailey Proceedings 1674-1834

Evaluating and annotating a corpus of 18th - and 19th-

century spoken English Studies in Variation Contacts and

Change in English 1 Annotating Variation and Change

httpwwwhelsinkifivariengseriesvolumes01huber

Hughes A (2009) Higher Education in a Web 20 WorldJISC httpwwwwebarchiveorgukwaybackarchive20140614042502httpwwwjiscacukpublicationsgeneralpublications2009heweb2aspx

Ikeda N R and Schwartz D G (1992) Impact ofend-user training on Pharmacy students a four-yearfollow-up study Bulletin of the Medical LibraryAssociation 80(2) 124ndash30

Lancaster F W Elzy C Zeter M J Metzler L andLow Y M (1994) Searching databases on CD-ROMcomparison of the results of end-user searching withresults from two modes of searching by skilled inter-mediaries RQ 33(3) 370ndash86

Leetaru K (2015) History as Big Data 500 Years of BookImages and Mapping Millions of Books Forbes Techhttpwwwforbescomsiteskalevleetaru20150916history-as-big-data-500-years-of-book-images-and-mapping-millions-of-books

Ludwig L Mixter J K and Emanuele M A (1988)User attitudes toward end-user literature searchingBulletin of the Medical Library Association 76(1)7ndash13

Lui A (2015) Data Collections and Datasets Curated byAlan Liu httpdhresourcesforprojectbuildingpbworkscomwpage69244469Data20Collections20and20Datasets

Mackenzie E (1817) A new and complete system ofmodern geography containing an accurate delineationof the world 2 Vols Printed and published byMackenzie and Dent Newcastle-upon-Tyne

Mahony S and Pierazzo E (2012) Teaching skills orteaching methodology In Hirsch B D (ed) DigitalHumanities Pedagogy Practices Principles and PoliticsOpen Book Publishers httpwwwopenbookpubl-isherscomproduct161digital-humanities-pedagogyndashpracticesndashprinciples-and-politics

Maidment B (2013) Comedy Caricature and the SocialOrder 1820-1850 Manchester Manchester UniversityPress

Markey K (2007a) Twenty-five years of end-user search-ing Part 1 research findings Journal of the AmericanSociety for Information Science and Technology 58(8)1071ndash81

Markey K (2007b) Twenty-five years of end-usersearching Part 2 future research directions Journal ofthe American Society for Information Science andTechnology 58(8) 1123ndash30

Murray T (2016)The forecast for special librariesJournal of Library Administration 56(2) 188ndash98

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 9 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

Rethlefsen M Farrell A Osterhaus Trzasko L and

Brigham T (2015) Librarian co-authors correlated

with higher quality reported search strategies in general

internal medicine systematic reviews Journal of Clinical

Epidemiology 68(6) 617ndash26

Rockwell G (2017) Question about maximum capacity

of Voyant tools Response to Melissa Terras via email 27

January 2017

Rockwell G and Sinclair S 2016 Hermeneutica

Computer-Assisted Interpretation in the Humanities

Cambridge MIT Press

Roper T (2015) The impact of the clinical librarian a

review Journal of EAHIL 11(4) 19ndash22

Schmidt B (2014) Shipping maps and how states see

Sapping Attention Blog httpsappingattentionblogspot

couk201403shipping-maps-and-how-states-seehtml

Sinclair S (2017) Question about maximum capacity of

Voyant tools Response to Melissa Terras via email 27

January 2017

Smith D A Cordell R and Dillon E M (2013)

Infectious texts modeling text reuse in nineteenth-

century newspapers In IEEE International Conference

on Big Data October 6-9 2013 Santa Clara

California IEEE pp 86ndash94 101109

BigData20136691675

Smith D Cordell R and Mullen A (2015)

Computational methods for uncovering reprinted

texts in Antebellum newspapers American Literary

History 27(3)

Starr S S and Renford B L (1987) Evaluation of a

program to teach health professionals to search

MEDLINE Bulletin of the Medical Library Association

75(3) 193ndash201

Stijnman A (2012) Engraving and Etching 1400-2000 A

History of the Development of Manual Intaglio

Printmaking Processes London Houten Netherlands

Archetype Publications in association with HES and

DE GRAAF Publishers

Straus S Eisinga A and Sackett D (2016) What

drove the evidence cart Bringing the library to the

bedside Journal of the Royal Society of Medicine

109(6) 241ndash7

Streipe T and Talley M (2013) Embedded librarian-

ship In Kroski E (ed) Law Librarianship in the Digital

Age Plymouth Scarecrow pp 13ndash30

Terras M (2009) Potentials and Problems in Applying High

Performance Computing for Research in the Arts and

Humanities Researching e-Science Analysis of Census

Holdings Digital Humanities Quarterly httpwwwdigi-

talhumanitiesorgdhqvol34000070000070html

Terras M (2015) Opening access to collections the

making and using of open digitised cultural content

Online Information Review 39(5) 733ndash52 httpwww

emeraldinsightcomdoifull101108OIR-06-2015-0193

Thomas J (2004) Pictorial Victorians The Inscription of

Values in Word and Image Athens Ohio University

Press

Voss A Asgari-Targhi M Procter R and

Fergusson D (2010) Adoption of e-Infrastructure

services configurations of practice Philosophical

Transactions Series A Mathematical Physical and

Engineering Sciences 368 4161ndash76 doi 101098

rsta20100162

Wall A J (1893) Asiatic cholera its history pathology

and modern treatment London H K Lewis

Wittek S Sinclair S and Milner M (2015) DREaM

Distant Reading Early Modernity Digital Humanities

2015 Global Digital Humanities Sydney Australia 29

Junendash3 July 2015 httpdh2015orgabstractsxml

WITTEK_Stephen_DREaM__Distant_Reading_Early_

ModerWITTEK_Stephen_DREaM__Distant_Reading_

Early_Modernityhtml

Wynne M (2013) The role of Clarin in digital trans-

formations in the humanities International Journal of

Humanities and Arts Computing 7(1-2) 89

Wynne M (2015) Big Data and Digital Transformations in

the Humanities are we there yet Textual Digital

Humanities and Social Sciences Data Aberdeen 21ndash22

September 2015 httpwwwslidesharenetmartinwynne

big-data-and-digital-transformations-in-the-humanities

Notes1 httpswwwbluksubjectsdigital-scholarship2 httpwwwuclacukdh3 httpwwwbartlettuclacukcasa4 httpwwwuclacukisdservicesresearch-it5 httpopenglamorg an initiative to promote free

and open access to digital cultural heritage data sets6 For example Mining the History of Medicine (http

nactemacukhom) or Language of the State of the

Union (httpwwwtheatlanticcompoliticsarchive

201501the-language-of-the-state-of-the-union

384575)7 For example CLARIN (httpclarineu) and

DARIAH (httpswwwdariaheu)

M Terras et al

10 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

8 The British Library has various digital data sets

including (but not limited to) 7 million pages of his-

toric newspapers 1 million out of copyright book il-

lustrations 100000s of scientific articles text from

over 60000 books 1000s of UK theses and various

digitized medieval manuscripts We chose here just

one of its large-scale data sets to work with in this

pilot phase For the terms under which the British

Library makes collections available see httpwww

blukaboutustermscopyright9 The books cover a wide range of subject areas includ-

ing philosophy history poetry and literature most

are in English For a full list of metadata for this book

collection see doi 1021250DB2110 httpswikircuclacukwikiLegion just one of the

HPC facilities available at UCL for researchers see

httpwwwuclacukisdservicesresearch-itresearch-

computing11 It is difficult to put a figure on what a manageable text

file size would be for a researcher if we take that as

being able to search through it on their own machine

without further technical assistance This is dependent

on access to processing power which varies consider-

ably between desktop and laptop computers and the

complexity of the search the researcher is intending to

carry out as well as the format and granularity of the

source text Most humanities researchers should now

be able to comfortably manipulate text files of around

100 MB if they have access to a modern machine the

amount of text those files will contain is dependent on

how complex the file formats are Programs and tools

available to assist in text analysis include Voyant Tools

(httpsvoyant-toolsorg) and R (httpswwwr-pro-

jectorg) The upper limit to using server-based

Voyant is 20 MB of text which should correlate to

around 1 million words depending on format

(Sinclair 2017) Voyant Server can be downloaded

and used locally with an upper limit of around

100 MB on a standard PCMac (Rockwell 2017)

Indexing engines such as Lucene (httplucene

apacheorg) can search larger amounts of text how-

ever visualization of results will require further pro-

cessing and can be complex For further exploration of

lsquocomputer-assisted interpretation in the humanitiesrsquo

(particularly using Voyant) see Rockwell and

Sinclair (2016) and an example of the approach of

generating manageable smaller corpora from a large-

scale Early Modern Sources can be found in Wittek et

al (2015)12 All code data visualizations and other outputs from

this pilot project are freely available at httpsgithub

comUCL-dataspring13 httpswwwuclacukdispeopleoliverdukewilliams14 httpswwwshefacukhistoryresearchstudentswil-

liam-finley15 For more on the UKrsquos eScience infrastructure see the

work of the eScience Institute httpwwwesiacuk

Plan-Europe is the Platform of National eScience

Centres in Europe (httpplan-europeeu) In the

USA the equivalent of eScience is known as

Cyberinfrastructure see the National Science

Foundationrsquos guides httpwwwnsfgovdivindex

jspdivfrac14ACI

16 httpccschemuclacuk17 This approach is similar to that employed in the 1980s

when computerized searching of specialist databases

was first available (Farber and Shoham 2002

Markey 2007a 2007b) At that time evaluations car-

ried out on programs to cascade train end users via

librarians reported increased user uptake and effi-

ciency as compared to direct end-user training by

database providers (see Starr and Renford 1987

Bradigan and Mularski 1989 Ikeda and Schwartz

1992 on the library-based training of medical staff to

use MEDLINE) Significantly studies showed that

while end users wanted to be able to search databases

themselves they also desired access to searches

mediated by librarians andor a team approach to

search utilizing the end userrsquos specific knowledge of

the language and jargon of their discipline and the

information professionalrsquos understanding of

BOOLEAN and other advanced search techniques

(Ludwig et al 1988 Lancaster et al 1994 Bates

1996) The move towards embedded librarians in

law practices (Feliu and Frazer 2012 Streipe and

Talley 2013 Murray 2016) and clinical librarians in

hospitals (Brettle et al 2011 Roper 2015 Straus et

al 2016) can be seen to be a modern iteration of a

team approach to search In Higher Education subject

librarians are well-positioned to form part of a research

team using computational methods andor provide

information literacy training that upskills humanists

in computational methods (Rethlefsen et al 2015

Bedi and Walde 2016 Burke and Tumbleson 2016)

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 11 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

Page 10: Edinburgh Research Explorer · Williams, O & Farquhar, A 2017, 'Enabling complex analysis of large-scale digital collections: Humanities research, high-performance computing, and

crlacrlorgcontentearly20160322crl16-871

abstract

Bradigan P S and Mularski C A (1989) End-user

searching in a medical school curriculum an evaluated

modular approach Bulletin of the Medical Library

Association 77(4) 348ndash56

Brettle A Maden-Jenkins M Anderson L McNally

R Pratchett T Tancock J Thornton D and

Webb A (2011) Evaluating clinical librarian services

A systematic review Health Information and Libraries

Journal 28(1) 3ndash22

Burke J and Tumbleson B (2016) LMS embedded li-

brarianship and the educational role of librarians

Library Technology Reports 52(2) 5ndash9

Chadwick E (1842) Report on the sanitary condition of

the labouring population of Great Britain supplemen-

tary report on the results of special inquiry into the

practice of interment in towns (Vol 1) HM

Stationery Office

Donald D (1996) The Age of Caricature Satirical Prints

in the Reign of George III New Haven Published for the

Paul Mellon Centre for Studies in British Art by Yale

University Press

Farber M and Shoham S (2002) Users end-users

and end-user searchers of online information a his-

torical overview Online Information Review 26(2)

92ndash100

Feliu V and Frazer H (2012) Embedded librarians

teaching legal research as a lawyering skill Journal of

Legal Education 61(4) 540ndash59

Groen D Hetherington J Carver H B Nash R

W Bernabeu M O and Coveney P V (2013)

Analysing and modelling the performance of the

HemeLB lattice-Boltzmann simulation environ-

ment Journal of Computational Science 4(5) 412ndash

22

Hetherington J (2017) Question about data capacity

and use at UCL Response to Melissa Terras via email

27 January 2017

Hettrick S (2016) A not-so-brief history of Research

Software Engineers Software Sustainability Institute

httpswwwsoftwareacukblog2016-08-19-not-so-

brief-history-research-software-engineers

Huber M (2007) The Old Bailey Proceedings 1674-1834

Evaluating and annotating a corpus of 18th - and 19th-

century spoken English Studies in Variation Contacts and

Change in English 1 Annotating Variation and Change

httpwwwhelsinkifivariengseriesvolumes01huber

Hughes A (2009) Higher Education in a Web 20 WorldJISC httpwwwwebarchiveorgukwaybackarchive20140614042502httpwwwjiscacukpublicationsgeneralpublications2009heweb2aspx

Ikeda N R and Schwartz D G (1992) Impact ofend-user training on Pharmacy students a four-yearfollow-up study Bulletin of the Medical LibraryAssociation 80(2) 124ndash30

Lancaster F W Elzy C Zeter M J Metzler L andLow Y M (1994) Searching databases on CD-ROMcomparison of the results of end-user searching withresults from two modes of searching by skilled inter-mediaries RQ 33(3) 370ndash86

Leetaru K (2015) History as Big Data 500 Years of BookImages and Mapping Millions of Books Forbes Techhttpwwwforbescomsiteskalevleetaru20150916history-as-big-data-500-years-of-book-images-and-mapping-millions-of-books

Ludwig L Mixter J K and Emanuele M A (1988)User attitudes toward end-user literature searchingBulletin of the Medical Library Association 76(1)7ndash13

Lui A (2015) Data Collections and Datasets Curated byAlan Liu httpdhresourcesforprojectbuildingpbworkscomwpage69244469Data20Collections20and20Datasets

Mackenzie E (1817) A new and complete system ofmodern geography containing an accurate delineationof the world 2 Vols Printed and published byMackenzie and Dent Newcastle-upon-Tyne

Mahony S and Pierazzo E (2012) Teaching skills orteaching methodology In Hirsch B D (ed) DigitalHumanities Pedagogy Practices Principles and PoliticsOpen Book Publishers httpwwwopenbookpubl-isherscomproduct161digital-humanities-pedagogyndashpracticesndashprinciples-and-politics

Maidment B (2013) Comedy Caricature and the SocialOrder 1820-1850 Manchester Manchester UniversityPress

Markey K (2007a) Twenty-five years of end-user search-ing Part 1 research findings Journal of the AmericanSociety for Information Science and Technology 58(8)1071ndash81

Markey K (2007b) Twenty-five years of end-usersearching Part 2 future research directions Journal ofthe American Society for Information Science andTechnology 58(8) 1123ndash30

Murray T (2016)The forecast for special librariesJournal of Library Administration 56(2) 188ndash98

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 9 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

Rethlefsen M Farrell A Osterhaus Trzasko L and

Brigham T (2015) Librarian co-authors correlated

with higher quality reported search strategies in general

internal medicine systematic reviews Journal of Clinical

Epidemiology 68(6) 617ndash26

Rockwell G (2017) Question about maximum capacity

of Voyant tools Response to Melissa Terras via email 27

January 2017

Rockwell G and Sinclair S 2016 Hermeneutica

Computer-Assisted Interpretation in the Humanities

Cambridge MIT Press

Roper T (2015) The impact of the clinical librarian a

review Journal of EAHIL 11(4) 19ndash22

Schmidt B (2014) Shipping maps and how states see

Sapping Attention Blog httpsappingattentionblogspot

couk201403shipping-maps-and-how-states-seehtml

Sinclair S (2017) Question about maximum capacity of

Voyant tools Response to Melissa Terras via email 27

January 2017

Smith D A Cordell R and Dillon E M (2013)

Infectious texts modeling text reuse in nineteenth-

century newspapers In IEEE International Conference

on Big Data October 6-9 2013 Santa Clara

California IEEE pp 86ndash94 101109

BigData20136691675

Smith D Cordell R and Mullen A (2015)

Computational methods for uncovering reprinted

texts in Antebellum newspapers American Literary

History 27(3)

Starr S S and Renford B L (1987) Evaluation of a

program to teach health professionals to search

MEDLINE Bulletin of the Medical Library Association

75(3) 193ndash201

Stijnman A (2012) Engraving and Etching 1400-2000 A

History of the Development of Manual Intaglio

Printmaking Processes London Houten Netherlands

Archetype Publications in association with HES and

DE GRAAF Publishers

Straus S Eisinga A and Sackett D (2016) What

drove the evidence cart Bringing the library to the

bedside Journal of the Royal Society of Medicine

109(6) 241ndash7

Streipe T and Talley M (2013) Embedded librarian-

ship In Kroski E (ed) Law Librarianship in the Digital

Age Plymouth Scarecrow pp 13ndash30

Terras M (2009) Potentials and Problems in Applying High

Performance Computing for Research in the Arts and

Humanities Researching e-Science Analysis of Census

Holdings Digital Humanities Quarterly httpwwwdigi-

talhumanitiesorgdhqvol34000070000070html

Terras M (2015) Opening access to collections the

making and using of open digitised cultural content

Online Information Review 39(5) 733ndash52 httpwww

emeraldinsightcomdoifull101108OIR-06-2015-0193

Thomas J (2004) Pictorial Victorians The Inscription of

Values in Word and Image Athens Ohio University

Press

Voss A Asgari-Targhi M Procter R and

Fergusson D (2010) Adoption of e-Infrastructure

services configurations of practice Philosophical

Transactions Series A Mathematical Physical and

Engineering Sciences 368 4161ndash76 doi 101098

rsta20100162

Wall A J (1893) Asiatic cholera its history pathology

and modern treatment London H K Lewis

Wittek S Sinclair S and Milner M (2015) DREaM

Distant Reading Early Modernity Digital Humanities

2015 Global Digital Humanities Sydney Australia 29

Junendash3 July 2015 httpdh2015orgabstractsxml

WITTEK_Stephen_DREaM__Distant_Reading_Early_

ModerWITTEK_Stephen_DREaM__Distant_Reading_

Early_Modernityhtml

Wynne M (2013) The role of Clarin in digital trans-

formations in the humanities International Journal of

Humanities and Arts Computing 7(1-2) 89

Wynne M (2015) Big Data and Digital Transformations in

the Humanities are we there yet Textual Digital

Humanities and Social Sciences Data Aberdeen 21ndash22

September 2015 httpwwwslidesharenetmartinwynne

big-data-and-digital-transformations-in-the-humanities

Notes1 httpswwwbluksubjectsdigital-scholarship2 httpwwwuclacukdh3 httpwwwbartlettuclacukcasa4 httpwwwuclacukisdservicesresearch-it5 httpopenglamorg an initiative to promote free

and open access to digital cultural heritage data sets6 For example Mining the History of Medicine (http

nactemacukhom) or Language of the State of the

Union (httpwwwtheatlanticcompoliticsarchive

201501the-language-of-the-state-of-the-union

384575)7 For example CLARIN (httpclarineu) and

DARIAH (httpswwwdariaheu)

M Terras et al

10 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

8 The British Library has various digital data sets

including (but not limited to) 7 million pages of his-

toric newspapers 1 million out of copyright book il-

lustrations 100000s of scientific articles text from

over 60000 books 1000s of UK theses and various

digitized medieval manuscripts We chose here just

one of its large-scale data sets to work with in this

pilot phase For the terms under which the British

Library makes collections available see httpwww

blukaboutustermscopyright9 The books cover a wide range of subject areas includ-

ing philosophy history poetry and literature most

are in English For a full list of metadata for this book

collection see doi 1021250DB2110 httpswikircuclacukwikiLegion just one of the

HPC facilities available at UCL for researchers see

httpwwwuclacukisdservicesresearch-itresearch-

computing11 It is difficult to put a figure on what a manageable text

file size would be for a researcher if we take that as

being able to search through it on their own machine

without further technical assistance This is dependent

on access to processing power which varies consider-

ably between desktop and laptop computers and the

complexity of the search the researcher is intending to

carry out as well as the format and granularity of the

source text Most humanities researchers should now

be able to comfortably manipulate text files of around

100 MB if they have access to a modern machine the

amount of text those files will contain is dependent on

how complex the file formats are Programs and tools

available to assist in text analysis include Voyant Tools

(httpsvoyant-toolsorg) and R (httpswwwr-pro-

jectorg) The upper limit to using server-based

Voyant is 20 MB of text which should correlate to

around 1 million words depending on format

(Sinclair 2017) Voyant Server can be downloaded

and used locally with an upper limit of around

100 MB on a standard PCMac (Rockwell 2017)

Indexing engines such as Lucene (httplucene

apacheorg) can search larger amounts of text how-

ever visualization of results will require further pro-

cessing and can be complex For further exploration of

lsquocomputer-assisted interpretation in the humanitiesrsquo

(particularly using Voyant) see Rockwell and

Sinclair (2016) and an example of the approach of

generating manageable smaller corpora from a large-

scale Early Modern Sources can be found in Wittek et

al (2015)12 All code data visualizations and other outputs from

this pilot project are freely available at httpsgithub

comUCL-dataspring13 httpswwwuclacukdispeopleoliverdukewilliams14 httpswwwshefacukhistoryresearchstudentswil-

liam-finley15 For more on the UKrsquos eScience infrastructure see the

work of the eScience Institute httpwwwesiacuk

Plan-Europe is the Platform of National eScience

Centres in Europe (httpplan-europeeu) In the

USA the equivalent of eScience is known as

Cyberinfrastructure see the National Science

Foundationrsquos guides httpwwwnsfgovdivindex

jspdivfrac14ACI

16 httpccschemuclacuk17 This approach is similar to that employed in the 1980s

when computerized searching of specialist databases

was first available (Farber and Shoham 2002

Markey 2007a 2007b) At that time evaluations car-

ried out on programs to cascade train end users via

librarians reported increased user uptake and effi-

ciency as compared to direct end-user training by

database providers (see Starr and Renford 1987

Bradigan and Mularski 1989 Ikeda and Schwartz

1992 on the library-based training of medical staff to

use MEDLINE) Significantly studies showed that

while end users wanted to be able to search databases

themselves they also desired access to searches

mediated by librarians andor a team approach to

search utilizing the end userrsquos specific knowledge of

the language and jargon of their discipline and the

information professionalrsquos understanding of

BOOLEAN and other advanced search techniques

(Ludwig et al 1988 Lancaster et al 1994 Bates

1996) The move towards embedded librarians in

law practices (Feliu and Frazer 2012 Streipe and

Talley 2013 Murray 2016) and clinical librarians in

hospitals (Brettle et al 2011 Roper 2015 Straus et

al 2016) can be seen to be a modern iteration of a

team approach to search In Higher Education subject

librarians are well-positioned to form part of a research

team using computational methods andor provide

information literacy training that upskills humanists

in computational methods (Rethlefsen et al 2015

Bedi and Walde 2016 Burke and Tumbleson 2016)

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 11 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

Page 11: Edinburgh Research Explorer · Williams, O & Farquhar, A 2017, 'Enabling complex analysis of large-scale digital collections: Humanities research, high-performance computing, and

Rethlefsen M Farrell A Osterhaus Trzasko L and

Brigham T (2015) Librarian co-authors correlated

with higher quality reported search strategies in general

internal medicine systematic reviews Journal of Clinical

Epidemiology 68(6) 617ndash26

Rockwell G (2017) Question about maximum capacity

of Voyant tools Response to Melissa Terras via email 27

January 2017

Rockwell G and Sinclair S 2016 Hermeneutica

Computer-Assisted Interpretation in the Humanities

Cambridge MIT Press

Roper T (2015) The impact of the clinical librarian a

review Journal of EAHIL 11(4) 19ndash22

Schmidt B (2014) Shipping maps and how states see

Sapping Attention Blog httpsappingattentionblogspot

couk201403shipping-maps-and-how-states-seehtml

Sinclair S (2017) Question about maximum capacity of

Voyant tools Response to Melissa Terras via email 27

January 2017

Smith D A Cordell R and Dillon E M (2013)

Infectious texts modeling text reuse in nineteenth-

century newspapers In IEEE International Conference

on Big Data October 6-9 2013 Santa Clara

California IEEE pp 86ndash94 101109

BigData20136691675

Smith D Cordell R and Mullen A (2015)

Computational methods for uncovering reprinted

texts in Antebellum newspapers American Literary

History 27(3)

Starr S S and Renford B L (1987) Evaluation of a

program to teach health professionals to search

MEDLINE Bulletin of the Medical Library Association

75(3) 193ndash201

Stijnman A (2012) Engraving and Etching 1400-2000 A

History of the Development of Manual Intaglio

Printmaking Processes London Houten Netherlands

Archetype Publications in association with HES and

DE GRAAF Publishers

Straus S Eisinga A and Sackett D (2016) What

drove the evidence cart Bringing the library to the

bedside Journal of the Royal Society of Medicine

109(6) 241ndash7

Streipe T and Talley M (2013) Embedded librarian-

ship In Kroski E (ed) Law Librarianship in the Digital

Age Plymouth Scarecrow pp 13ndash30

Terras M (2009) Potentials and Problems in Applying High

Performance Computing for Research in the Arts and

Humanities Researching e-Science Analysis of Census

Holdings Digital Humanities Quarterly httpwwwdigi-

talhumanitiesorgdhqvol34000070000070html

Terras M (2015) Opening access to collections the

making and using of open digitised cultural content

Online Information Review 39(5) 733ndash52 httpwww

emeraldinsightcomdoifull101108OIR-06-2015-0193

Thomas J (2004) Pictorial Victorians The Inscription of

Values in Word and Image Athens Ohio University

Press

Voss A Asgari-Targhi M Procter R and

Fergusson D (2010) Adoption of e-Infrastructure

services configurations of practice Philosophical

Transactions Series A Mathematical Physical and

Engineering Sciences 368 4161ndash76 doi 101098

rsta20100162

Wall A J (1893) Asiatic cholera its history pathology

and modern treatment London H K Lewis

Wittek S Sinclair S and Milner M (2015) DREaM

Distant Reading Early Modernity Digital Humanities

2015 Global Digital Humanities Sydney Australia 29

Junendash3 July 2015 httpdh2015orgabstractsxml

WITTEK_Stephen_DREaM__Distant_Reading_Early_

ModerWITTEK_Stephen_DREaM__Distant_Reading_

Early_Modernityhtml

Wynne M (2013) The role of Clarin in digital trans-

formations in the humanities International Journal of

Humanities and Arts Computing 7(1-2) 89

Wynne M (2015) Big Data and Digital Transformations in

the Humanities are we there yet Textual Digital

Humanities and Social Sciences Data Aberdeen 21ndash22

September 2015 httpwwwslidesharenetmartinwynne

big-data-and-digital-transformations-in-the-humanities

Notes1 httpswwwbluksubjectsdigital-scholarship2 httpwwwuclacukdh3 httpwwwbartlettuclacukcasa4 httpwwwuclacukisdservicesresearch-it5 httpopenglamorg an initiative to promote free

and open access to digital cultural heritage data sets6 For example Mining the History of Medicine (http

nactemacukhom) or Language of the State of the

Union (httpwwwtheatlanticcompoliticsarchive

201501the-language-of-the-state-of-the-union

384575)7 For example CLARIN (httpclarineu) and

DARIAH (httpswwwdariaheu)

M Terras et al

10 of 11 Digital Scholarship in the Humanities 2017

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

8 The British Library has various digital data sets

including (but not limited to) 7 million pages of his-

toric newspapers 1 million out of copyright book il-

lustrations 100000s of scientific articles text from

over 60000 books 1000s of UK theses and various

digitized medieval manuscripts We chose here just

one of its large-scale data sets to work with in this

pilot phase For the terms under which the British

Library makes collections available see httpwww

blukaboutustermscopyright9 The books cover a wide range of subject areas includ-

ing philosophy history poetry and literature most

are in English For a full list of metadata for this book

collection see doi 1021250DB2110 httpswikircuclacukwikiLegion just one of the

HPC facilities available at UCL for researchers see

httpwwwuclacukisdservicesresearch-itresearch-

computing11 It is difficult to put a figure on what a manageable text

file size would be for a researcher if we take that as

being able to search through it on their own machine

without further technical assistance This is dependent

on access to processing power which varies consider-

ably between desktop and laptop computers and the

complexity of the search the researcher is intending to

carry out as well as the format and granularity of the

source text Most humanities researchers should now

be able to comfortably manipulate text files of around

100 MB if they have access to a modern machine the

amount of text those files will contain is dependent on

how complex the file formats are Programs and tools

available to assist in text analysis include Voyant Tools

(httpsvoyant-toolsorg) and R (httpswwwr-pro-

jectorg) The upper limit to using server-based

Voyant is 20 MB of text which should correlate to

around 1 million words depending on format

(Sinclair 2017) Voyant Server can be downloaded

and used locally with an upper limit of around

100 MB on a standard PCMac (Rockwell 2017)

Indexing engines such as Lucene (httplucene

apacheorg) can search larger amounts of text how-

ever visualization of results will require further pro-

cessing and can be complex For further exploration of

lsquocomputer-assisted interpretation in the humanitiesrsquo

(particularly using Voyant) see Rockwell and

Sinclair (2016) and an example of the approach of

generating manageable smaller corpora from a large-

scale Early Modern Sources can be found in Wittek et

al (2015)12 All code data visualizations and other outputs from

this pilot project are freely available at httpsgithub

comUCL-dataspring13 httpswwwuclacukdispeopleoliverdukewilliams14 httpswwwshefacukhistoryresearchstudentswil-

liam-finley15 For more on the UKrsquos eScience infrastructure see the

work of the eScience Institute httpwwwesiacuk

Plan-Europe is the Platform of National eScience

Centres in Europe (httpplan-europeeu) In the

USA the equivalent of eScience is known as

Cyberinfrastructure see the National Science

Foundationrsquos guides httpwwwnsfgovdivindex

jspdivfrac14ACI

16 httpccschemuclacuk17 This approach is similar to that employed in the 1980s

when computerized searching of specialist databases

was first available (Farber and Shoham 2002

Markey 2007a 2007b) At that time evaluations car-

ried out on programs to cascade train end users via

librarians reported increased user uptake and effi-

ciency as compared to direct end-user training by

database providers (see Starr and Renford 1987

Bradigan and Mularski 1989 Ikeda and Schwartz

1992 on the library-based training of medical staff to

use MEDLINE) Significantly studies showed that

while end users wanted to be able to search databases

themselves they also desired access to searches

mediated by librarians andor a team approach to

search utilizing the end userrsquos specific knowledge of

the language and jargon of their discipline and the

information professionalrsquos understanding of

BOOLEAN and other advanced search techniques

(Ludwig et al 1988 Lancaster et al 1994 Bates

1996) The move towards embedded librarians in

law practices (Feliu and Frazer 2012 Streipe and

Talley 2013 Murray 2016) and clinical librarians in

hospitals (Brettle et al 2011 Roper 2015 Straus et

al 2016) can be seen to be a modern iteration of a

team approach to search In Higher Education subject

librarians are well-positioned to form part of a research

team using computational methods andor provide

information literacy training that upskills humanists

in computational methods (Rethlefsen et al 2015

Bedi and Walde 2016 Burke and Tumbleson 2016)

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 11 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017

Page 12: Edinburgh Research Explorer · Williams, O & Farquhar, A 2017, 'Enabling complex analysis of large-scale digital collections: Humanities research, high-performance computing, and

8 The British Library has various digital data sets

including (but not limited to) 7 million pages of his-

toric newspapers 1 million out of copyright book il-

lustrations 100000s of scientific articles text from

over 60000 books 1000s of UK theses and various

digitized medieval manuscripts We chose here just

one of its large-scale data sets to work with in this

pilot phase For the terms under which the British

Library makes collections available see httpwww

blukaboutustermscopyright9 The books cover a wide range of subject areas includ-

ing philosophy history poetry and literature most

are in English For a full list of metadata for this book

collection see doi 1021250DB2110 httpswikircuclacukwikiLegion just one of the

HPC facilities available at UCL for researchers see

httpwwwuclacukisdservicesresearch-itresearch-

computing11 It is difficult to put a figure on what a manageable text

file size would be for a researcher if we take that as

being able to search through it on their own machine

without further technical assistance This is dependent

on access to processing power which varies consider-

ably between desktop and laptop computers and the

complexity of the search the researcher is intending to

carry out as well as the format and granularity of the

source text Most humanities researchers should now

be able to comfortably manipulate text files of around

100 MB if they have access to a modern machine the

amount of text those files will contain is dependent on

how complex the file formats are Programs and tools

available to assist in text analysis include Voyant Tools

(httpsvoyant-toolsorg) and R (httpswwwr-pro-

jectorg) The upper limit to using server-based

Voyant is 20 MB of text which should correlate to

around 1 million words depending on format

(Sinclair 2017) Voyant Server can be downloaded

and used locally with an upper limit of around

100 MB on a standard PCMac (Rockwell 2017)

Indexing engines such as Lucene (httplucene

apacheorg) can search larger amounts of text how-

ever visualization of results will require further pro-

cessing and can be complex For further exploration of

lsquocomputer-assisted interpretation in the humanitiesrsquo

(particularly using Voyant) see Rockwell and

Sinclair (2016) and an example of the approach of

generating manageable smaller corpora from a large-

scale Early Modern Sources can be found in Wittek et

al (2015)12 All code data visualizations and other outputs from

this pilot project are freely available at httpsgithub

comUCL-dataspring13 httpswwwuclacukdispeopleoliverdukewilliams14 httpswwwshefacukhistoryresearchstudentswil-

liam-finley15 For more on the UKrsquos eScience infrastructure see the

work of the eScience Institute httpwwwesiacuk

Plan-Europe is the Platform of National eScience

Centres in Europe (httpplan-europeeu) In the

USA the equivalent of eScience is known as

Cyberinfrastructure see the National Science

Foundationrsquos guides httpwwwnsfgovdivindex

jspdivfrac14ACI

16 httpccschemuclacuk17 This approach is similar to that employed in the 1980s

when computerized searching of specialist databases

was first available (Farber and Shoham 2002

Markey 2007a 2007b) At that time evaluations car-

ried out on programs to cascade train end users via

librarians reported increased user uptake and effi-

ciency as compared to direct end-user training by

database providers (see Starr and Renford 1987

Bradigan and Mularski 1989 Ikeda and Schwartz

1992 on the library-based training of medical staff to

use MEDLINE) Significantly studies showed that

while end users wanted to be able to search databases

themselves they also desired access to searches

mediated by librarians andor a team approach to

search utilizing the end userrsquos specific knowledge of

the language and jargon of their discipline and the

information professionalrsquos understanding of

BOOLEAN and other advanced search techniques

(Ludwig et al 1988 Lancaster et al 1994 Bates

1996) The move towards embedded librarians in

law practices (Feliu and Frazer 2012 Streipe and

Talley 2013 Murray 2016) and clinical librarians in

hospitals (Brettle et al 2011 Roper 2015 Straus et

al 2016) can be seen to be a modern iteration of a

team approach to search In Higher Education subject

librarians are well-positioned to form part of a research

team using computational methods andor provide

information literacy training that upskills humanists

in computational methods (Rethlefsen et al 2015

Bedi and Walde 2016 Burke and Tumbleson 2016)

Analysis of large-scale digital collections

Digital Scholarship in the Humanities 2017 11 of 11

Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017