edinburgh research explorer · williams, o & farquhar, a 2017, 'enabling complex analysis...
TRANSCRIPT
Edinburgh Research Explorer
Enabling complex analysis of large-scale digital collections
Citation for published versionTerras M Baker J Hetherington J Beavan D Zaltz Austwick M Welsh A ONeill H Finley W Duke-Williams O amp Farquhar A 2017 Enabling complex analysis of large-scale digital collections Humanitiesresearch high-performance computing and transforming access to British Library digital collections DigitalScholarship in the Humanities httpsdoiorg101093llcfqx020
Digital Object Identifier (DOI)101093llcfqx020
LinkLink to publication record in Edinburgh Research Explorer
Document VersionPublishers PDF also known as Version of record
Published InDigital Scholarship in the Humanities
General rightsCopyright for the publications made accessible via the Edinburgh Research Explorer is retained by the author(s)and or other copyright owners and it is a condition of accessing these publications that users recognise andabide by the legal requirements associated with these rights
Take down policyThe University of Edinburgh has made every reasonable effort to ensure that Edinburgh Research Explorercontent complies with UK legislation If you believe that the public display of this file breaches copyright pleasecontact openaccessedacuk providing details and we will remove access to the work immediately andinvestigate your claim
Download date 07 Jun 2020
Enabling complex analysis of large-scale digital collections humanitiesresearch high-performance com-puting and transforming access toBritish Library digital collections
Melissa Terras
DepartmentofInformationStudiesUniversityCollegeLondonUKand
UCL Centre for Digital Humanities University College London UK
James Baker
School of History Art History and Philosophy University of
Sussex UK
James Hetherington
Research Software Development Group Research IT Services
University College London UK
David Beavan
UCL Centre for Digital Humanities University College London UK
Martin Zaltz Austwick
Centre for Advanced Spatial Analysis University College London UK
Anne Welsh
Department of Information Studies University College London UK
Helen OrsquoNeill
Department of Information Studies University College London
UK The London Library UK
Will Finley
Department of History University of Sheffield UK
Oliver Duke-Williams
Department of Information Studies University College London UK
Adam Farquhar
Digital Scholarship British Library UK
Digital Scholarship in the Humanities The Author 2017 Published by Oxford University Press on behalf of EADHThis is an Open Access article distributed under the terms of the Creative Commons Attribution License (httpcreativecommonsorglicensesby40) which permits unrestricted reuse distribution and reproduction in any medium provided the original work isproperly cited
1 of 11
doi101093llcfqx020Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
AbstractAlthough there has been a drive in the cultural heritage sector to provide large-scale open data sets for researchers we have not seen a commensurate rise inhumanities researchers undertaking complex analysis of these data sets for theirown research purposes This article reports on a pilot project at UniversityCollege London working in collaboration with the British Library to scopeout how best high-performance computing facilities can be used to facilitatethe needs of researchers in the humanities Using institutional data-processingframeworks routinely used to support scientific research we assisted four huma-nities researchers in analysing 60000 digitized books and we present two result-ing case studies here This research allowed us to identify infrastructural andprocedural barriers and make recommendations on resource allocation to bestsupport non-computational researchers in undertaking lsquobig datarsquo research Werecommend that research software engineer capacity can be most efficiently de-ployed in maintaining and supporting data sets while librarians can provide anessential service in running initial routine queries for humanities scholars Atpresent there are too many technical hurdles for most individuals in the huma-nities to consider analysing at scale these increasingly available open data setsand by building on existing frameworks of support from research computing andlibrary services we can best support humanities scholars in developing methodsand approaches to take advantage of these research opportunities
1 Introduction
How best can humanities researchers access andanalyse large-scale digital data sets available frominstitutions in the cultural and heritage sectorWhat barriers remain in place for those from thehumanities wishing to use high-performance com-puting (HPC) to provide insights into historicaldata sets using lsquobig-datarsquo analytical techniquesThis article describes a pilot project that workedin collaboration with non-computationally trainedhumanities researchers to identify and overcomebarriers to complex analysis of large-scale digitalcollections It used institutional university frame-works that routinely support the processing oflarge-scale data sets for research purposes in thesciences The project brought together humanitiesresearchers research software engineers (Hettrick2016) and information professionals from theBritish Library Digital Scholarship Department1
University College London (UCL) Centre forDigital Humanities2 UCL Centre for AdvancedSpatial Analysis3 and UCL Research IT Services(UCL RITS)4 to analyse an open-licenced large-scale data set from the British Library While
useful research results were generated undertakingthis project clarified the technical and proceduralbarriers that exist when humanities researchers at-tempt to utilize computational research infrastruc-tures in the pursuit of their own research questions
2 Overview
The drive in the gallery library archive andmuseum (GLAM) sector towards opening up collec-tions data5 as well as the growth in data publishedby publicly funded research projects means huma-nities researchers have a wealth of large-scale digitalcollections available to them (Lui 2015 Terras2015) Many of these data sets are released underopen licences that permit uninhibited use by anyonewith an Internet collection and modest storage cap-acity A few humanities researchers have exploitedthese resources and their interpretations makeclaims that change our understanding of culturalphenomena (Smith et al 2013 Schmidt 2014Smith et al 2015 Huber 2007 Leetaru 2015)Nevertheless there remain major barriers to thewidespread uptake of these data sets and related
Correspondence
Melissa Terras UCL
Department of Information
Studies Foster Court
University College London
Gower Street London
WC1E 6BT UK
mterrasuclacuk
M Terras et al
2 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
computational approaches by humanities re-searchers which risks diminishing the relevance ofthe humanities in lsquobig datarsquo analysis (Wynne 2015)These barriers include
fragmentation of communities resources andtools
lack of interoperability complexity and incompleteness of heterogeneous
cultural heritage data sets (Terras 2009) and lack of technical skills lsquomainstream researchers
in the humanities and social sciences often donrsquotknow what the new possibilities arersquo (ibid) andseldom have the technical experience to experi-ment (Hughes 2009 Mahony and Pierazzo2012)
A common response to this lack of awarenessand computational skills is to build Web-basedinterfaces to data6 or federated services and infra-structures7 While these interfaces play a positiverole in introducing humanities researchers tolarge-scale digital collections they rarely fulfil thecomplex needs of humanities research which con-stantly questions received approaches and results orallow researchers to tailor analysis without beinglimited by shared assumptions and methods(Wynne 2013)
3 Method
We explored the challenges associated with deploy-ing and working with large-scale digital collectionssuitable for humanities research using a publicdomain digital collection provided by the BritishLibrary8 This circa 60000-book data set covers fic-tion and non-fiction publications from the 17th18th and 19th centuries ormdashseen as datamdash224 GB of compressed ALTO XML that includesboth content (captured using an Optical CharacterRecognition (OCR) process) and the location ofthat content on a page9 Using UCLrsquos centrallyfunded computing facilities10 we worked fromMarchndashJuly 2015 with UCL RITS and a cohort offour humanities researchers (from doctoral candi-dates to mid-career scholars) to ask queries thatcould not be satisfied by search- and discovery-
orientated graphical user interfaces Working in col-laboration we turned their research questions intocomputational queries explored ways in which thereturned data could be visualized and capturedtheir thoughts on the process through semi-struc-tured interviews
4 Results
We successfully ran queries across the data set thattracked linguistic change identified core phrasesplotted the placement of illustrations and mappedlocations mentioned within core texts The semi-structured interviews conducted with non-compu-tationally trained humanities researchers at variousstages during the collaborative work supported fourkey findings First that breaking down a researchquestion into a series of more defined computa-tional queries was time-consuming and challengingSecondly that the iterative nature of this researchmethodology puts pressure on the time taken toexecute queries and that long processing timeswere frustrating Thirdly that full comprehensionof the programming code was not necessary to pro-cess data and use their outputs in research thoughunderstanding the inputs outputs and effects ofparameters was required Fourthly that creatingderived data sets of a size manageable by desktopPCs11 opened up further investigation using estab-lished methods Indeed we found that buildingqueries that generate derived data sets from large-scale digital collections (small enough to be workedon locally with familiar tools) is an effective meansof empowering non-computationally trained huma-nities researchers to develop the skill sets required toundertake complex analysis of humanities data12
Our case studies deepen and add nuance to thesefindings Two of our case studies were interested inlooking at instances of particular words or phrasesin the corpus (for example lsquoprofessorrsquo) or particu-lar combinations of phrases within the corpus(lsquohigher educationrsquo) to identify a particular institu-tion and group of persons across time The require-ments from the researchers were to return thecomplete page of text that surrounded each ex-ample This was found to be technically quite
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 3 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
straightforward and resulted in a text file being de-livered to the Humanities Scholars which they couldthen lsquoclose readrsquo to analyse each instance of thesearch term within a given page of the book in thecorpus Analysis in this case entails finding instancesof the search term in question however there arefurther possibilities that can interrogate the data setfurther in procedurally and methodologically novelways We present here two more ambitious casestudies that allowed for further visualization andanalysis
41 Case Study 1 history of medicineDuke-Williams is a senior lecturer in DigitalInformation Studies in the Department ofInformation Studies at UCL13 and his researchinterests include the presentation of spatial dataand dissemination of demographic data and thepast present and future of demographic data
capture in the UK Visualization of these kinds ofdata can be used to explore issues around the spreadof diseases and the research questions were howdoes the occurrence of diseases in published litera-ture compare to known epidemics in the 19th cen-tury Can we see any correlation between theoccurrence of infectious diseases in society and ref-erence to these diseases in both fiction and non-fiction
Variations in the number of mentions of cholera(Fig 1 continuous black line) were compared torecorded epidemics (shaded bars on Fig 1)A sharp rise in mentions coincides with the firstcholera epidemic in the UK of 1831ndash32 a similarbut less pronounced rise is coincident with the1848ndash49 epidemic A more volatile pattern of men-tions is observed after this point subsequent spikesmay be associated with epidemics within andbeyond the UK or may be less directly related to
0
05
1
15
2
25
3
1790 1810 1830 1850 1870 1890 1910
Num
ber
of re
fere
nces
per
100
0 w
ords
Date of Publication
Disease References
year_cholera_epidemics Whooping Cholera Consumpon Measles
Fig 1 A search for mentions of various infectious diseases (cholera whooping cough consumption and measles)across the 60000-book data set We compared the profound spikes for cholera in the data set with known dataregarding epidemics in the UK (Chadwick 1842 Wall 1893) which appear as the bars on the graph showing arelationship between the first major UK outbreak of cholera and its appearance within the written record of thetime (in 1831ndash32) and again with the second UK epidemic (1848ndash49) Later outbreaks (1853ndash54 and 1863) do notsee this same correlation There are further pronounced spikes for mentions of cholera in the1870s and 1880s these arenot associated with UK epidemics but there were outbreaks in the USA and elsewhere Identifying the texts that refer tothese outbreaks allows us to look more closely at these clusters and to understand the relationship between publichealth epidemiology and the published historical record
M Terras et al
4 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
disease incidence Identifying the range and type oftexts (whether epidemiological reports or worksaimed at a wider audience) may help to informand understand the cultural response to disease
This work opens up possibilities for our under-standing of trends in both fiction and non-fictionand could be linked into further data sets (for ex-ample of digitized historical newspaper data)In the case of our pilot project it demonstratedthat we could graph and visualize searches basedon the corpus to present overviews that wereuseful to our researcher but only in conjunctionwith both our research software engineer and ourinformation visualization expert this service thenmdashas a result of the person hours requiredmdashdoes notscale in practice demonstrating both the potentialin the data set and the current limited opportunitieshistorians epidemiologists and historians of sci-ence have to generate such visualizations fromopen-licenced data sets
42 Case Study 2 the history of imagesFinley is a doctoral candidate on the British Libraryand University of Sheffield Collaborative PhDStudentship lsquoThe Printed Image 1750ndash1850 towardsa Digital History of Printed Book Illustrationrsquo14
Between 1750 and 1850 changes in printing tech-nology enabled several kinds of image to proliferateand for image and text to be brought together innovel and unexpected ways Existing printing tech-nologiesmdashsuch as woodcutsmdashcontinued alongsidenew printing technologies shaping the dissemin-ation reuse and meaning of the designs they con-veyed (Stijnman 2012 Maidment 2013) Tounderstand these changes scholars have so farsampled small hand-crafted collections of imagesan approach repeated in the fields of art and culturalhistory (Donald 1996 Thomas 2004) Yet digitalsources allow us to study these changes with a muchlarger sample to use visual content as well as meta-data to grapple with past phenomena at scale
Finleyrsquos research focuses on the digital imagesfrom the same 60000-book data set our projectuses The research addresses questions such asHow did changes in image techniques and the sizeof images map onto the different genres over timeWhat do quantitative findings reveal about the
changing meanings of images from one genre tothe next How do the findings made possibleusing digital humanities techniques and digitalsources compare to those using traditional methodsand small hand-crafted collections
To support these research questions we queriedthe book XML to extract the coordinates of theboundary boxes put around each area the OCRprocess defined as an image The resulting deriveddata lists the title author place of publication andreference number for each book For each of the 1million images in these books the derived data liststhe page number it appears on x-position of thetop left corner y-position of the top left corner itswidth its height and its overall size as a percentageof the page We then took two approaches to turnFinleyrsquos research questions into computationalqueries
First we used the data derived from the HPC togenerate a graph of the instances of images by theirsize as a percentage of the page over time (Fig 2)This enabled Finley to observe the dominance of fullpage and very small images (lt15 of the page)between the 1750s and 1810s after which timemdashdriven by novel deployment of woodcuts andlithographs in booksmdashthe range of figure sizes
Fig 2 A search for figures between 1750 and 1850plotted according to the size of each figure in relationto the size of the page
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 5 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
diversified Although the graph is not normalized bythe number of images in the data set for each yearand is therefore dominated by the greater volume ofbooks in the data set after 1800 it has proved auseful reference and a new way into the macro pat-terns of book illustration
Secondly we wrote a script that could be runlocally in R to create graphs based on image datafor single books Plotting the page number on thex-axis and figures as a percentage of the page on they-axis the script generated a visual representation ofthe size of illustrations in a book Finley selectedbooks for analysis to observe patterns in the use ofillustrations in books on history geology and top-ography (the subjects of his doctoral research) Here(Fig 3) we see this for the 1817 A new and completeSystem of Modern Geography a two-volume workpublished in Newcastle upon Tyne by Mackenzie(1817) The discrepancies in use of illustrations be-tween the two volumes took Finley back to thephysical books to assess how the placement andsize of images changed the reading experience be-tween volumes and to compare the findings withsimilar multi-volume works
Subsequent to the project Finley has continuedto use the project data and scripts in his researchFor example he has used the image location data toplot and compare the average position images inbooks This has underscored the value of generatingderived data that can be used locally by a researcheroutside the context of a HPC facility and a fundedproject
5 Infrastructuralrecommendations
From a technical perspective this pilot highlightedvarious sticking points when using infrastructuredeveloped predominantly for scientific researchThe combined data input and output volumeundertaken during our work (less than 300 GB) isonly moderately large by comparison to the scien-tific data sets UCL RITS usually encounters for al-though there are shared assumptions betweenresearch infrastructures (adoption of technicalstandards and the sharing of tools approachesand research outputs (Wynne 2015)) most of theUKrsquos university eScience15 infrastructure has beenconstructed specifically to run scientific and engin-eering simulations not for search and analysis of thetypes of heterogeneous data sets we see emanatingfrom cultural heritage institutions We had a largetextual input (224 GB) a simple calculation and asmall output summary of only a few KB By com-parison the typical engineering simulationaddresses moderately sized numerical input dataruns a long complicated calculation and producesa large output (multiple TBs) The average data sizeof project using the UCL data storage service is44 TB (Hetherington 2017) For example thework of the UCL Centre for ComputationalScience16 on brain blood flow simulations takes aninput file of around 1 GB and for a full productionsimulation recording a snapshot once every 200time steps produces 20 TB of output (Groen etal 2013) Poor uptake in the Arts andHumanities (Atkins et al 2010 Voss et al 2010)has meant that these computational systems havenot been optimized for Arts and Humanities work-loads The file system and network configuration of
0
25
50
75
100
7505002500
Imag
e si
ze (
of p
age)
Page number
Illustration Area by Page for Book ID 002322267
002322267_1 002322267_2
Fig 3 A search for figures in A new and complete Systemof Modern Geography two volumes (Mackenzie 1817)plotted by page (x-axis) percentage of page the eachfigure occupies (y-axis) and separated by volume
M Terras et al
6 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
LegionmdashUCL RITSrsquos centrally funded resource forrunning complex and large computational scientificqueries across a large number of coresmdashdid notmatch the way that the data set in question wasstructured (a large number of small zipped XMLfiles)
The complexities associated with redeployingarchitectures designed to work with scientific data(massive yet very structured) to the processing ofhumanities data (not massive but more unstruc-tured) should not be understated and are a majorfinding of this project Relevant libraries (such as anefficient XML processor) were needed to be installedand optimized for the hardware Also the dataneeded to be transformed to a structure that the par-allel file system (Lustre) could address efficiently(that is fewer larger files) We found that the archi-tecture at UCL which was configured for effectivecompute of scientific data was inputoutput limitedfor our processing requirements rather than compu-tationally limited Understanding the needs of ouruser community has already fed into the procure-ment and development of HPC facilities at UCL toensure that the systemsmdashwhich are available to allresearchersmdashcan deal with the variety and type ofdata that digital humanists wish to analyse in future
Best practice recommendations for similar pro-jects emerged from this work the need to buildmultiple derived data sets (counts of books andwords per year words and pages per book etc) tonormalize results and maintain statistical validitythe necessity of documenting decisions taken whenprocessing data and metadata and the value ofhaving fixed definable data for researchers to ex-plain results in relation to (and in turn the risksassociated with iterating data sets) We also dis-covered that a core set of four or five queries gavemost of the humanities researchers the type of in-formation they required to take a subset of dataaway to process effectively themselves searches forall variants of a word searches that return keywordsin context traced over time NOT searches for aword or phrase that ignored another word orphrase searches for a word when in close proximityto a second word and searches based on imagemetadata It is the subset of the data set that mosthumanities scholars required and were happy to be
presented with for further analysis (with most re-searchers wishing to see their search term in contextpresented with the complete page of the text it wasfound within to allow informed understanding)
A main finding of this pilot was given mosthumanities researchers have a research problemthat can be facilitated by a standard set of queriesacross large-scale textual data that it would be moreefficient to train a focussed group of service pro-viders to be able to generate the results needed byresearchers than providing widespread training ofhumanities academics in this area HigherEducation already employs librarians to assist insearching and training for searching (informationliteracy) and providing this professional groupwith adjustable lsquorecipesrsquo for defined computationalqueries and background training on their use wouldsituate access to infrastructure in the resource towhich humanists already turn for assistancemdashtheirsubject librarianmdashand thereby normalize such com-putational work within the general humanitiesworkflow17 In turn research software engineerscould be invoked as collaborators for their expertisesuch as for developing more complex searchesbeyond the basic recipes rather than having torepeat the defined searches across data for differentresearchers which would allow limited resources tobe used efficiently and to build on existing frame-works of support from both the library and comput-ing services
Given issues in resourcing such facilities at everyUniversity it may be more efficient for multipleHigher Education Institutions to support a specialistservice perhaps under the umbrella of the likes ofJisc Historical Texts (httphistoricaltextsjiscacuk)or national or legal deposit libraries Expertise andapproaches if not the service itself could also befacilitated through the likes of Digital ResearchInfrastructure for the Arts and Humanities(DARIAH) (httpwwwdariaheu) and CommonLanguage Resources and Technology Infrastructure(CLARIN) (httpswwwclarineu) However localsupport for researchers wishing to utilize existingeScience technologies within the Higher Educationsector should be possible Such support enables Artsand Humanities researchers to develop ongoingmutually beneficial relationships with research
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 7 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
computing within their own institution This inturn can encourage other researchers to use theseresources (rather than them being only available as aspecialist service which users have to seek out)Research computing infrastructure across the uni-versity sector will not meet the needs of ArtsHumanities and Social Sciences researchers unlessacademics in these fields becomes active users ofthe systems and their requirements can be takeninto account going forward
6 Conclusion
We successfully mounted large-scale humanitiesdata on HPC University infrastructure in an inter-disciplinary project that required input from manyprofessionals to aid the humanities scholars intheir research tasks The collaborative approachwe undertook in this project is labour-intensiveand does not scale This should not however dis-courage the sector from taking this work forwardWe found that many research questions can beexpressed with similar computational queriesalbeit with parameters adjusted to suit We recom-mend therefore that Higher EducationInstitutions or HEI clusters looking to build cap-acity for enabling complex analysis of large-scaledigital collections by their non-computationallytrained humanities researchers should considerthe following activities
(1) Invest in research software engineer capacityto deploy and maintain openly licensed large-scale digital collections from across the GLAMsector to facilitate research in the arts huma-nities and social and historical sciences
(2) Invest in training library staff to run theseinitial queries in collaboration with huma-nities faculty to support work with subsetsof data that are produced and to documentand manage resulting code and derived data
Our pilot project demonstrates that there are atpresent too many technical hurdles for most indi-viduals in the arts and humanities to consider ana-lysing large-scale open data sets Those hurdles canbe removed with initial help in ingest and
deployment of the data and the provision of spe-cific structured training and support which willallow humanities researchers to get to a subset ofuseful data they can comfortably and more simplyprocess themselves without the need for extensivesupport While we together with our partners haveplans to continue expanding the range and depth ofresearch carried out on our chosen data set thisproject has signposted many of the barriers to en-courage greater uptake of lsquobig datarsquo research acrossthe Arts and Humanities These findings should beof use to researchers wishing to use comparableapproaches and to service providers in researchcomputing aiming to encourage the use of sharedcomputational facilities by the Arts and Humanitiescommunity
AcknowledgementsThis project was funded by a pilot grant from theJisc Research Data Spring (JISC 3548) between Apriland July 2015 See httpswwwjiscacukrdpro-jectsresearch-data-spring for further details Theauthors thank our colleagues in the British Libraryand UCL RITS (particularly Clare Gryce) for theirsupport Geoffrey Rockwell and Stefan Sinclair fordiscussions on the current limitations of text ana-lysis for individual humanities scholars and Scott BWeingart for his assistance
ReferencesAtkins D E Borgman C L Bindhoff N Ellisman
M Felman S Foster I and Heck A (2010)
lsquolsquoRCUK Review of e-Science 2009rsquorsquo Research Councils
UK httpswwwepsrcacuknewseventspubsrcuk-review-of-e-science-2009-building-a-uk-foundation-
for-the-transformative-enhancement-of-research-and-
innovation
Bates M J (1996) The Getty end-user online searching
project in the humanities report no 6 overview and
conclusions College and Research Libraries 57(6)514ndash23
Bedi S and Walde C (2016) Transforming roles
Canadian academic librarians embedded in faculty re-
search projects College and Research Libraries http
M Terras et al
8 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
crlacrlorgcontentearly20160322crl16-871
abstract
Bradigan P S and Mularski C A (1989) End-user
searching in a medical school curriculum an evaluated
modular approach Bulletin of the Medical Library
Association 77(4) 348ndash56
Brettle A Maden-Jenkins M Anderson L McNally
R Pratchett T Tancock J Thornton D and
Webb A (2011) Evaluating clinical librarian services
A systematic review Health Information and Libraries
Journal 28(1) 3ndash22
Burke J and Tumbleson B (2016) LMS embedded li-
brarianship and the educational role of librarians
Library Technology Reports 52(2) 5ndash9
Chadwick E (1842) Report on the sanitary condition of
the labouring population of Great Britain supplemen-
tary report on the results of special inquiry into the
practice of interment in towns (Vol 1) HM
Stationery Office
Donald D (1996) The Age of Caricature Satirical Prints
in the Reign of George III New Haven Published for the
Paul Mellon Centre for Studies in British Art by Yale
University Press
Farber M and Shoham S (2002) Users end-users
and end-user searchers of online information a his-
torical overview Online Information Review 26(2)
92ndash100
Feliu V and Frazer H (2012) Embedded librarians
teaching legal research as a lawyering skill Journal of
Legal Education 61(4) 540ndash59
Groen D Hetherington J Carver H B Nash R
W Bernabeu M O and Coveney P V (2013)
Analysing and modelling the performance of the
HemeLB lattice-Boltzmann simulation environ-
ment Journal of Computational Science 4(5) 412ndash
22
Hetherington J (2017) Question about data capacity
and use at UCL Response to Melissa Terras via email
27 January 2017
Hettrick S (2016) A not-so-brief history of Research
Software Engineers Software Sustainability Institute
httpswwwsoftwareacukblog2016-08-19-not-so-
brief-history-research-software-engineers
Huber M (2007) The Old Bailey Proceedings 1674-1834
Evaluating and annotating a corpus of 18th - and 19th-
century spoken English Studies in Variation Contacts and
Change in English 1 Annotating Variation and Change
httpwwwhelsinkifivariengseriesvolumes01huber
Hughes A (2009) Higher Education in a Web 20 WorldJISC httpwwwwebarchiveorgukwaybackarchive20140614042502httpwwwjiscacukpublicationsgeneralpublications2009heweb2aspx
Ikeda N R and Schwartz D G (1992) Impact ofend-user training on Pharmacy students a four-yearfollow-up study Bulletin of the Medical LibraryAssociation 80(2) 124ndash30
Lancaster F W Elzy C Zeter M J Metzler L andLow Y M (1994) Searching databases on CD-ROMcomparison of the results of end-user searching withresults from two modes of searching by skilled inter-mediaries RQ 33(3) 370ndash86
Leetaru K (2015) History as Big Data 500 Years of BookImages and Mapping Millions of Books Forbes Techhttpwwwforbescomsiteskalevleetaru20150916history-as-big-data-500-years-of-book-images-and-mapping-millions-of-books
Ludwig L Mixter J K and Emanuele M A (1988)User attitudes toward end-user literature searchingBulletin of the Medical Library Association 76(1)7ndash13
Lui A (2015) Data Collections and Datasets Curated byAlan Liu httpdhresourcesforprojectbuildingpbworkscomwpage69244469Data20Collections20and20Datasets
Mackenzie E (1817) A new and complete system ofmodern geography containing an accurate delineationof the world 2 Vols Printed and published byMackenzie and Dent Newcastle-upon-Tyne
Mahony S and Pierazzo E (2012) Teaching skills orteaching methodology In Hirsch B D (ed) DigitalHumanities Pedagogy Practices Principles and PoliticsOpen Book Publishers httpwwwopenbookpubl-isherscomproduct161digital-humanities-pedagogyndashpracticesndashprinciples-and-politics
Maidment B (2013) Comedy Caricature and the SocialOrder 1820-1850 Manchester Manchester UniversityPress
Markey K (2007a) Twenty-five years of end-user search-ing Part 1 research findings Journal of the AmericanSociety for Information Science and Technology 58(8)1071ndash81
Markey K (2007b) Twenty-five years of end-usersearching Part 2 future research directions Journal ofthe American Society for Information Science andTechnology 58(8) 1123ndash30
Murray T (2016)The forecast for special librariesJournal of Library Administration 56(2) 188ndash98
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 9 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
Rethlefsen M Farrell A Osterhaus Trzasko L and
Brigham T (2015) Librarian co-authors correlated
with higher quality reported search strategies in general
internal medicine systematic reviews Journal of Clinical
Epidemiology 68(6) 617ndash26
Rockwell G (2017) Question about maximum capacity
of Voyant tools Response to Melissa Terras via email 27
January 2017
Rockwell G and Sinclair S 2016 Hermeneutica
Computer-Assisted Interpretation in the Humanities
Cambridge MIT Press
Roper T (2015) The impact of the clinical librarian a
review Journal of EAHIL 11(4) 19ndash22
Schmidt B (2014) Shipping maps and how states see
Sapping Attention Blog httpsappingattentionblogspot
couk201403shipping-maps-and-how-states-seehtml
Sinclair S (2017) Question about maximum capacity of
Voyant tools Response to Melissa Terras via email 27
January 2017
Smith D A Cordell R and Dillon E M (2013)
Infectious texts modeling text reuse in nineteenth-
century newspapers In IEEE International Conference
on Big Data October 6-9 2013 Santa Clara
California IEEE pp 86ndash94 101109
BigData20136691675
Smith D Cordell R and Mullen A (2015)
Computational methods for uncovering reprinted
texts in Antebellum newspapers American Literary
History 27(3)
Starr S S and Renford B L (1987) Evaluation of a
program to teach health professionals to search
MEDLINE Bulletin of the Medical Library Association
75(3) 193ndash201
Stijnman A (2012) Engraving and Etching 1400-2000 A
History of the Development of Manual Intaglio
Printmaking Processes London Houten Netherlands
Archetype Publications in association with HES and
DE GRAAF Publishers
Straus S Eisinga A and Sackett D (2016) What
drove the evidence cart Bringing the library to the
bedside Journal of the Royal Society of Medicine
109(6) 241ndash7
Streipe T and Talley M (2013) Embedded librarian-
ship In Kroski E (ed) Law Librarianship in the Digital
Age Plymouth Scarecrow pp 13ndash30
Terras M (2009) Potentials and Problems in Applying High
Performance Computing for Research in the Arts and
Humanities Researching e-Science Analysis of Census
Holdings Digital Humanities Quarterly httpwwwdigi-
talhumanitiesorgdhqvol34000070000070html
Terras M (2015) Opening access to collections the
making and using of open digitised cultural content
Online Information Review 39(5) 733ndash52 httpwww
emeraldinsightcomdoifull101108OIR-06-2015-0193
Thomas J (2004) Pictorial Victorians The Inscription of
Values in Word and Image Athens Ohio University
Press
Voss A Asgari-Targhi M Procter R and
Fergusson D (2010) Adoption of e-Infrastructure
services configurations of practice Philosophical
Transactions Series A Mathematical Physical and
Engineering Sciences 368 4161ndash76 doi 101098
rsta20100162
Wall A J (1893) Asiatic cholera its history pathology
and modern treatment London H K Lewis
Wittek S Sinclair S and Milner M (2015) DREaM
Distant Reading Early Modernity Digital Humanities
2015 Global Digital Humanities Sydney Australia 29
Junendash3 July 2015 httpdh2015orgabstractsxml
WITTEK_Stephen_DREaM__Distant_Reading_Early_
ModerWITTEK_Stephen_DREaM__Distant_Reading_
Early_Modernityhtml
Wynne M (2013) The role of Clarin in digital trans-
formations in the humanities International Journal of
Humanities and Arts Computing 7(1-2) 89
Wynne M (2015) Big Data and Digital Transformations in
the Humanities are we there yet Textual Digital
Humanities and Social Sciences Data Aberdeen 21ndash22
September 2015 httpwwwslidesharenetmartinwynne
big-data-and-digital-transformations-in-the-humanities
Notes1 httpswwwbluksubjectsdigital-scholarship2 httpwwwuclacukdh3 httpwwwbartlettuclacukcasa4 httpwwwuclacukisdservicesresearch-it5 httpopenglamorg an initiative to promote free
and open access to digital cultural heritage data sets6 For example Mining the History of Medicine (http
nactemacukhom) or Language of the State of the
Union (httpwwwtheatlanticcompoliticsarchive
201501the-language-of-the-state-of-the-union
384575)7 For example CLARIN (httpclarineu) and
DARIAH (httpswwwdariaheu)
M Terras et al
10 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
8 The British Library has various digital data sets
including (but not limited to) 7 million pages of his-
toric newspapers 1 million out of copyright book il-
lustrations 100000s of scientific articles text from
over 60000 books 1000s of UK theses and various
digitized medieval manuscripts We chose here just
one of its large-scale data sets to work with in this
pilot phase For the terms under which the British
Library makes collections available see httpwww
blukaboutustermscopyright9 The books cover a wide range of subject areas includ-
ing philosophy history poetry and literature most
are in English For a full list of metadata for this book
collection see doi 1021250DB2110 httpswikircuclacukwikiLegion just one of the
HPC facilities available at UCL for researchers see
httpwwwuclacukisdservicesresearch-itresearch-
computing11 It is difficult to put a figure on what a manageable text
file size would be for a researcher if we take that as
being able to search through it on their own machine
without further technical assistance This is dependent
on access to processing power which varies consider-
ably between desktop and laptop computers and the
complexity of the search the researcher is intending to
carry out as well as the format and granularity of the
source text Most humanities researchers should now
be able to comfortably manipulate text files of around
100 MB if they have access to a modern machine the
amount of text those files will contain is dependent on
how complex the file formats are Programs and tools
available to assist in text analysis include Voyant Tools
(httpsvoyant-toolsorg) and R (httpswwwr-pro-
jectorg) The upper limit to using server-based
Voyant is 20 MB of text which should correlate to
around 1 million words depending on format
(Sinclair 2017) Voyant Server can be downloaded
and used locally with an upper limit of around
100 MB on a standard PCMac (Rockwell 2017)
Indexing engines such as Lucene (httplucene
apacheorg) can search larger amounts of text how-
ever visualization of results will require further pro-
cessing and can be complex For further exploration of
lsquocomputer-assisted interpretation in the humanitiesrsquo
(particularly using Voyant) see Rockwell and
Sinclair (2016) and an example of the approach of
generating manageable smaller corpora from a large-
scale Early Modern Sources can be found in Wittek et
al (2015)12 All code data visualizations and other outputs from
this pilot project are freely available at httpsgithub
comUCL-dataspring13 httpswwwuclacukdispeopleoliverdukewilliams14 httpswwwshefacukhistoryresearchstudentswil-
liam-finley15 For more on the UKrsquos eScience infrastructure see the
work of the eScience Institute httpwwwesiacuk
Plan-Europe is the Platform of National eScience
Centres in Europe (httpplan-europeeu) In the
USA the equivalent of eScience is known as
Cyberinfrastructure see the National Science
Foundationrsquos guides httpwwwnsfgovdivindex
jspdivfrac14ACI
16 httpccschemuclacuk17 This approach is similar to that employed in the 1980s
when computerized searching of specialist databases
was first available (Farber and Shoham 2002
Markey 2007a 2007b) At that time evaluations car-
ried out on programs to cascade train end users via
librarians reported increased user uptake and effi-
ciency as compared to direct end-user training by
database providers (see Starr and Renford 1987
Bradigan and Mularski 1989 Ikeda and Schwartz
1992 on the library-based training of medical staff to
use MEDLINE) Significantly studies showed that
while end users wanted to be able to search databases
themselves they also desired access to searches
mediated by librarians andor a team approach to
search utilizing the end userrsquos specific knowledge of
the language and jargon of their discipline and the
information professionalrsquos understanding of
BOOLEAN and other advanced search techniques
(Ludwig et al 1988 Lancaster et al 1994 Bates
1996) The move towards embedded librarians in
law practices (Feliu and Frazer 2012 Streipe and
Talley 2013 Murray 2016) and clinical librarians in
hospitals (Brettle et al 2011 Roper 2015 Straus et
al 2016) can be seen to be a modern iteration of a
team approach to search In Higher Education subject
librarians are well-positioned to form part of a research
team using computational methods andor provide
information literacy training that upskills humanists
in computational methods (Rethlefsen et al 2015
Bedi and Walde 2016 Burke and Tumbleson 2016)
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 11 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
Enabling complex analysis of large-scale digital collections humanitiesresearch high-performance com-puting and transforming access toBritish Library digital collections
Melissa Terras
DepartmentofInformationStudiesUniversityCollegeLondonUKand
UCL Centre for Digital Humanities University College London UK
James Baker
School of History Art History and Philosophy University of
Sussex UK
James Hetherington
Research Software Development Group Research IT Services
University College London UK
David Beavan
UCL Centre for Digital Humanities University College London UK
Martin Zaltz Austwick
Centre for Advanced Spatial Analysis University College London UK
Anne Welsh
Department of Information Studies University College London UK
Helen OrsquoNeill
Department of Information Studies University College London
UK The London Library UK
Will Finley
Department of History University of Sheffield UK
Oliver Duke-Williams
Department of Information Studies University College London UK
Adam Farquhar
Digital Scholarship British Library UK
Digital Scholarship in the Humanities The Author 2017 Published by Oxford University Press on behalf of EADHThis is an Open Access article distributed under the terms of the Creative Commons Attribution License (httpcreativecommonsorglicensesby40) which permits unrestricted reuse distribution and reproduction in any medium provided the original work isproperly cited
1 of 11
doi101093llcfqx020Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
AbstractAlthough there has been a drive in the cultural heritage sector to provide large-scale open data sets for researchers we have not seen a commensurate rise inhumanities researchers undertaking complex analysis of these data sets for theirown research purposes This article reports on a pilot project at UniversityCollege London working in collaboration with the British Library to scopeout how best high-performance computing facilities can be used to facilitatethe needs of researchers in the humanities Using institutional data-processingframeworks routinely used to support scientific research we assisted four huma-nities researchers in analysing 60000 digitized books and we present two result-ing case studies here This research allowed us to identify infrastructural andprocedural barriers and make recommendations on resource allocation to bestsupport non-computational researchers in undertaking lsquobig datarsquo research Werecommend that research software engineer capacity can be most efficiently de-ployed in maintaining and supporting data sets while librarians can provide anessential service in running initial routine queries for humanities scholars Atpresent there are too many technical hurdles for most individuals in the huma-nities to consider analysing at scale these increasingly available open data setsand by building on existing frameworks of support from research computing andlibrary services we can best support humanities scholars in developing methodsand approaches to take advantage of these research opportunities
1 Introduction
How best can humanities researchers access andanalyse large-scale digital data sets available frominstitutions in the cultural and heritage sectorWhat barriers remain in place for those from thehumanities wishing to use high-performance com-puting (HPC) to provide insights into historicaldata sets using lsquobig-datarsquo analytical techniquesThis article describes a pilot project that workedin collaboration with non-computationally trainedhumanities researchers to identify and overcomebarriers to complex analysis of large-scale digitalcollections It used institutional university frame-works that routinely support the processing oflarge-scale data sets for research purposes in thesciences The project brought together humanitiesresearchers research software engineers (Hettrick2016) and information professionals from theBritish Library Digital Scholarship Department1
University College London (UCL) Centre forDigital Humanities2 UCL Centre for AdvancedSpatial Analysis3 and UCL Research IT Services(UCL RITS)4 to analyse an open-licenced large-scale data set from the British Library While
useful research results were generated undertakingthis project clarified the technical and proceduralbarriers that exist when humanities researchers at-tempt to utilize computational research infrastruc-tures in the pursuit of their own research questions
2 Overview
The drive in the gallery library archive andmuseum (GLAM) sector towards opening up collec-tions data5 as well as the growth in data publishedby publicly funded research projects means huma-nities researchers have a wealth of large-scale digitalcollections available to them (Lui 2015 Terras2015) Many of these data sets are released underopen licences that permit uninhibited use by anyonewith an Internet collection and modest storage cap-acity A few humanities researchers have exploitedthese resources and their interpretations makeclaims that change our understanding of culturalphenomena (Smith et al 2013 Schmidt 2014Smith et al 2015 Huber 2007 Leetaru 2015)Nevertheless there remain major barriers to thewidespread uptake of these data sets and related
Correspondence
Melissa Terras UCL
Department of Information
Studies Foster Court
University College London
Gower Street London
WC1E 6BT UK
mterrasuclacuk
M Terras et al
2 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
computational approaches by humanities re-searchers which risks diminishing the relevance ofthe humanities in lsquobig datarsquo analysis (Wynne 2015)These barriers include
fragmentation of communities resources andtools
lack of interoperability complexity and incompleteness of heterogeneous
cultural heritage data sets (Terras 2009) and lack of technical skills lsquomainstream researchers
in the humanities and social sciences often donrsquotknow what the new possibilities arersquo (ibid) andseldom have the technical experience to experi-ment (Hughes 2009 Mahony and Pierazzo2012)
A common response to this lack of awarenessand computational skills is to build Web-basedinterfaces to data6 or federated services and infra-structures7 While these interfaces play a positiverole in introducing humanities researchers tolarge-scale digital collections they rarely fulfil thecomplex needs of humanities research which con-stantly questions received approaches and results orallow researchers to tailor analysis without beinglimited by shared assumptions and methods(Wynne 2013)
3 Method
We explored the challenges associated with deploy-ing and working with large-scale digital collectionssuitable for humanities research using a publicdomain digital collection provided by the BritishLibrary8 This circa 60000-book data set covers fic-tion and non-fiction publications from the 17th18th and 19th centuries ormdashseen as datamdash224 GB of compressed ALTO XML that includesboth content (captured using an Optical CharacterRecognition (OCR) process) and the location ofthat content on a page9 Using UCLrsquos centrallyfunded computing facilities10 we worked fromMarchndashJuly 2015 with UCL RITS and a cohort offour humanities researchers (from doctoral candi-dates to mid-career scholars) to ask queries thatcould not be satisfied by search- and discovery-
orientated graphical user interfaces Working in col-laboration we turned their research questions intocomputational queries explored ways in which thereturned data could be visualized and capturedtheir thoughts on the process through semi-struc-tured interviews
4 Results
We successfully ran queries across the data set thattracked linguistic change identified core phrasesplotted the placement of illustrations and mappedlocations mentioned within core texts The semi-structured interviews conducted with non-compu-tationally trained humanities researchers at variousstages during the collaborative work supported fourkey findings First that breaking down a researchquestion into a series of more defined computa-tional queries was time-consuming and challengingSecondly that the iterative nature of this researchmethodology puts pressure on the time taken toexecute queries and that long processing timeswere frustrating Thirdly that full comprehensionof the programming code was not necessary to pro-cess data and use their outputs in research thoughunderstanding the inputs outputs and effects ofparameters was required Fourthly that creatingderived data sets of a size manageable by desktopPCs11 opened up further investigation using estab-lished methods Indeed we found that buildingqueries that generate derived data sets from large-scale digital collections (small enough to be workedon locally with familiar tools) is an effective meansof empowering non-computationally trained huma-nities researchers to develop the skill sets required toundertake complex analysis of humanities data12
Our case studies deepen and add nuance to thesefindings Two of our case studies were interested inlooking at instances of particular words or phrasesin the corpus (for example lsquoprofessorrsquo) or particu-lar combinations of phrases within the corpus(lsquohigher educationrsquo) to identify a particular institu-tion and group of persons across time The require-ments from the researchers were to return thecomplete page of text that surrounded each ex-ample This was found to be technically quite
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 3 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
straightforward and resulted in a text file being de-livered to the Humanities Scholars which they couldthen lsquoclose readrsquo to analyse each instance of thesearch term within a given page of the book in thecorpus Analysis in this case entails finding instancesof the search term in question however there arefurther possibilities that can interrogate the data setfurther in procedurally and methodologically novelways We present here two more ambitious casestudies that allowed for further visualization andanalysis
41 Case Study 1 history of medicineDuke-Williams is a senior lecturer in DigitalInformation Studies in the Department ofInformation Studies at UCL13 and his researchinterests include the presentation of spatial dataand dissemination of demographic data and thepast present and future of demographic data
capture in the UK Visualization of these kinds ofdata can be used to explore issues around the spreadof diseases and the research questions were howdoes the occurrence of diseases in published litera-ture compare to known epidemics in the 19th cen-tury Can we see any correlation between theoccurrence of infectious diseases in society and ref-erence to these diseases in both fiction and non-fiction
Variations in the number of mentions of cholera(Fig 1 continuous black line) were compared torecorded epidemics (shaded bars on Fig 1)A sharp rise in mentions coincides with the firstcholera epidemic in the UK of 1831ndash32 a similarbut less pronounced rise is coincident with the1848ndash49 epidemic A more volatile pattern of men-tions is observed after this point subsequent spikesmay be associated with epidemics within andbeyond the UK or may be less directly related to
0
05
1
15
2
25
3
1790 1810 1830 1850 1870 1890 1910
Num
ber
of re
fere
nces
per
100
0 w
ords
Date of Publication
Disease References
year_cholera_epidemics Whooping Cholera Consumpon Measles
Fig 1 A search for mentions of various infectious diseases (cholera whooping cough consumption and measles)across the 60000-book data set We compared the profound spikes for cholera in the data set with known dataregarding epidemics in the UK (Chadwick 1842 Wall 1893) which appear as the bars on the graph showing arelationship between the first major UK outbreak of cholera and its appearance within the written record of thetime (in 1831ndash32) and again with the second UK epidemic (1848ndash49) Later outbreaks (1853ndash54 and 1863) do notsee this same correlation There are further pronounced spikes for mentions of cholera in the1870s and 1880s these arenot associated with UK epidemics but there were outbreaks in the USA and elsewhere Identifying the texts that refer tothese outbreaks allows us to look more closely at these clusters and to understand the relationship between publichealth epidemiology and the published historical record
M Terras et al
4 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
disease incidence Identifying the range and type oftexts (whether epidemiological reports or worksaimed at a wider audience) may help to informand understand the cultural response to disease
This work opens up possibilities for our under-standing of trends in both fiction and non-fictionand could be linked into further data sets (for ex-ample of digitized historical newspaper data)In the case of our pilot project it demonstratedthat we could graph and visualize searches basedon the corpus to present overviews that wereuseful to our researcher but only in conjunctionwith both our research software engineer and ourinformation visualization expert this service thenmdashas a result of the person hours requiredmdashdoes notscale in practice demonstrating both the potentialin the data set and the current limited opportunitieshistorians epidemiologists and historians of sci-ence have to generate such visualizations fromopen-licenced data sets
42 Case Study 2 the history of imagesFinley is a doctoral candidate on the British Libraryand University of Sheffield Collaborative PhDStudentship lsquoThe Printed Image 1750ndash1850 towardsa Digital History of Printed Book Illustrationrsquo14
Between 1750 and 1850 changes in printing tech-nology enabled several kinds of image to proliferateand for image and text to be brought together innovel and unexpected ways Existing printing tech-nologiesmdashsuch as woodcutsmdashcontinued alongsidenew printing technologies shaping the dissemin-ation reuse and meaning of the designs they con-veyed (Stijnman 2012 Maidment 2013) Tounderstand these changes scholars have so farsampled small hand-crafted collections of imagesan approach repeated in the fields of art and culturalhistory (Donald 1996 Thomas 2004) Yet digitalsources allow us to study these changes with a muchlarger sample to use visual content as well as meta-data to grapple with past phenomena at scale
Finleyrsquos research focuses on the digital imagesfrom the same 60000-book data set our projectuses The research addresses questions such asHow did changes in image techniques and the sizeof images map onto the different genres over timeWhat do quantitative findings reveal about the
changing meanings of images from one genre tothe next How do the findings made possibleusing digital humanities techniques and digitalsources compare to those using traditional methodsand small hand-crafted collections
To support these research questions we queriedthe book XML to extract the coordinates of theboundary boxes put around each area the OCRprocess defined as an image The resulting deriveddata lists the title author place of publication andreference number for each book For each of the 1million images in these books the derived data liststhe page number it appears on x-position of thetop left corner y-position of the top left corner itswidth its height and its overall size as a percentageof the page We then took two approaches to turnFinleyrsquos research questions into computationalqueries
First we used the data derived from the HPC togenerate a graph of the instances of images by theirsize as a percentage of the page over time (Fig 2)This enabled Finley to observe the dominance of fullpage and very small images (lt15 of the page)between the 1750s and 1810s after which timemdashdriven by novel deployment of woodcuts andlithographs in booksmdashthe range of figure sizes
Fig 2 A search for figures between 1750 and 1850plotted according to the size of each figure in relationto the size of the page
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 5 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
diversified Although the graph is not normalized bythe number of images in the data set for each yearand is therefore dominated by the greater volume ofbooks in the data set after 1800 it has proved auseful reference and a new way into the macro pat-terns of book illustration
Secondly we wrote a script that could be runlocally in R to create graphs based on image datafor single books Plotting the page number on thex-axis and figures as a percentage of the page on they-axis the script generated a visual representation ofthe size of illustrations in a book Finley selectedbooks for analysis to observe patterns in the use ofillustrations in books on history geology and top-ography (the subjects of his doctoral research) Here(Fig 3) we see this for the 1817 A new and completeSystem of Modern Geography a two-volume workpublished in Newcastle upon Tyne by Mackenzie(1817) The discrepancies in use of illustrations be-tween the two volumes took Finley back to thephysical books to assess how the placement andsize of images changed the reading experience be-tween volumes and to compare the findings withsimilar multi-volume works
Subsequent to the project Finley has continuedto use the project data and scripts in his researchFor example he has used the image location data toplot and compare the average position images inbooks This has underscored the value of generatingderived data that can be used locally by a researcheroutside the context of a HPC facility and a fundedproject
5 Infrastructuralrecommendations
From a technical perspective this pilot highlightedvarious sticking points when using infrastructuredeveloped predominantly for scientific researchThe combined data input and output volumeundertaken during our work (less than 300 GB) isonly moderately large by comparison to the scien-tific data sets UCL RITS usually encounters for al-though there are shared assumptions betweenresearch infrastructures (adoption of technicalstandards and the sharing of tools approachesand research outputs (Wynne 2015)) most of theUKrsquos university eScience15 infrastructure has beenconstructed specifically to run scientific and engin-eering simulations not for search and analysis of thetypes of heterogeneous data sets we see emanatingfrom cultural heritage institutions We had a largetextual input (224 GB) a simple calculation and asmall output summary of only a few KB By com-parison the typical engineering simulationaddresses moderately sized numerical input dataruns a long complicated calculation and producesa large output (multiple TBs) The average data sizeof project using the UCL data storage service is44 TB (Hetherington 2017) For example thework of the UCL Centre for ComputationalScience16 on brain blood flow simulations takes aninput file of around 1 GB and for a full productionsimulation recording a snapshot once every 200time steps produces 20 TB of output (Groen etal 2013) Poor uptake in the Arts andHumanities (Atkins et al 2010 Voss et al 2010)has meant that these computational systems havenot been optimized for Arts and Humanities work-loads The file system and network configuration of
0
25
50
75
100
7505002500
Imag
e si
ze (
of p
age)
Page number
Illustration Area by Page for Book ID 002322267
002322267_1 002322267_2
Fig 3 A search for figures in A new and complete Systemof Modern Geography two volumes (Mackenzie 1817)plotted by page (x-axis) percentage of page the eachfigure occupies (y-axis) and separated by volume
M Terras et al
6 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
LegionmdashUCL RITSrsquos centrally funded resource forrunning complex and large computational scientificqueries across a large number of coresmdashdid notmatch the way that the data set in question wasstructured (a large number of small zipped XMLfiles)
The complexities associated with redeployingarchitectures designed to work with scientific data(massive yet very structured) to the processing ofhumanities data (not massive but more unstruc-tured) should not be understated and are a majorfinding of this project Relevant libraries (such as anefficient XML processor) were needed to be installedand optimized for the hardware Also the dataneeded to be transformed to a structure that the par-allel file system (Lustre) could address efficiently(that is fewer larger files) We found that the archi-tecture at UCL which was configured for effectivecompute of scientific data was inputoutput limitedfor our processing requirements rather than compu-tationally limited Understanding the needs of ouruser community has already fed into the procure-ment and development of HPC facilities at UCL toensure that the systemsmdashwhich are available to allresearchersmdashcan deal with the variety and type ofdata that digital humanists wish to analyse in future
Best practice recommendations for similar pro-jects emerged from this work the need to buildmultiple derived data sets (counts of books andwords per year words and pages per book etc) tonormalize results and maintain statistical validitythe necessity of documenting decisions taken whenprocessing data and metadata and the value ofhaving fixed definable data for researchers to ex-plain results in relation to (and in turn the risksassociated with iterating data sets) We also dis-covered that a core set of four or five queries gavemost of the humanities researchers the type of in-formation they required to take a subset of dataaway to process effectively themselves searches forall variants of a word searches that return keywordsin context traced over time NOT searches for aword or phrase that ignored another word orphrase searches for a word when in close proximityto a second word and searches based on imagemetadata It is the subset of the data set that mosthumanities scholars required and were happy to be
presented with for further analysis (with most re-searchers wishing to see their search term in contextpresented with the complete page of the text it wasfound within to allow informed understanding)
A main finding of this pilot was given mosthumanities researchers have a research problemthat can be facilitated by a standard set of queriesacross large-scale textual data that it would be moreefficient to train a focussed group of service pro-viders to be able to generate the results needed byresearchers than providing widespread training ofhumanities academics in this area HigherEducation already employs librarians to assist insearching and training for searching (informationliteracy) and providing this professional groupwith adjustable lsquorecipesrsquo for defined computationalqueries and background training on their use wouldsituate access to infrastructure in the resource towhich humanists already turn for assistancemdashtheirsubject librarianmdashand thereby normalize such com-putational work within the general humanitiesworkflow17 In turn research software engineerscould be invoked as collaborators for their expertisesuch as for developing more complex searchesbeyond the basic recipes rather than having torepeat the defined searches across data for differentresearchers which would allow limited resources tobe used efficiently and to build on existing frame-works of support from both the library and comput-ing services
Given issues in resourcing such facilities at everyUniversity it may be more efficient for multipleHigher Education Institutions to support a specialistservice perhaps under the umbrella of the likes ofJisc Historical Texts (httphistoricaltextsjiscacuk)or national or legal deposit libraries Expertise andapproaches if not the service itself could also befacilitated through the likes of Digital ResearchInfrastructure for the Arts and Humanities(DARIAH) (httpwwwdariaheu) and CommonLanguage Resources and Technology Infrastructure(CLARIN) (httpswwwclarineu) However localsupport for researchers wishing to utilize existingeScience technologies within the Higher Educationsector should be possible Such support enables Artsand Humanities researchers to develop ongoingmutually beneficial relationships with research
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 7 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
computing within their own institution This inturn can encourage other researchers to use theseresources (rather than them being only available as aspecialist service which users have to seek out)Research computing infrastructure across the uni-versity sector will not meet the needs of ArtsHumanities and Social Sciences researchers unlessacademics in these fields becomes active users ofthe systems and their requirements can be takeninto account going forward
6 Conclusion
We successfully mounted large-scale humanitiesdata on HPC University infrastructure in an inter-disciplinary project that required input from manyprofessionals to aid the humanities scholars intheir research tasks The collaborative approachwe undertook in this project is labour-intensiveand does not scale This should not however dis-courage the sector from taking this work forwardWe found that many research questions can beexpressed with similar computational queriesalbeit with parameters adjusted to suit We recom-mend therefore that Higher EducationInstitutions or HEI clusters looking to build cap-acity for enabling complex analysis of large-scaledigital collections by their non-computationallytrained humanities researchers should considerthe following activities
(1) Invest in research software engineer capacityto deploy and maintain openly licensed large-scale digital collections from across the GLAMsector to facilitate research in the arts huma-nities and social and historical sciences
(2) Invest in training library staff to run theseinitial queries in collaboration with huma-nities faculty to support work with subsetsof data that are produced and to documentand manage resulting code and derived data
Our pilot project demonstrates that there are atpresent too many technical hurdles for most indi-viduals in the arts and humanities to consider ana-lysing large-scale open data sets Those hurdles canbe removed with initial help in ingest and
deployment of the data and the provision of spe-cific structured training and support which willallow humanities researchers to get to a subset ofuseful data they can comfortably and more simplyprocess themselves without the need for extensivesupport While we together with our partners haveplans to continue expanding the range and depth ofresearch carried out on our chosen data set thisproject has signposted many of the barriers to en-courage greater uptake of lsquobig datarsquo research acrossthe Arts and Humanities These findings should beof use to researchers wishing to use comparableapproaches and to service providers in researchcomputing aiming to encourage the use of sharedcomputational facilities by the Arts and Humanitiescommunity
AcknowledgementsThis project was funded by a pilot grant from theJisc Research Data Spring (JISC 3548) between Apriland July 2015 See httpswwwjiscacukrdpro-jectsresearch-data-spring for further details Theauthors thank our colleagues in the British Libraryand UCL RITS (particularly Clare Gryce) for theirsupport Geoffrey Rockwell and Stefan Sinclair fordiscussions on the current limitations of text ana-lysis for individual humanities scholars and Scott BWeingart for his assistance
ReferencesAtkins D E Borgman C L Bindhoff N Ellisman
M Felman S Foster I and Heck A (2010)
lsquolsquoRCUK Review of e-Science 2009rsquorsquo Research Councils
UK httpswwwepsrcacuknewseventspubsrcuk-review-of-e-science-2009-building-a-uk-foundation-
for-the-transformative-enhancement-of-research-and-
innovation
Bates M J (1996) The Getty end-user online searching
project in the humanities report no 6 overview and
conclusions College and Research Libraries 57(6)514ndash23
Bedi S and Walde C (2016) Transforming roles
Canadian academic librarians embedded in faculty re-
search projects College and Research Libraries http
M Terras et al
8 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
crlacrlorgcontentearly20160322crl16-871
abstract
Bradigan P S and Mularski C A (1989) End-user
searching in a medical school curriculum an evaluated
modular approach Bulletin of the Medical Library
Association 77(4) 348ndash56
Brettle A Maden-Jenkins M Anderson L McNally
R Pratchett T Tancock J Thornton D and
Webb A (2011) Evaluating clinical librarian services
A systematic review Health Information and Libraries
Journal 28(1) 3ndash22
Burke J and Tumbleson B (2016) LMS embedded li-
brarianship and the educational role of librarians
Library Technology Reports 52(2) 5ndash9
Chadwick E (1842) Report on the sanitary condition of
the labouring population of Great Britain supplemen-
tary report on the results of special inquiry into the
practice of interment in towns (Vol 1) HM
Stationery Office
Donald D (1996) The Age of Caricature Satirical Prints
in the Reign of George III New Haven Published for the
Paul Mellon Centre for Studies in British Art by Yale
University Press
Farber M and Shoham S (2002) Users end-users
and end-user searchers of online information a his-
torical overview Online Information Review 26(2)
92ndash100
Feliu V and Frazer H (2012) Embedded librarians
teaching legal research as a lawyering skill Journal of
Legal Education 61(4) 540ndash59
Groen D Hetherington J Carver H B Nash R
W Bernabeu M O and Coveney P V (2013)
Analysing and modelling the performance of the
HemeLB lattice-Boltzmann simulation environ-
ment Journal of Computational Science 4(5) 412ndash
22
Hetherington J (2017) Question about data capacity
and use at UCL Response to Melissa Terras via email
27 January 2017
Hettrick S (2016) A not-so-brief history of Research
Software Engineers Software Sustainability Institute
httpswwwsoftwareacukblog2016-08-19-not-so-
brief-history-research-software-engineers
Huber M (2007) The Old Bailey Proceedings 1674-1834
Evaluating and annotating a corpus of 18th - and 19th-
century spoken English Studies in Variation Contacts and
Change in English 1 Annotating Variation and Change
httpwwwhelsinkifivariengseriesvolumes01huber
Hughes A (2009) Higher Education in a Web 20 WorldJISC httpwwwwebarchiveorgukwaybackarchive20140614042502httpwwwjiscacukpublicationsgeneralpublications2009heweb2aspx
Ikeda N R and Schwartz D G (1992) Impact ofend-user training on Pharmacy students a four-yearfollow-up study Bulletin of the Medical LibraryAssociation 80(2) 124ndash30
Lancaster F W Elzy C Zeter M J Metzler L andLow Y M (1994) Searching databases on CD-ROMcomparison of the results of end-user searching withresults from two modes of searching by skilled inter-mediaries RQ 33(3) 370ndash86
Leetaru K (2015) History as Big Data 500 Years of BookImages and Mapping Millions of Books Forbes Techhttpwwwforbescomsiteskalevleetaru20150916history-as-big-data-500-years-of-book-images-and-mapping-millions-of-books
Ludwig L Mixter J K and Emanuele M A (1988)User attitudes toward end-user literature searchingBulletin of the Medical Library Association 76(1)7ndash13
Lui A (2015) Data Collections and Datasets Curated byAlan Liu httpdhresourcesforprojectbuildingpbworkscomwpage69244469Data20Collections20and20Datasets
Mackenzie E (1817) A new and complete system ofmodern geography containing an accurate delineationof the world 2 Vols Printed and published byMackenzie and Dent Newcastle-upon-Tyne
Mahony S and Pierazzo E (2012) Teaching skills orteaching methodology In Hirsch B D (ed) DigitalHumanities Pedagogy Practices Principles and PoliticsOpen Book Publishers httpwwwopenbookpubl-isherscomproduct161digital-humanities-pedagogyndashpracticesndashprinciples-and-politics
Maidment B (2013) Comedy Caricature and the SocialOrder 1820-1850 Manchester Manchester UniversityPress
Markey K (2007a) Twenty-five years of end-user search-ing Part 1 research findings Journal of the AmericanSociety for Information Science and Technology 58(8)1071ndash81
Markey K (2007b) Twenty-five years of end-usersearching Part 2 future research directions Journal ofthe American Society for Information Science andTechnology 58(8) 1123ndash30
Murray T (2016)The forecast for special librariesJournal of Library Administration 56(2) 188ndash98
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 9 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
Rethlefsen M Farrell A Osterhaus Trzasko L and
Brigham T (2015) Librarian co-authors correlated
with higher quality reported search strategies in general
internal medicine systematic reviews Journal of Clinical
Epidemiology 68(6) 617ndash26
Rockwell G (2017) Question about maximum capacity
of Voyant tools Response to Melissa Terras via email 27
January 2017
Rockwell G and Sinclair S 2016 Hermeneutica
Computer-Assisted Interpretation in the Humanities
Cambridge MIT Press
Roper T (2015) The impact of the clinical librarian a
review Journal of EAHIL 11(4) 19ndash22
Schmidt B (2014) Shipping maps and how states see
Sapping Attention Blog httpsappingattentionblogspot
couk201403shipping-maps-and-how-states-seehtml
Sinclair S (2017) Question about maximum capacity of
Voyant tools Response to Melissa Terras via email 27
January 2017
Smith D A Cordell R and Dillon E M (2013)
Infectious texts modeling text reuse in nineteenth-
century newspapers In IEEE International Conference
on Big Data October 6-9 2013 Santa Clara
California IEEE pp 86ndash94 101109
BigData20136691675
Smith D Cordell R and Mullen A (2015)
Computational methods for uncovering reprinted
texts in Antebellum newspapers American Literary
History 27(3)
Starr S S and Renford B L (1987) Evaluation of a
program to teach health professionals to search
MEDLINE Bulletin of the Medical Library Association
75(3) 193ndash201
Stijnman A (2012) Engraving and Etching 1400-2000 A
History of the Development of Manual Intaglio
Printmaking Processes London Houten Netherlands
Archetype Publications in association with HES and
DE GRAAF Publishers
Straus S Eisinga A and Sackett D (2016) What
drove the evidence cart Bringing the library to the
bedside Journal of the Royal Society of Medicine
109(6) 241ndash7
Streipe T and Talley M (2013) Embedded librarian-
ship In Kroski E (ed) Law Librarianship in the Digital
Age Plymouth Scarecrow pp 13ndash30
Terras M (2009) Potentials and Problems in Applying High
Performance Computing for Research in the Arts and
Humanities Researching e-Science Analysis of Census
Holdings Digital Humanities Quarterly httpwwwdigi-
talhumanitiesorgdhqvol34000070000070html
Terras M (2015) Opening access to collections the
making and using of open digitised cultural content
Online Information Review 39(5) 733ndash52 httpwww
emeraldinsightcomdoifull101108OIR-06-2015-0193
Thomas J (2004) Pictorial Victorians The Inscription of
Values in Word and Image Athens Ohio University
Press
Voss A Asgari-Targhi M Procter R and
Fergusson D (2010) Adoption of e-Infrastructure
services configurations of practice Philosophical
Transactions Series A Mathematical Physical and
Engineering Sciences 368 4161ndash76 doi 101098
rsta20100162
Wall A J (1893) Asiatic cholera its history pathology
and modern treatment London H K Lewis
Wittek S Sinclair S and Milner M (2015) DREaM
Distant Reading Early Modernity Digital Humanities
2015 Global Digital Humanities Sydney Australia 29
Junendash3 July 2015 httpdh2015orgabstractsxml
WITTEK_Stephen_DREaM__Distant_Reading_Early_
ModerWITTEK_Stephen_DREaM__Distant_Reading_
Early_Modernityhtml
Wynne M (2013) The role of Clarin in digital trans-
formations in the humanities International Journal of
Humanities and Arts Computing 7(1-2) 89
Wynne M (2015) Big Data and Digital Transformations in
the Humanities are we there yet Textual Digital
Humanities and Social Sciences Data Aberdeen 21ndash22
September 2015 httpwwwslidesharenetmartinwynne
big-data-and-digital-transformations-in-the-humanities
Notes1 httpswwwbluksubjectsdigital-scholarship2 httpwwwuclacukdh3 httpwwwbartlettuclacukcasa4 httpwwwuclacukisdservicesresearch-it5 httpopenglamorg an initiative to promote free
and open access to digital cultural heritage data sets6 For example Mining the History of Medicine (http
nactemacukhom) or Language of the State of the
Union (httpwwwtheatlanticcompoliticsarchive
201501the-language-of-the-state-of-the-union
384575)7 For example CLARIN (httpclarineu) and
DARIAH (httpswwwdariaheu)
M Terras et al
10 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
8 The British Library has various digital data sets
including (but not limited to) 7 million pages of his-
toric newspapers 1 million out of copyright book il-
lustrations 100000s of scientific articles text from
over 60000 books 1000s of UK theses and various
digitized medieval manuscripts We chose here just
one of its large-scale data sets to work with in this
pilot phase For the terms under which the British
Library makes collections available see httpwww
blukaboutustermscopyright9 The books cover a wide range of subject areas includ-
ing philosophy history poetry and literature most
are in English For a full list of metadata for this book
collection see doi 1021250DB2110 httpswikircuclacukwikiLegion just one of the
HPC facilities available at UCL for researchers see
httpwwwuclacukisdservicesresearch-itresearch-
computing11 It is difficult to put a figure on what a manageable text
file size would be for a researcher if we take that as
being able to search through it on their own machine
without further technical assistance This is dependent
on access to processing power which varies consider-
ably between desktop and laptop computers and the
complexity of the search the researcher is intending to
carry out as well as the format and granularity of the
source text Most humanities researchers should now
be able to comfortably manipulate text files of around
100 MB if they have access to a modern machine the
amount of text those files will contain is dependent on
how complex the file formats are Programs and tools
available to assist in text analysis include Voyant Tools
(httpsvoyant-toolsorg) and R (httpswwwr-pro-
jectorg) The upper limit to using server-based
Voyant is 20 MB of text which should correlate to
around 1 million words depending on format
(Sinclair 2017) Voyant Server can be downloaded
and used locally with an upper limit of around
100 MB on a standard PCMac (Rockwell 2017)
Indexing engines such as Lucene (httplucene
apacheorg) can search larger amounts of text how-
ever visualization of results will require further pro-
cessing and can be complex For further exploration of
lsquocomputer-assisted interpretation in the humanitiesrsquo
(particularly using Voyant) see Rockwell and
Sinclair (2016) and an example of the approach of
generating manageable smaller corpora from a large-
scale Early Modern Sources can be found in Wittek et
al (2015)12 All code data visualizations and other outputs from
this pilot project are freely available at httpsgithub
comUCL-dataspring13 httpswwwuclacukdispeopleoliverdukewilliams14 httpswwwshefacukhistoryresearchstudentswil-
liam-finley15 For more on the UKrsquos eScience infrastructure see the
work of the eScience Institute httpwwwesiacuk
Plan-Europe is the Platform of National eScience
Centres in Europe (httpplan-europeeu) In the
USA the equivalent of eScience is known as
Cyberinfrastructure see the National Science
Foundationrsquos guides httpwwwnsfgovdivindex
jspdivfrac14ACI
16 httpccschemuclacuk17 This approach is similar to that employed in the 1980s
when computerized searching of specialist databases
was first available (Farber and Shoham 2002
Markey 2007a 2007b) At that time evaluations car-
ried out on programs to cascade train end users via
librarians reported increased user uptake and effi-
ciency as compared to direct end-user training by
database providers (see Starr and Renford 1987
Bradigan and Mularski 1989 Ikeda and Schwartz
1992 on the library-based training of medical staff to
use MEDLINE) Significantly studies showed that
while end users wanted to be able to search databases
themselves they also desired access to searches
mediated by librarians andor a team approach to
search utilizing the end userrsquos specific knowledge of
the language and jargon of their discipline and the
information professionalrsquos understanding of
BOOLEAN and other advanced search techniques
(Ludwig et al 1988 Lancaster et al 1994 Bates
1996) The move towards embedded librarians in
law practices (Feliu and Frazer 2012 Streipe and
Talley 2013 Murray 2016) and clinical librarians in
hospitals (Brettle et al 2011 Roper 2015 Straus et
al 2016) can be seen to be a modern iteration of a
team approach to search In Higher Education subject
librarians are well-positioned to form part of a research
team using computational methods andor provide
information literacy training that upskills humanists
in computational methods (Rethlefsen et al 2015
Bedi and Walde 2016 Burke and Tumbleson 2016)
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 11 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
AbstractAlthough there has been a drive in the cultural heritage sector to provide large-scale open data sets for researchers we have not seen a commensurate rise inhumanities researchers undertaking complex analysis of these data sets for theirown research purposes This article reports on a pilot project at UniversityCollege London working in collaboration with the British Library to scopeout how best high-performance computing facilities can be used to facilitatethe needs of researchers in the humanities Using institutional data-processingframeworks routinely used to support scientific research we assisted four huma-nities researchers in analysing 60000 digitized books and we present two result-ing case studies here This research allowed us to identify infrastructural andprocedural barriers and make recommendations on resource allocation to bestsupport non-computational researchers in undertaking lsquobig datarsquo research Werecommend that research software engineer capacity can be most efficiently de-ployed in maintaining and supporting data sets while librarians can provide anessential service in running initial routine queries for humanities scholars Atpresent there are too many technical hurdles for most individuals in the huma-nities to consider analysing at scale these increasingly available open data setsand by building on existing frameworks of support from research computing andlibrary services we can best support humanities scholars in developing methodsand approaches to take advantage of these research opportunities
1 Introduction
How best can humanities researchers access andanalyse large-scale digital data sets available frominstitutions in the cultural and heritage sectorWhat barriers remain in place for those from thehumanities wishing to use high-performance com-puting (HPC) to provide insights into historicaldata sets using lsquobig-datarsquo analytical techniquesThis article describes a pilot project that workedin collaboration with non-computationally trainedhumanities researchers to identify and overcomebarriers to complex analysis of large-scale digitalcollections It used institutional university frame-works that routinely support the processing oflarge-scale data sets for research purposes in thesciences The project brought together humanitiesresearchers research software engineers (Hettrick2016) and information professionals from theBritish Library Digital Scholarship Department1
University College London (UCL) Centre forDigital Humanities2 UCL Centre for AdvancedSpatial Analysis3 and UCL Research IT Services(UCL RITS)4 to analyse an open-licenced large-scale data set from the British Library While
useful research results were generated undertakingthis project clarified the technical and proceduralbarriers that exist when humanities researchers at-tempt to utilize computational research infrastruc-tures in the pursuit of their own research questions
2 Overview
The drive in the gallery library archive andmuseum (GLAM) sector towards opening up collec-tions data5 as well as the growth in data publishedby publicly funded research projects means huma-nities researchers have a wealth of large-scale digitalcollections available to them (Lui 2015 Terras2015) Many of these data sets are released underopen licences that permit uninhibited use by anyonewith an Internet collection and modest storage cap-acity A few humanities researchers have exploitedthese resources and their interpretations makeclaims that change our understanding of culturalphenomena (Smith et al 2013 Schmidt 2014Smith et al 2015 Huber 2007 Leetaru 2015)Nevertheless there remain major barriers to thewidespread uptake of these data sets and related
Correspondence
Melissa Terras UCL
Department of Information
Studies Foster Court
University College London
Gower Street London
WC1E 6BT UK
mterrasuclacuk
M Terras et al
2 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
computational approaches by humanities re-searchers which risks diminishing the relevance ofthe humanities in lsquobig datarsquo analysis (Wynne 2015)These barriers include
fragmentation of communities resources andtools
lack of interoperability complexity and incompleteness of heterogeneous
cultural heritage data sets (Terras 2009) and lack of technical skills lsquomainstream researchers
in the humanities and social sciences often donrsquotknow what the new possibilities arersquo (ibid) andseldom have the technical experience to experi-ment (Hughes 2009 Mahony and Pierazzo2012)
A common response to this lack of awarenessand computational skills is to build Web-basedinterfaces to data6 or federated services and infra-structures7 While these interfaces play a positiverole in introducing humanities researchers tolarge-scale digital collections they rarely fulfil thecomplex needs of humanities research which con-stantly questions received approaches and results orallow researchers to tailor analysis without beinglimited by shared assumptions and methods(Wynne 2013)
3 Method
We explored the challenges associated with deploy-ing and working with large-scale digital collectionssuitable for humanities research using a publicdomain digital collection provided by the BritishLibrary8 This circa 60000-book data set covers fic-tion and non-fiction publications from the 17th18th and 19th centuries ormdashseen as datamdash224 GB of compressed ALTO XML that includesboth content (captured using an Optical CharacterRecognition (OCR) process) and the location ofthat content on a page9 Using UCLrsquos centrallyfunded computing facilities10 we worked fromMarchndashJuly 2015 with UCL RITS and a cohort offour humanities researchers (from doctoral candi-dates to mid-career scholars) to ask queries thatcould not be satisfied by search- and discovery-
orientated graphical user interfaces Working in col-laboration we turned their research questions intocomputational queries explored ways in which thereturned data could be visualized and capturedtheir thoughts on the process through semi-struc-tured interviews
4 Results
We successfully ran queries across the data set thattracked linguistic change identified core phrasesplotted the placement of illustrations and mappedlocations mentioned within core texts The semi-structured interviews conducted with non-compu-tationally trained humanities researchers at variousstages during the collaborative work supported fourkey findings First that breaking down a researchquestion into a series of more defined computa-tional queries was time-consuming and challengingSecondly that the iterative nature of this researchmethodology puts pressure on the time taken toexecute queries and that long processing timeswere frustrating Thirdly that full comprehensionof the programming code was not necessary to pro-cess data and use their outputs in research thoughunderstanding the inputs outputs and effects ofparameters was required Fourthly that creatingderived data sets of a size manageable by desktopPCs11 opened up further investigation using estab-lished methods Indeed we found that buildingqueries that generate derived data sets from large-scale digital collections (small enough to be workedon locally with familiar tools) is an effective meansof empowering non-computationally trained huma-nities researchers to develop the skill sets required toundertake complex analysis of humanities data12
Our case studies deepen and add nuance to thesefindings Two of our case studies were interested inlooking at instances of particular words or phrasesin the corpus (for example lsquoprofessorrsquo) or particu-lar combinations of phrases within the corpus(lsquohigher educationrsquo) to identify a particular institu-tion and group of persons across time The require-ments from the researchers were to return thecomplete page of text that surrounded each ex-ample This was found to be technically quite
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 3 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
straightforward and resulted in a text file being de-livered to the Humanities Scholars which they couldthen lsquoclose readrsquo to analyse each instance of thesearch term within a given page of the book in thecorpus Analysis in this case entails finding instancesof the search term in question however there arefurther possibilities that can interrogate the data setfurther in procedurally and methodologically novelways We present here two more ambitious casestudies that allowed for further visualization andanalysis
41 Case Study 1 history of medicineDuke-Williams is a senior lecturer in DigitalInformation Studies in the Department ofInformation Studies at UCL13 and his researchinterests include the presentation of spatial dataand dissemination of demographic data and thepast present and future of demographic data
capture in the UK Visualization of these kinds ofdata can be used to explore issues around the spreadof diseases and the research questions were howdoes the occurrence of diseases in published litera-ture compare to known epidemics in the 19th cen-tury Can we see any correlation between theoccurrence of infectious diseases in society and ref-erence to these diseases in both fiction and non-fiction
Variations in the number of mentions of cholera(Fig 1 continuous black line) were compared torecorded epidemics (shaded bars on Fig 1)A sharp rise in mentions coincides with the firstcholera epidemic in the UK of 1831ndash32 a similarbut less pronounced rise is coincident with the1848ndash49 epidemic A more volatile pattern of men-tions is observed after this point subsequent spikesmay be associated with epidemics within andbeyond the UK or may be less directly related to
0
05
1
15
2
25
3
1790 1810 1830 1850 1870 1890 1910
Num
ber
of re
fere
nces
per
100
0 w
ords
Date of Publication
Disease References
year_cholera_epidemics Whooping Cholera Consumpon Measles
Fig 1 A search for mentions of various infectious diseases (cholera whooping cough consumption and measles)across the 60000-book data set We compared the profound spikes for cholera in the data set with known dataregarding epidemics in the UK (Chadwick 1842 Wall 1893) which appear as the bars on the graph showing arelationship between the first major UK outbreak of cholera and its appearance within the written record of thetime (in 1831ndash32) and again with the second UK epidemic (1848ndash49) Later outbreaks (1853ndash54 and 1863) do notsee this same correlation There are further pronounced spikes for mentions of cholera in the1870s and 1880s these arenot associated with UK epidemics but there were outbreaks in the USA and elsewhere Identifying the texts that refer tothese outbreaks allows us to look more closely at these clusters and to understand the relationship between publichealth epidemiology and the published historical record
M Terras et al
4 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
disease incidence Identifying the range and type oftexts (whether epidemiological reports or worksaimed at a wider audience) may help to informand understand the cultural response to disease
This work opens up possibilities for our under-standing of trends in both fiction and non-fictionand could be linked into further data sets (for ex-ample of digitized historical newspaper data)In the case of our pilot project it demonstratedthat we could graph and visualize searches basedon the corpus to present overviews that wereuseful to our researcher but only in conjunctionwith both our research software engineer and ourinformation visualization expert this service thenmdashas a result of the person hours requiredmdashdoes notscale in practice demonstrating both the potentialin the data set and the current limited opportunitieshistorians epidemiologists and historians of sci-ence have to generate such visualizations fromopen-licenced data sets
42 Case Study 2 the history of imagesFinley is a doctoral candidate on the British Libraryand University of Sheffield Collaborative PhDStudentship lsquoThe Printed Image 1750ndash1850 towardsa Digital History of Printed Book Illustrationrsquo14
Between 1750 and 1850 changes in printing tech-nology enabled several kinds of image to proliferateand for image and text to be brought together innovel and unexpected ways Existing printing tech-nologiesmdashsuch as woodcutsmdashcontinued alongsidenew printing technologies shaping the dissemin-ation reuse and meaning of the designs they con-veyed (Stijnman 2012 Maidment 2013) Tounderstand these changes scholars have so farsampled small hand-crafted collections of imagesan approach repeated in the fields of art and culturalhistory (Donald 1996 Thomas 2004) Yet digitalsources allow us to study these changes with a muchlarger sample to use visual content as well as meta-data to grapple with past phenomena at scale
Finleyrsquos research focuses on the digital imagesfrom the same 60000-book data set our projectuses The research addresses questions such asHow did changes in image techniques and the sizeof images map onto the different genres over timeWhat do quantitative findings reveal about the
changing meanings of images from one genre tothe next How do the findings made possibleusing digital humanities techniques and digitalsources compare to those using traditional methodsand small hand-crafted collections
To support these research questions we queriedthe book XML to extract the coordinates of theboundary boxes put around each area the OCRprocess defined as an image The resulting deriveddata lists the title author place of publication andreference number for each book For each of the 1million images in these books the derived data liststhe page number it appears on x-position of thetop left corner y-position of the top left corner itswidth its height and its overall size as a percentageof the page We then took two approaches to turnFinleyrsquos research questions into computationalqueries
First we used the data derived from the HPC togenerate a graph of the instances of images by theirsize as a percentage of the page over time (Fig 2)This enabled Finley to observe the dominance of fullpage and very small images (lt15 of the page)between the 1750s and 1810s after which timemdashdriven by novel deployment of woodcuts andlithographs in booksmdashthe range of figure sizes
Fig 2 A search for figures between 1750 and 1850plotted according to the size of each figure in relationto the size of the page
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 5 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
diversified Although the graph is not normalized bythe number of images in the data set for each yearand is therefore dominated by the greater volume ofbooks in the data set after 1800 it has proved auseful reference and a new way into the macro pat-terns of book illustration
Secondly we wrote a script that could be runlocally in R to create graphs based on image datafor single books Plotting the page number on thex-axis and figures as a percentage of the page on they-axis the script generated a visual representation ofthe size of illustrations in a book Finley selectedbooks for analysis to observe patterns in the use ofillustrations in books on history geology and top-ography (the subjects of his doctoral research) Here(Fig 3) we see this for the 1817 A new and completeSystem of Modern Geography a two-volume workpublished in Newcastle upon Tyne by Mackenzie(1817) The discrepancies in use of illustrations be-tween the two volumes took Finley back to thephysical books to assess how the placement andsize of images changed the reading experience be-tween volumes and to compare the findings withsimilar multi-volume works
Subsequent to the project Finley has continuedto use the project data and scripts in his researchFor example he has used the image location data toplot and compare the average position images inbooks This has underscored the value of generatingderived data that can be used locally by a researcheroutside the context of a HPC facility and a fundedproject
5 Infrastructuralrecommendations
From a technical perspective this pilot highlightedvarious sticking points when using infrastructuredeveloped predominantly for scientific researchThe combined data input and output volumeundertaken during our work (less than 300 GB) isonly moderately large by comparison to the scien-tific data sets UCL RITS usually encounters for al-though there are shared assumptions betweenresearch infrastructures (adoption of technicalstandards and the sharing of tools approachesand research outputs (Wynne 2015)) most of theUKrsquos university eScience15 infrastructure has beenconstructed specifically to run scientific and engin-eering simulations not for search and analysis of thetypes of heterogeneous data sets we see emanatingfrom cultural heritage institutions We had a largetextual input (224 GB) a simple calculation and asmall output summary of only a few KB By com-parison the typical engineering simulationaddresses moderately sized numerical input dataruns a long complicated calculation and producesa large output (multiple TBs) The average data sizeof project using the UCL data storage service is44 TB (Hetherington 2017) For example thework of the UCL Centre for ComputationalScience16 on brain blood flow simulations takes aninput file of around 1 GB and for a full productionsimulation recording a snapshot once every 200time steps produces 20 TB of output (Groen etal 2013) Poor uptake in the Arts andHumanities (Atkins et al 2010 Voss et al 2010)has meant that these computational systems havenot been optimized for Arts and Humanities work-loads The file system and network configuration of
0
25
50
75
100
7505002500
Imag
e si
ze (
of p
age)
Page number
Illustration Area by Page for Book ID 002322267
002322267_1 002322267_2
Fig 3 A search for figures in A new and complete Systemof Modern Geography two volumes (Mackenzie 1817)plotted by page (x-axis) percentage of page the eachfigure occupies (y-axis) and separated by volume
M Terras et al
6 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
LegionmdashUCL RITSrsquos centrally funded resource forrunning complex and large computational scientificqueries across a large number of coresmdashdid notmatch the way that the data set in question wasstructured (a large number of small zipped XMLfiles)
The complexities associated with redeployingarchitectures designed to work with scientific data(massive yet very structured) to the processing ofhumanities data (not massive but more unstruc-tured) should not be understated and are a majorfinding of this project Relevant libraries (such as anefficient XML processor) were needed to be installedand optimized for the hardware Also the dataneeded to be transformed to a structure that the par-allel file system (Lustre) could address efficiently(that is fewer larger files) We found that the archi-tecture at UCL which was configured for effectivecompute of scientific data was inputoutput limitedfor our processing requirements rather than compu-tationally limited Understanding the needs of ouruser community has already fed into the procure-ment and development of HPC facilities at UCL toensure that the systemsmdashwhich are available to allresearchersmdashcan deal with the variety and type ofdata that digital humanists wish to analyse in future
Best practice recommendations for similar pro-jects emerged from this work the need to buildmultiple derived data sets (counts of books andwords per year words and pages per book etc) tonormalize results and maintain statistical validitythe necessity of documenting decisions taken whenprocessing data and metadata and the value ofhaving fixed definable data for researchers to ex-plain results in relation to (and in turn the risksassociated with iterating data sets) We also dis-covered that a core set of four or five queries gavemost of the humanities researchers the type of in-formation they required to take a subset of dataaway to process effectively themselves searches forall variants of a word searches that return keywordsin context traced over time NOT searches for aword or phrase that ignored another word orphrase searches for a word when in close proximityto a second word and searches based on imagemetadata It is the subset of the data set that mosthumanities scholars required and were happy to be
presented with for further analysis (with most re-searchers wishing to see their search term in contextpresented with the complete page of the text it wasfound within to allow informed understanding)
A main finding of this pilot was given mosthumanities researchers have a research problemthat can be facilitated by a standard set of queriesacross large-scale textual data that it would be moreefficient to train a focussed group of service pro-viders to be able to generate the results needed byresearchers than providing widespread training ofhumanities academics in this area HigherEducation already employs librarians to assist insearching and training for searching (informationliteracy) and providing this professional groupwith adjustable lsquorecipesrsquo for defined computationalqueries and background training on their use wouldsituate access to infrastructure in the resource towhich humanists already turn for assistancemdashtheirsubject librarianmdashand thereby normalize such com-putational work within the general humanitiesworkflow17 In turn research software engineerscould be invoked as collaborators for their expertisesuch as for developing more complex searchesbeyond the basic recipes rather than having torepeat the defined searches across data for differentresearchers which would allow limited resources tobe used efficiently and to build on existing frame-works of support from both the library and comput-ing services
Given issues in resourcing such facilities at everyUniversity it may be more efficient for multipleHigher Education Institutions to support a specialistservice perhaps under the umbrella of the likes ofJisc Historical Texts (httphistoricaltextsjiscacuk)or national or legal deposit libraries Expertise andapproaches if not the service itself could also befacilitated through the likes of Digital ResearchInfrastructure for the Arts and Humanities(DARIAH) (httpwwwdariaheu) and CommonLanguage Resources and Technology Infrastructure(CLARIN) (httpswwwclarineu) However localsupport for researchers wishing to utilize existingeScience technologies within the Higher Educationsector should be possible Such support enables Artsand Humanities researchers to develop ongoingmutually beneficial relationships with research
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 7 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
computing within their own institution This inturn can encourage other researchers to use theseresources (rather than them being only available as aspecialist service which users have to seek out)Research computing infrastructure across the uni-versity sector will not meet the needs of ArtsHumanities and Social Sciences researchers unlessacademics in these fields becomes active users ofthe systems and their requirements can be takeninto account going forward
6 Conclusion
We successfully mounted large-scale humanitiesdata on HPC University infrastructure in an inter-disciplinary project that required input from manyprofessionals to aid the humanities scholars intheir research tasks The collaborative approachwe undertook in this project is labour-intensiveand does not scale This should not however dis-courage the sector from taking this work forwardWe found that many research questions can beexpressed with similar computational queriesalbeit with parameters adjusted to suit We recom-mend therefore that Higher EducationInstitutions or HEI clusters looking to build cap-acity for enabling complex analysis of large-scaledigital collections by their non-computationallytrained humanities researchers should considerthe following activities
(1) Invest in research software engineer capacityto deploy and maintain openly licensed large-scale digital collections from across the GLAMsector to facilitate research in the arts huma-nities and social and historical sciences
(2) Invest in training library staff to run theseinitial queries in collaboration with huma-nities faculty to support work with subsetsof data that are produced and to documentand manage resulting code and derived data
Our pilot project demonstrates that there are atpresent too many technical hurdles for most indi-viduals in the arts and humanities to consider ana-lysing large-scale open data sets Those hurdles canbe removed with initial help in ingest and
deployment of the data and the provision of spe-cific structured training and support which willallow humanities researchers to get to a subset ofuseful data they can comfortably and more simplyprocess themselves without the need for extensivesupport While we together with our partners haveplans to continue expanding the range and depth ofresearch carried out on our chosen data set thisproject has signposted many of the barriers to en-courage greater uptake of lsquobig datarsquo research acrossthe Arts and Humanities These findings should beof use to researchers wishing to use comparableapproaches and to service providers in researchcomputing aiming to encourage the use of sharedcomputational facilities by the Arts and Humanitiescommunity
AcknowledgementsThis project was funded by a pilot grant from theJisc Research Data Spring (JISC 3548) between Apriland July 2015 See httpswwwjiscacukrdpro-jectsresearch-data-spring for further details Theauthors thank our colleagues in the British Libraryand UCL RITS (particularly Clare Gryce) for theirsupport Geoffrey Rockwell and Stefan Sinclair fordiscussions on the current limitations of text ana-lysis for individual humanities scholars and Scott BWeingart for his assistance
ReferencesAtkins D E Borgman C L Bindhoff N Ellisman
M Felman S Foster I and Heck A (2010)
lsquolsquoRCUK Review of e-Science 2009rsquorsquo Research Councils
UK httpswwwepsrcacuknewseventspubsrcuk-review-of-e-science-2009-building-a-uk-foundation-
for-the-transformative-enhancement-of-research-and-
innovation
Bates M J (1996) The Getty end-user online searching
project in the humanities report no 6 overview and
conclusions College and Research Libraries 57(6)514ndash23
Bedi S and Walde C (2016) Transforming roles
Canadian academic librarians embedded in faculty re-
search projects College and Research Libraries http
M Terras et al
8 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
crlacrlorgcontentearly20160322crl16-871
abstract
Bradigan P S and Mularski C A (1989) End-user
searching in a medical school curriculum an evaluated
modular approach Bulletin of the Medical Library
Association 77(4) 348ndash56
Brettle A Maden-Jenkins M Anderson L McNally
R Pratchett T Tancock J Thornton D and
Webb A (2011) Evaluating clinical librarian services
A systematic review Health Information and Libraries
Journal 28(1) 3ndash22
Burke J and Tumbleson B (2016) LMS embedded li-
brarianship and the educational role of librarians
Library Technology Reports 52(2) 5ndash9
Chadwick E (1842) Report on the sanitary condition of
the labouring population of Great Britain supplemen-
tary report on the results of special inquiry into the
practice of interment in towns (Vol 1) HM
Stationery Office
Donald D (1996) The Age of Caricature Satirical Prints
in the Reign of George III New Haven Published for the
Paul Mellon Centre for Studies in British Art by Yale
University Press
Farber M and Shoham S (2002) Users end-users
and end-user searchers of online information a his-
torical overview Online Information Review 26(2)
92ndash100
Feliu V and Frazer H (2012) Embedded librarians
teaching legal research as a lawyering skill Journal of
Legal Education 61(4) 540ndash59
Groen D Hetherington J Carver H B Nash R
W Bernabeu M O and Coveney P V (2013)
Analysing and modelling the performance of the
HemeLB lattice-Boltzmann simulation environ-
ment Journal of Computational Science 4(5) 412ndash
22
Hetherington J (2017) Question about data capacity
and use at UCL Response to Melissa Terras via email
27 January 2017
Hettrick S (2016) A not-so-brief history of Research
Software Engineers Software Sustainability Institute
httpswwwsoftwareacukblog2016-08-19-not-so-
brief-history-research-software-engineers
Huber M (2007) The Old Bailey Proceedings 1674-1834
Evaluating and annotating a corpus of 18th - and 19th-
century spoken English Studies in Variation Contacts and
Change in English 1 Annotating Variation and Change
httpwwwhelsinkifivariengseriesvolumes01huber
Hughes A (2009) Higher Education in a Web 20 WorldJISC httpwwwwebarchiveorgukwaybackarchive20140614042502httpwwwjiscacukpublicationsgeneralpublications2009heweb2aspx
Ikeda N R and Schwartz D G (1992) Impact ofend-user training on Pharmacy students a four-yearfollow-up study Bulletin of the Medical LibraryAssociation 80(2) 124ndash30
Lancaster F W Elzy C Zeter M J Metzler L andLow Y M (1994) Searching databases on CD-ROMcomparison of the results of end-user searching withresults from two modes of searching by skilled inter-mediaries RQ 33(3) 370ndash86
Leetaru K (2015) History as Big Data 500 Years of BookImages and Mapping Millions of Books Forbes Techhttpwwwforbescomsiteskalevleetaru20150916history-as-big-data-500-years-of-book-images-and-mapping-millions-of-books
Ludwig L Mixter J K and Emanuele M A (1988)User attitudes toward end-user literature searchingBulletin of the Medical Library Association 76(1)7ndash13
Lui A (2015) Data Collections and Datasets Curated byAlan Liu httpdhresourcesforprojectbuildingpbworkscomwpage69244469Data20Collections20and20Datasets
Mackenzie E (1817) A new and complete system ofmodern geography containing an accurate delineationof the world 2 Vols Printed and published byMackenzie and Dent Newcastle-upon-Tyne
Mahony S and Pierazzo E (2012) Teaching skills orteaching methodology In Hirsch B D (ed) DigitalHumanities Pedagogy Practices Principles and PoliticsOpen Book Publishers httpwwwopenbookpubl-isherscomproduct161digital-humanities-pedagogyndashpracticesndashprinciples-and-politics
Maidment B (2013) Comedy Caricature and the SocialOrder 1820-1850 Manchester Manchester UniversityPress
Markey K (2007a) Twenty-five years of end-user search-ing Part 1 research findings Journal of the AmericanSociety for Information Science and Technology 58(8)1071ndash81
Markey K (2007b) Twenty-five years of end-usersearching Part 2 future research directions Journal ofthe American Society for Information Science andTechnology 58(8) 1123ndash30
Murray T (2016)The forecast for special librariesJournal of Library Administration 56(2) 188ndash98
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 9 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
Rethlefsen M Farrell A Osterhaus Trzasko L and
Brigham T (2015) Librarian co-authors correlated
with higher quality reported search strategies in general
internal medicine systematic reviews Journal of Clinical
Epidemiology 68(6) 617ndash26
Rockwell G (2017) Question about maximum capacity
of Voyant tools Response to Melissa Terras via email 27
January 2017
Rockwell G and Sinclair S 2016 Hermeneutica
Computer-Assisted Interpretation in the Humanities
Cambridge MIT Press
Roper T (2015) The impact of the clinical librarian a
review Journal of EAHIL 11(4) 19ndash22
Schmidt B (2014) Shipping maps and how states see
Sapping Attention Blog httpsappingattentionblogspot
couk201403shipping-maps-and-how-states-seehtml
Sinclair S (2017) Question about maximum capacity of
Voyant tools Response to Melissa Terras via email 27
January 2017
Smith D A Cordell R and Dillon E M (2013)
Infectious texts modeling text reuse in nineteenth-
century newspapers In IEEE International Conference
on Big Data October 6-9 2013 Santa Clara
California IEEE pp 86ndash94 101109
BigData20136691675
Smith D Cordell R and Mullen A (2015)
Computational methods for uncovering reprinted
texts in Antebellum newspapers American Literary
History 27(3)
Starr S S and Renford B L (1987) Evaluation of a
program to teach health professionals to search
MEDLINE Bulletin of the Medical Library Association
75(3) 193ndash201
Stijnman A (2012) Engraving and Etching 1400-2000 A
History of the Development of Manual Intaglio
Printmaking Processes London Houten Netherlands
Archetype Publications in association with HES and
DE GRAAF Publishers
Straus S Eisinga A and Sackett D (2016) What
drove the evidence cart Bringing the library to the
bedside Journal of the Royal Society of Medicine
109(6) 241ndash7
Streipe T and Talley M (2013) Embedded librarian-
ship In Kroski E (ed) Law Librarianship in the Digital
Age Plymouth Scarecrow pp 13ndash30
Terras M (2009) Potentials and Problems in Applying High
Performance Computing for Research in the Arts and
Humanities Researching e-Science Analysis of Census
Holdings Digital Humanities Quarterly httpwwwdigi-
talhumanitiesorgdhqvol34000070000070html
Terras M (2015) Opening access to collections the
making and using of open digitised cultural content
Online Information Review 39(5) 733ndash52 httpwww
emeraldinsightcomdoifull101108OIR-06-2015-0193
Thomas J (2004) Pictorial Victorians The Inscription of
Values in Word and Image Athens Ohio University
Press
Voss A Asgari-Targhi M Procter R and
Fergusson D (2010) Adoption of e-Infrastructure
services configurations of practice Philosophical
Transactions Series A Mathematical Physical and
Engineering Sciences 368 4161ndash76 doi 101098
rsta20100162
Wall A J (1893) Asiatic cholera its history pathology
and modern treatment London H K Lewis
Wittek S Sinclair S and Milner M (2015) DREaM
Distant Reading Early Modernity Digital Humanities
2015 Global Digital Humanities Sydney Australia 29
Junendash3 July 2015 httpdh2015orgabstractsxml
WITTEK_Stephen_DREaM__Distant_Reading_Early_
ModerWITTEK_Stephen_DREaM__Distant_Reading_
Early_Modernityhtml
Wynne M (2013) The role of Clarin in digital trans-
formations in the humanities International Journal of
Humanities and Arts Computing 7(1-2) 89
Wynne M (2015) Big Data and Digital Transformations in
the Humanities are we there yet Textual Digital
Humanities and Social Sciences Data Aberdeen 21ndash22
September 2015 httpwwwslidesharenetmartinwynne
big-data-and-digital-transformations-in-the-humanities
Notes1 httpswwwbluksubjectsdigital-scholarship2 httpwwwuclacukdh3 httpwwwbartlettuclacukcasa4 httpwwwuclacukisdservicesresearch-it5 httpopenglamorg an initiative to promote free
and open access to digital cultural heritage data sets6 For example Mining the History of Medicine (http
nactemacukhom) or Language of the State of the
Union (httpwwwtheatlanticcompoliticsarchive
201501the-language-of-the-state-of-the-union
384575)7 For example CLARIN (httpclarineu) and
DARIAH (httpswwwdariaheu)
M Terras et al
10 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
8 The British Library has various digital data sets
including (but not limited to) 7 million pages of his-
toric newspapers 1 million out of copyright book il-
lustrations 100000s of scientific articles text from
over 60000 books 1000s of UK theses and various
digitized medieval manuscripts We chose here just
one of its large-scale data sets to work with in this
pilot phase For the terms under which the British
Library makes collections available see httpwww
blukaboutustermscopyright9 The books cover a wide range of subject areas includ-
ing philosophy history poetry and literature most
are in English For a full list of metadata for this book
collection see doi 1021250DB2110 httpswikircuclacukwikiLegion just one of the
HPC facilities available at UCL for researchers see
httpwwwuclacukisdservicesresearch-itresearch-
computing11 It is difficult to put a figure on what a manageable text
file size would be for a researcher if we take that as
being able to search through it on their own machine
without further technical assistance This is dependent
on access to processing power which varies consider-
ably between desktop and laptop computers and the
complexity of the search the researcher is intending to
carry out as well as the format and granularity of the
source text Most humanities researchers should now
be able to comfortably manipulate text files of around
100 MB if they have access to a modern machine the
amount of text those files will contain is dependent on
how complex the file formats are Programs and tools
available to assist in text analysis include Voyant Tools
(httpsvoyant-toolsorg) and R (httpswwwr-pro-
jectorg) The upper limit to using server-based
Voyant is 20 MB of text which should correlate to
around 1 million words depending on format
(Sinclair 2017) Voyant Server can be downloaded
and used locally with an upper limit of around
100 MB on a standard PCMac (Rockwell 2017)
Indexing engines such as Lucene (httplucene
apacheorg) can search larger amounts of text how-
ever visualization of results will require further pro-
cessing and can be complex For further exploration of
lsquocomputer-assisted interpretation in the humanitiesrsquo
(particularly using Voyant) see Rockwell and
Sinclair (2016) and an example of the approach of
generating manageable smaller corpora from a large-
scale Early Modern Sources can be found in Wittek et
al (2015)12 All code data visualizations and other outputs from
this pilot project are freely available at httpsgithub
comUCL-dataspring13 httpswwwuclacukdispeopleoliverdukewilliams14 httpswwwshefacukhistoryresearchstudentswil-
liam-finley15 For more on the UKrsquos eScience infrastructure see the
work of the eScience Institute httpwwwesiacuk
Plan-Europe is the Platform of National eScience
Centres in Europe (httpplan-europeeu) In the
USA the equivalent of eScience is known as
Cyberinfrastructure see the National Science
Foundationrsquos guides httpwwwnsfgovdivindex
jspdivfrac14ACI
16 httpccschemuclacuk17 This approach is similar to that employed in the 1980s
when computerized searching of specialist databases
was first available (Farber and Shoham 2002
Markey 2007a 2007b) At that time evaluations car-
ried out on programs to cascade train end users via
librarians reported increased user uptake and effi-
ciency as compared to direct end-user training by
database providers (see Starr and Renford 1987
Bradigan and Mularski 1989 Ikeda and Schwartz
1992 on the library-based training of medical staff to
use MEDLINE) Significantly studies showed that
while end users wanted to be able to search databases
themselves they also desired access to searches
mediated by librarians andor a team approach to
search utilizing the end userrsquos specific knowledge of
the language and jargon of their discipline and the
information professionalrsquos understanding of
BOOLEAN and other advanced search techniques
(Ludwig et al 1988 Lancaster et al 1994 Bates
1996) The move towards embedded librarians in
law practices (Feliu and Frazer 2012 Streipe and
Talley 2013 Murray 2016) and clinical librarians in
hospitals (Brettle et al 2011 Roper 2015 Straus et
al 2016) can be seen to be a modern iteration of a
team approach to search In Higher Education subject
librarians are well-positioned to form part of a research
team using computational methods andor provide
information literacy training that upskills humanists
in computational methods (Rethlefsen et al 2015
Bedi and Walde 2016 Burke and Tumbleson 2016)
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 11 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
computational approaches by humanities re-searchers which risks diminishing the relevance ofthe humanities in lsquobig datarsquo analysis (Wynne 2015)These barriers include
fragmentation of communities resources andtools
lack of interoperability complexity and incompleteness of heterogeneous
cultural heritage data sets (Terras 2009) and lack of technical skills lsquomainstream researchers
in the humanities and social sciences often donrsquotknow what the new possibilities arersquo (ibid) andseldom have the technical experience to experi-ment (Hughes 2009 Mahony and Pierazzo2012)
A common response to this lack of awarenessand computational skills is to build Web-basedinterfaces to data6 or federated services and infra-structures7 While these interfaces play a positiverole in introducing humanities researchers tolarge-scale digital collections they rarely fulfil thecomplex needs of humanities research which con-stantly questions received approaches and results orallow researchers to tailor analysis without beinglimited by shared assumptions and methods(Wynne 2013)
3 Method
We explored the challenges associated with deploy-ing and working with large-scale digital collectionssuitable for humanities research using a publicdomain digital collection provided by the BritishLibrary8 This circa 60000-book data set covers fic-tion and non-fiction publications from the 17th18th and 19th centuries ormdashseen as datamdash224 GB of compressed ALTO XML that includesboth content (captured using an Optical CharacterRecognition (OCR) process) and the location ofthat content on a page9 Using UCLrsquos centrallyfunded computing facilities10 we worked fromMarchndashJuly 2015 with UCL RITS and a cohort offour humanities researchers (from doctoral candi-dates to mid-career scholars) to ask queries thatcould not be satisfied by search- and discovery-
orientated graphical user interfaces Working in col-laboration we turned their research questions intocomputational queries explored ways in which thereturned data could be visualized and capturedtheir thoughts on the process through semi-struc-tured interviews
4 Results
We successfully ran queries across the data set thattracked linguistic change identified core phrasesplotted the placement of illustrations and mappedlocations mentioned within core texts The semi-structured interviews conducted with non-compu-tationally trained humanities researchers at variousstages during the collaborative work supported fourkey findings First that breaking down a researchquestion into a series of more defined computa-tional queries was time-consuming and challengingSecondly that the iterative nature of this researchmethodology puts pressure on the time taken toexecute queries and that long processing timeswere frustrating Thirdly that full comprehensionof the programming code was not necessary to pro-cess data and use their outputs in research thoughunderstanding the inputs outputs and effects ofparameters was required Fourthly that creatingderived data sets of a size manageable by desktopPCs11 opened up further investigation using estab-lished methods Indeed we found that buildingqueries that generate derived data sets from large-scale digital collections (small enough to be workedon locally with familiar tools) is an effective meansof empowering non-computationally trained huma-nities researchers to develop the skill sets required toundertake complex analysis of humanities data12
Our case studies deepen and add nuance to thesefindings Two of our case studies were interested inlooking at instances of particular words or phrasesin the corpus (for example lsquoprofessorrsquo) or particu-lar combinations of phrases within the corpus(lsquohigher educationrsquo) to identify a particular institu-tion and group of persons across time The require-ments from the researchers were to return thecomplete page of text that surrounded each ex-ample This was found to be technically quite
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 3 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
straightforward and resulted in a text file being de-livered to the Humanities Scholars which they couldthen lsquoclose readrsquo to analyse each instance of thesearch term within a given page of the book in thecorpus Analysis in this case entails finding instancesof the search term in question however there arefurther possibilities that can interrogate the data setfurther in procedurally and methodologically novelways We present here two more ambitious casestudies that allowed for further visualization andanalysis
41 Case Study 1 history of medicineDuke-Williams is a senior lecturer in DigitalInformation Studies in the Department ofInformation Studies at UCL13 and his researchinterests include the presentation of spatial dataand dissemination of demographic data and thepast present and future of demographic data
capture in the UK Visualization of these kinds ofdata can be used to explore issues around the spreadof diseases and the research questions were howdoes the occurrence of diseases in published litera-ture compare to known epidemics in the 19th cen-tury Can we see any correlation between theoccurrence of infectious diseases in society and ref-erence to these diseases in both fiction and non-fiction
Variations in the number of mentions of cholera(Fig 1 continuous black line) were compared torecorded epidemics (shaded bars on Fig 1)A sharp rise in mentions coincides with the firstcholera epidemic in the UK of 1831ndash32 a similarbut less pronounced rise is coincident with the1848ndash49 epidemic A more volatile pattern of men-tions is observed after this point subsequent spikesmay be associated with epidemics within andbeyond the UK or may be less directly related to
0
05
1
15
2
25
3
1790 1810 1830 1850 1870 1890 1910
Num
ber
of re
fere
nces
per
100
0 w
ords
Date of Publication
Disease References
year_cholera_epidemics Whooping Cholera Consumpon Measles
Fig 1 A search for mentions of various infectious diseases (cholera whooping cough consumption and measles)across the 60000-book data set We compared the profound spikes for cholera in the data set with known dataregarding epidemics in the UK (Chadwick 1842 Wall 1893) which appear as the bars on the graph showing arelationship between the first major UK outbreak of cholera and its appearance within the written record of thetime (in 1831ndash32) and again with the second UK epidemic (1848ndash49) Later outbreaks (1853ndash54 and 1863) do notsee this same correlation There are further pronounced spikes for mentions of cholera in the1870s and 1880s these arenot associated with UK epidemics but there were outbreaks in the USA and elsewhere Identifying the texts that refer tothese outbreaks allows us to look more closely at these clusters and to understand the relationship between publichealth epidemiology and the published historical record
M Terras et al
4 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
disease incidence Identifying the range and type oftexts (whether epidemiological reports or worksaimed at a wider audience) may help to informand understand the cultural response to disease
This work opens up possibilities for our under-standing of trends in both fiction and non-fictionand could be linked into further data sets (for ex-ample of digitized historical newspaper data)In the case of our pilot project it demonstratedthat we could graph and visualize searches basedon the corpus to present overviews that wereuseful to our researcher but only in conjunctionwith both our research software engineer and ourinformation visualization expert this service thenmdashas a result of the person hours requiredmdashdoes notscale in practice demonstrating both the potentialin the data set and the current limited opportunitieshistorians epidemiologists and historians of sci-ence have to generate such visualizations fromopen-licenced data sets
42 Case Study 2 the history of imagesFinley is a doctoral candidate on the British Libraryand University of Sheffield Collaborative PhDStudentship lsquoThe Printed Image 1750ndash1850 towardsa Digital History of Printed Book Illustrationrsquo14
Between 1750 and 1850 changes in printing tech-nology enabled several kinds of image to proliferateand for image and text to be brought together innovel and unexpected ways Existing printing tech-nologiesmdashsuch as woodcutsmdashcontinued alongsidenew printing technologies shaping the dissemin-ation reuse and meaning of the designs they con-veyed (Stijnman 2012 Maidment 2013) Tounderstand these changes scholars have so farsampled small hand-crafted collections of imagesan approach repeated in the fields of art and culturalhistory (Donald 1996 Thomas 2004) Yet digitalsources allow us to study these changes with a muchlarger sample to use visual content as well as meta-data to grapple with past phenomena at scale
Finleyrsquos research focuses on the digital imagesfrom the same 60000-book data set our projectuses The research addresses questions such asHow did changes in image techniques and the sizeof images map onto the different genres over timeWhat do quantitative findings reveal about the
changing meanings of images from one genre tothe next How do the findings made possibleusing digital humanities techniques and digitalsources compare to those using traditional methodsand small hand-crafted collections
To support these research questions we queriedthe book XML to extract the coordinates of theboundary boxes put around each area the OCRprocess defined as an image The resulting deriveddata lists the title author place of publication andreference number for each book For each of the 1million images in these books the derived data liststhe page number it appears on x-position of thetop left corner y-position of the top left corner itswidth its height and its overall size as a percentageof the page We then took two approaches to turnFinleyrsquos research questions into computationalqueries
First we used the data derived from the HPC togenerate a graph of the instances of images by theirsize as a percentage of the page over time (Fig 2)This enabled Finley to observe the dominance of fullpage and very small images (lt15 of the page)between the 1750s and 1810s after which timemdashdriven by novel deployment of woodcuts andlithographs in booksmdashthe range of figure sizes
Fig 2 A search for figures between 1750 and 1850plotted according to the size of each figure in relationto the size of the page
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 5 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
diversified Although the graph is not normalized bythe number of images in the data set for each yearand is therefore dominated by the greater volume ofbooks in the data set after 1800 it has proved auseful reference and a new way into the macro pat-terns of book illustration
Secondly we wrote a script that could be runlocally in R to create graphs based on image datafor single books Plotting the page number on thex-axis and figures as a percentage of the page on they-axis the script generated a visual representation ofthe size of illustrations in a book Finley selectedbooks for analysis to observe patterns in the use ofillustrations in books on history geology and top-ography (the subjects of his doctoral research) Here(Fig 3) we see this for the 1817 A new and completeSystem of Modern Geography a two-volume workpublished in Newcastle upon Tyne by Mackenzie(1817) The discrepancies in use of illustrations be-tween the two volumes took Finley back to thephysical books to assess how the placement andsize of images changed the reading experience be-tween volumes and to compare the findings withsimilar multi-volume works
Subsequent to the project Finley has continuedto use the project data and scripts in his researchFor example he has used the image location data toplot and compare the average position images inbooks This has underscored the value of generatingderived data that can be used locally by a researcheroutside the context of a HPC facility and a fundedproject
5 Infrastructuralrecommendations
From a technical perspective this pilot highlightedvarious sticking points when using infrastructuredeveloped predominantly for scientific researchThe combined data input and output volumeundertaken during our work (less than 300 GB) isonly moderately large by comparison to the scien-tific data sets UCL RITS usually encounters for al-though there are shared assumptions betweenresearch infrastructures (adoption of technicalstandards and the sharing of tools approachesand research outputs (Wynne 2015)) most of theUKrsquos university eScience15 infrastructure has beenconstructed specifically to run scientific and engin-eering simulations not for search and analysis of thetypes of heterogeneous data sets we see emanatingfrom cultural heritage institutions We had a largetextual input (224 GB) a simple calculation and asmall output summary of only a few KB By com-parison the typical engineering simulationaddresses moderately sized numerical input dataruns a long complicated calculation and producesa large output (multiple TBs) The average data sizeof project using the UCL data storage service is44 TB (Hetherington 2017) For example thework of the UCL Centre for ComputationalScience16 on brain blood flow simulations takes aninput file of around 1 GB and for a full productionsimulation recording a snapshot once every 200time steps produces 20 TB of output (Groen etal 2013) Poor uptake in the Arts andHumanities (Atkins et al 2010 Voss et al 2010)has meant that these computational systems havenot been optimized for Arts and Humanities work-loads The file system and network configuration of
0
25
50
75
100
7505002500
Imag
e si
ze (
of p
age)
Page number
Illustration Area by Page for Book ID 002322267
002322267_1 002322267_2
Fig 3 A search for figures in A new and complete Systemof Modern Geography two volumes (Mackenzie 1817)plotted by page (x-axis) percentage of page the eachfigure occupies (y-axis) and separated by volume
M Terras et al
6 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
LegionmdashUCL RITSrsquos centrally funded resource forrunning complex and large computational scientificqueries across a large number of coresmdashdid notmatch the way that the data set in question wasstructured (a large number of small zipped XMLfiles)
The complexities associated with redeployingarchitectures designed to work with scientific data(massive yet very structured) to the processing ofhumanities data (not massive but more unstruc-tured) should not be understated and are a majorfinding of this project Relevant libraries (such as anefficient XML processor) were needed to be installedand optimized for the hardware Also the dataneeded to be transformed to a structure that the par-allel file system (Lustre) could address efficiently(that is fewer larger files) We found that the archi-tecture at UCL which was configured for effectivecompute of scientific data was inputoutput limitedfor our processing requirements rather than compu-tationally limited Understanding the needs of ouruser community has already fed into the procure-ment and development of HPC facilities at UCL toensure that the systemsmdashwhich are available to allresearchersmdashcan deal with the variety and type ofdata that digital humanists wish to analyse in future
Best practice recommendations for similar pro-jects emerged from this work the need to buildmultiple derived data sets (counts of books andwords per year words and pages per book etc) tonormalize results and maintain statistical validitythe necessity of documenting decisions taken whenprocessing data and metadata and the value ofhaving fixed definable data for researchers to ex-plain results in relation to (and in turn the risksassociated with iterating data sets) We also dis-covered that a core set of four or five queries gavemost of the humanities researchers the type of in-formation they required to take a subset of dataaway to process effectively themselves searches forall variants of a word searches that return keywordsin context traced over time NOT searches for aword or phrase that ignored another word orphrase searches for a word when in close proximityto a second word and searches based on imagemetadata It is the subset of the data set that mosthumanities scholars required and were happy to be
presented with for further analysis (with most re-searchers wishing to see their search term in contextpresented with the complete page of the text it wasfound within to allow informed understanding)
A main finding of this pilot was given mosthumanities researchers have a research problemthat can be facilitated by a standard set of queriesacross large-scale textual data that it would be moreefficient to train a focussed group of service pro-viders to be able to generate the results needed byresearchers than providing widespread training ofhumanities academics in this area HigherEducation already employs librarians to assist insearching and training for searching (informationliteracy) and providing this professional groupwith adjustable lsquorecipesrsquo for defined computationalqueries and background training on their use wouldsituate access to infrastructure in the resource towhich humanists already turn for assistancemdashtheirsubject librarianmdashand thereby normalize such com-putational work within the general humanitiesworkflow17 In turn research software engineerscould be invoked as collaborators for their expertisesuch as for developing more complex searchesbeyond the basic recipes rather than having torepeat the defined searches across data for differentresearchers which would allow limited resources tobe used efficiently and to build on existing frame-works of support from both the library and comput-ing services
Given issues in resourcing such facilities at everyUniversity it may be more efficient for multipleHigher Education Institutions to support a specialistservice perhaps under the umbrella of the likes ofJisc Historical Texts (httphistoricaltextsjiscacuk)or national or legal deposit libraries Expertise andapproaches if not the service itself could also befacilitated through the likes of Digital ResearchInfrastructure for the Arts and Humanities(DARIAH) (httpwwwdariaheu) and CommonLanguage Resources and Technology Infrastructure(CLARIN) (httpswwwclarineu) However localsupport for researchers wishing to utilize existingeScience technologies within the Higher Educationsector should be possible Such support enables Artsand Humanities researchers to develop ongoingmutually beneficial relationships with research
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 7 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
computing within their own institution This inturn can encourage other researchers to use theseresources (rather than them being only available as aspecialist service which users have to seek out)Research computing infrastructure across the uni-versity sector will not meet the needs of ArtsHumanities and Social Sciences researchers unlessacademics in these fields becomes active users ofthe systems and their requirements can be takeninto account going forward
6 Conclusion
We successfully mounted large-scale humanitiesdata on HPC University infrastructure in an inter-disciplinary project that required input from manyprofessionals to aid the humanities scholars intheir research tasks The collaborative approachwe undertook in this project is labour-intensiveand does not scale This should not however dis-courage the sector from taking this work forwardWe found that many research questions can beexpressed with similar computational queriesalbeit with parameters adjusted to suit We recom-mend therefore that Higher EducationInstitutions or HEI clusters looking to build cap-acity for enabling complex analysis of large-scaledigital collections by their non-computationallytrained humanities researchers should considerthe following activities
(1) Invest in research software engineer capacityto deploy and maintain openly licensed large-scale digital collections from across the GLAMsector to facilitate research in the arts huma-nities and social and historical sciences
(2) Invest in training library staff to run theseinitial queries in collaboration with huma-nities faculty to support work with subsetsof data that are produced and to documentand manage resulting code and derived data
Our pilot project demonstrates that there are atpresent too many technical hurdles for most indi-viduals in the arts and humanities to consider ana-lysing large-scale open data sets Those hurdles canbe removed with initial help in ingest and
deployment of the data and the provision of spe-cific structured training and support which willallow humanities researchers to get to a subset ofuseful data they can comfortably and more simplyprocess themselves without the need for extensivesupport While we together with our partners haveplans to continue expanding the range and depth ofresearch carried out on our chosen data set thisproject has signposted many of the barriers to en-courage greater uptake of lsquobig datarsquo research acrossthe Arts and Humanities These findings should beof use to researchers wishing to use comparableapproaches and to service providers in researchcomputing aiming to encourage the use of sharedcomputational facilities by the Arts and Humanitiescommunity
AcknowledgementsThis project was funded by a pilot grant from theJisc Research Data Spring (JISC 3548) between Apriland July 2015 See httpswwwjiscacukrdpro-jectsresearch-data-spring for further details Theauthors thank our colleagues in the British Libraryand UCL RITS (particularly Clare Gryce) for theirsupport Geoffrey Rockwell and Stefan Sinclair fordiscussions on the current limitations of text ana-lysis for individual humanities scholars and Scott BWeingart for his assistance
ReferencesAtkins D E Borgman C L Bindhoff N Ellisman
M Felman S Foster I and Heck A (2010)
lsquolsquoRCUK Review of e-Science 2009rsquorsquo Research Councils
UK httpswwwepsrcacuknewseventspubsrcuk-review-of-e-science-2009-building-a-uk-foundation-
for-the-transformative-enhancement-of-research-and-
innovation
Bates M J (1996) The Getty end-user online searching
project in the humanities report no 6 overview and
conclusions College and Research Libraries 57(6)514ndash23
Bedi S and Walde C (2016) Transforming roles
Canadian academic librarians embedded in faculty re-
search projects College and Research Libraries http
M Terras et al
8 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
crlacrlorgcontentearly20160322crl16-871
abstract
Bradigan P S and Mularski C A (1989) End-user
searching in a medical school curriculum an evaluated
modular approach Bulletin of the Medical Library
Association 77(4) 348ndash56
Brettle A Maden-Jenkins M Anderson L McNally
R Pratchett T Tancock J Thornton D and
Webb A (2011) Evaluating clinical librarian services
A systematic review Health Information and Libraries
Journal 28(1) 3ndash22
Burke J and Tumbleson B (2016) LMS embedded li-
brarianship and the educational role of librarians
Library Technology Reports 52(2) 5ndash9
Chadwick E (1842) Report on the sanitary condition of
the labouring population of Great Britain supplemen-
tary report on the results of special inquiry into the
practice of interment in towns (Vol 1) HM
Stationery Office
Donald D (1996) The Age of Caricature Satirical Prints
in the Reign of George III New Haven Published for the
Paul Mellon Centre for Studies in British Art by Yale
University Press
Farber M and Shoham S (2002) Users end-users
and end-user searchers of online information a his-
torical overview Online Information Review 26(2)
92ndash100
Feliu V and Frazer H (2012) Embedded librarians
teaching legal research as a lawyering skill Journal of
Legal Education 61(4) 540ndash59
Groen D Hetherington J Carver H B Nash R
W Bernabeu M O and Coveney P V (2013)
Analysing and modelling the performance of the
HemeLB lattice-Boltzmann simulation environ-
ment Journal of Computational Science 4(5) 412ndash
22
Hetherington J (2017) Question about data capacity
and use at UCL Response to Melissa Terras via email
27 January 2017
Hettrick S (2016) A not-so-brief history of Research
Software Engineers Software Sustainability Institute
httpswwwsoftwareacukblog2016-08-19-not-so-
brief-history-research-software-engineers
Huber M (2007) The Old Bailey Proceedings 1674-1834
Evaluating and annotating a corpus of 18th - and 19th-
century spoken English Studies in Variation Contacts and
Change in English 1 Annotating Variation and Change
httpwwwhelsinkifivariengseriesvolumes01huber
Hughes A (2009) Higher Education in a Web 20 WorldJISC httpwwwwebarchiveorgukwaybackarchive20140614042502httpwwwjiscacukpublicationsgeneralpublications2009heweb2aspx
Ikeda N R and Schwartz D G (1992) Impact ofend-user training on Pharmacy students a four-yearfollow-up study Bulletin of the Medical LibraryAssociation 80(2) 124ndash30
Lancaster F W Elzy C Zeter M J Metzler L andLow Y M (1994) Searching databases on CD-ROMcomparison of the results of end-user searching withresults from two modes of searching by skilled inter-mediaries RQ 33(3) 370ndash86
Leetaru K (2015) History as Big Data 500 Years of BookImages and Mapping Millions of Books Forbes Techhttpwwwforbescomsiteskalevleetaru20150916history-as-big-data-500-years-of-book-images-and-mapping-millions-of-books
Ludwig L Mixter J K and Emanuele M A (1988)User attitudes toward end-user literature searchingBulletin of the Medical Library Association 76(1)7ndash13
Lui A (2015) Data Collections and Datasets Curated byAlan Liu httpdhresourcesforprojectbuildingpbworkscomwpage69244469Data20Collections20and20Datasets
Mackenzie E (1817) A new and complete system ofmodern geography containing an accurate delineationof the world 2 Vols Printed and published byMackenzie and Dent Newcastle-upon-Tyne
Mahony S and Pierazzo E (2012) Teaching skills orteaching methodology In Hirsch B D (ed) DigitalHumanities Pedagogy Practices Principles and PoliticsOpen Book Publishers httpwwwopenbookpubl-isherscomproduct161digital-humanities-pedagogyndashpracticesndashprinciples-and-politics
Maidment B (2013) Comedy Caricature and the SocialOrder 1820-1850 Manchester Manchester UniversityPress
Markey K (2007a) Twenty-five years of end-user search-ing Part 1 research findings Journal of the AmericanSociety for Information Science and Technology 58(8)1071ndash81
Markey K (2007b) Twenty-five years of end-usersearching Part 2 future research directions Journal ofthe American Society for Information Science andTechnology 58(8) 1123ndash30
Murray T (2016)The forecast for special librariesJournal of Library Administration 56(2) 188ndash98
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 9 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
Rethlefsen M Farrell A Osterhaus Trzasko L and
Brigham T (2015) Librarian co-authors correlated
with higher quality reported search strategies in general
internal medicine systematic reviews Journal of Clinical
Epidemiology 68(6) 617ndash26
Rockwell G (2017) Question about maximum capacity
of Voyant tools Response to Melissa Terras via email 27
January 2017
Rockwell G and Sinclair S 2016 Hermeneutica
Computer-Assisted Interpretation in the Humanities
Cambridge MIT Press
Roper T (2015) The impact of the clinical librarian a
review Journal of EAHIL 11(4) 19ndash22
Schmidt B (2014) Shipping maps and how states see
Sapping Attention Blog httpsappingattentionblogspot
couk201403shipping-maps-and-how-states-seehtml
Sinclair S (2017) Question about maximum capacity of
Voyant tools Response to Melissa Terras via email 27
January 2017
Smith D A Cordell R and Dillon E M (2013)
Infectious texts modeling text reuse in nineteenth-
century newspapers In IEEE International Conference
on Big Data October 6-9 2013 Santa Clara
California IEEE pp 86ndash94 101109
BigData20136691675
Smith D Cordell R and Mullen A (2015)
Computational methods for uncovering reprinted
texts in Antebellum newspapers American Literary
History 27(3)
Starr S S and Renford B L (1987) Evaluation of a
program to teach health professionals to search
MEDLINE Bulletin of the Medical Library Association
75(3) 193ndash201
Stijnman A (2012) Engraving and Etching 1400-2000 A
History of the Development of Manual Intaglio
Printmaking Processes London Houten Netherlands
Archetype Publications in association with HES and
DE GRAAF Publishers
Straus S Eisinga A and Sackett D (2016) What
drove the evidence cart Bringing the library to the
bedside Journal of the Royal Society of Medicine
109(6) 241ndash7
Streipe T and Talley M (2013) Embedded librarian-
ship In Kroski E (ed) Law Librarianship in the Digital
Age Plymouth Scarecrow pp 13ndash30
Terras M (2009) Potentials and Problems in Applying High
Performance Computing for Research in the Arts and
Humanities Researching e-Science Analysis of Census
Holdings Digital Humanities Quarterly httpwwwdigi-
talhumanitiesorgdhqvol34000070000070html
Terras M (2015) Opening access to collections the
making and using of open digitised cultural content
Online Information Review 39(5) 733ndash52 httpwww
emeraldinsightcomdoifull101108OIR-06-2015-0193
Thomas J (2004) Pictorial Victorians The Inscription of
Values in Word and Image Athens Ohio University
Press
Voss A Asgari-Targhi M Procter R and
Fergusson D (2010) Adoption of e-Infrastructure
services configurations of practice Philosophical
Transactions Series A Mathematical Physical and
Engineering Sciences 368 4161ndash76 doi 101098
rsta20100162
Wall A J (1893) Asiatic cholera its history pathology
and modern treatment London H K Lewis
Wittek S Sinclair S and Milner M (2015) DREaM
Distant Reading Early Modernity Digital Humanities
2015 Global Digital Humanities Sydney Australia 29
Junendash3 July 2015 httpdh2015orgabstractsxml
WITTEK_Stephen_DREaM__Distant_Reading_Early_
ModerWITTEK_Stephen_DREaM__Distant_Reading_
Early_Modernityhtml
Wynne M (2013) The role of Clarin in digital trans-
formations in the humanities International Journal of
Humanities and Arts Computing 7(1-2) 89
Wynne M (2015) Big Data and Digital Transformations in
the Humanities are we there yet Textual Digital
Humanities and Social Sciences Data Aberdeen 21ndash22
September 2015 httpwwwslidesharenetmartinwynne
big-data-and-digital-transformations-in-the-humanities
Notes1 httpswwwbluksubjectsdigital-scholarship2 httpwwwuclacukdh3 httpwwwbartlettuclacukcasa4 httpwwwuclacukisdservicesresearch-it5 httpopenglamorg an initiative to promote free
and open access to digital cultural heritage data sets6 For example Mining the History of Medicine (http
nactemacukhom) or Language of the State of the
Union (httpwwwtheatlanticcompoliticsarchive
201501the-language-of-the-state-of-the-union
384575)7 For example CLARIN (httpclarineu) and
DARIAH (httpswwwdariaheu)
M Terras et al
10 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
8 The British Library has various digital data sets
including (but not limited to) 7 million pages of his-
toric newspapers 1 million out of copyright book il-
lustrations 100000s of scientific articles text from
over 60000 books 1000s of UK theses and various
digitized medieval manuscripts We chose here just
one of its large-scale data sets to work with in this
pilot phase For the terms under which the British
Library makes collections available see httpwww
blukaboutustermscopyright9 The books cover a wide range of subject areas includ-
ing philosophy history poetry and literature most
are in English For a full list of metadata for this book
collection see doi 1021250DB2110 httpswikircuclacukwikiLegion just one of the
HPC facilities available at UCL for researchers see
httpwwwuclacukisdservicesresearch-itresearch-
computing11 It is difficult to put a figure on what a manageable text
file size would be for a researcher if we take that as
being able to search through it on their own machine
without further technical assistance This is dependent
on access to processing power which varies consider-
ably between desktop and laptop computers and the
complexity of the search the researcher is intending to
carry out as well as the format and granularity of the
source text Most humanities researchers should now
be able to comfortably manipulate text files of around
100 MB if they have access to a modern machine the
amount of text those files will contain is dependent on
how complex the file formats are Programs and tools
available to assist in text analysis include Voyant Tools
(httpsvoyant-toolsorg) and R (httpswwwr-pro-
jectorg) The upper limit to using server-based
Voyant is 20 MB of text which should correlate to
around 1 million words depending on format
(Sinclair 2017) Voyant Server can be downloaded
and used locally with an upper limit of around
100 MB on a standard PCMac (Rockwell 2017)
Indexing engines such as Lucene (httplucene
apacheorg) can search larger amounts of text how-
ever visualization of results will require further pro-
cessing and can be complex For further exploration of
lsquocomputer-assisted interpretation in the humanitiesrsquo
(particularly using Voyant) see Rockwell and
Sinclair (2016) and an example of the approach of
generating manageable smaller corpora from a large-
scale Early Modern Sources can be found in Wittek et
al (2015)12 All code data visualizations and other outputs from
this pilot project are freely available at httpsgithub
comUCL-dataspring13 httpswwwuclacukdispeopleoliverdukewilliams14 httpswwwshefacukhistoryresearchstudentswil-
liam-finley15 For more on the UKrsquos eScience infrastructure see the
work of the eScience Institute httpwwwesiacuk
Plan-Europe is the Platform of National eScience
Centres in Europe (httpplan-europeeu) In the
USA the equivalent of eScience is known as
Cyberinfrastructure see the National Science
Foundationrsquos guides httpwwwnsfgovdivindex
jspdivfrac14ACI
16 httpccschemuclacuk17 This approach is similar to that employed in the 1980s
when computerized searching of specialist databases
was first available (Farber and Shoham 2002
Markey 2007a 2007b) At that time evaluations car-
ried out on programs to cascade train end users via
librarians reported increased user uptake and effi-
ciency as compared to direct end-user training by
database providers (see Starr and Renford 1987
Bradigan and Mularski 1989 Ikeda and Schwartz
1992 on the library-based training of medical staff to
use MEDLINE) Significantly studies showed that
while end users wanted to be able to search databases
themselves they also desired access to searches
mediated by librarians andor a team approach to
search utilizing the end userrsquos specific knowledge of
the language and jargon of their discipline and the
information professionalrsquos understanding of
BOOLEAN and other advanced search techniques
(Ludwig et al 1988 Lancaster et al 1994 Bates
1996) The move towards embedded librarians in
law practices (Feliu and Frazer 2012 Streipe and
Talley 2013 Murray 2016) and clinical librarians in
hospitals (Brettle et al 2011 Roper 2015 Straus et
al 2016) can be seen to be a modern iteration of a
team approach to search In Higher Education subject
librarians are well-positioned to form part of a research
team using computational methods andor provide
information literacy training that upskills humanists
in computational methods (Rethlefsen et al 2015
Bedi and Walde 2016 Burke and Tumbleson 2016)
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 11 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
straightforward and resulted in a text file being de-livered to the Humanities Scholars which they couldthen lsquoclose readrsquo to analyse each instance of thesearch term within a given page of the book in thecorpus Analysis in this case entails finding instancesof the search term in question however there arefurther possibilities that can interrogate the data setfurther in procedurally and methodologically novelways We present here two more ambitious casestudies that allowed for further visualization andanalysis
41 Case Study 1 history of medicineDuke-Williams is a senior lecturer in DigitalInformation Studies in the Department ofInformation Studies at UCL13 and his researchinterests include the presentation of spatial dataand dissemination of demographic data and thepast present and future of demographic data
capture in the UK Visualization of these kinds ofdata can be used to explore issues around the spreadof diseases and the research questions were howdoes the occurrence of diseases in published litera-ture compare to known epidemics in the 19th cen-tury Can we see any correlation between theoccurrence of infectious diseases in society and ref-erence to these diseases in both fiction and non-fiction
Variations in the number of mentions of cholera(Fig 1 continuous black line) were compared torecorded epidemics (shaded bars on Fig 1)A sharp rise in mentions coincides with the firstcholera epidemic in the UK of 1831ndash32 a similarbut less pronounced rise is coincident with the1848ndash49 epidemic A more volatile pattern of men-tions is observed after this point subsequent spikesmay be associated with epidemics within andbeyond the UK or may be less directly related to
0
05
1
15
2
25
3
1790 1810 1830 1850 1870 1890 1910
Num
ber
of re
fere
nces
per
100
0 w
ords
Date of Publication
Disease References
year_cholera_epidemics Whooping Cholera Consumpon Measles
Fig 1 A search for mentions of various infectious diseases (cholera whooping cough consumption and measles)across the 60000-book data set We compared the profound spikes for cholera in the data set with known dataregarding epidemics in the UK (Chadwick 1842 Wall 1893) which appear as the bars on the graph showing arelationship between the first major UK outbreak of cholera and its appearance within the written record of thetime (in 1831ndash32) and again with the second UK epidemic (1848ndash49) Later outbreaks (1853ndash54 and 1863) do notsee this same correlation There are further pronounced spikes for mentions of cholera in the1870s and 1880s these arenot associated with UK epidemics but there were outbreaks in the USA and elsewhere Identifying the texts that refer tothese outbreaks allows us to look more closely at these clusters and to understand the relationship between publichealth epidemiology and the published historical record
M Terras et al
4 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
disease incidence Identifying the range and type oftexts (whether epidemiological reports or worksaimed at a wider audience) may help to informand understand the cultural response to disease
This work opens up possibilities for our under-standing of trends in both fiction and non-fictionand could be linked into further data sets (for ex-ample of digitized historical newspaper data)In the case of our pilot project it demonstratedthat we could graph and visualize searches basedon the corpus to present overviews that wereuseful to our researcher but only in conjunctionwith both our research software engineer and ourinformation visualization expert this service thenmdashas a result of the person hours requiredmdashdoes notscale in practice demonstrating both the potentialin the data set and the current limited opportunitieshistorians epidemiologists and historians of sci-ence have to generate such visualizations fromopen-licenced data sets
42 Case Study 2 the history of imagesFinley is a doctoral candidate on the British Libraryand University of Sheffield Collaborative PhDStudentship lsquoThe Printed Image 1750ndash1850 towardsa Digital History of Printed Book Illustrationrsquo14
Between 1750 and 1850 changes in printing tech-nology enabled several kinds of image to proliferateand for image and text to be brought together innovel and unexpected ways Existing printing tech-nologiesmdashsuch as woodcutsmdashcontinued alongsidenew printing technologies shaping the dissemin-ation reuse and meaning of the designs they con-veyed (Stijnman 2012 Maidment 2013) Tounderstand these changes scholars have so farsampled small hand-crafted collections of imagesan approach repeated in the fields of art and culturalhistory (Donald 1996 Thomas 2004) Yet digitalsources allow us to study these changes with a muchlarger sample to use visual content as well as meta-data to grapple with past phenomena at scale
Finleyrsquos research focuses on the digital imagesfrom the same 60000-book data set our projectuses The research addresses questions such asHow did changes in image techniques and the sizeof images map onto the different genres over timeWhat do quantitative findings reveal about the
changing meanings of images from one genre tothe next How do the findings made possibleusing digital humanities techniques and digitalsources compare to those using traditional methodsand small hand-crafted collections
To support these research questions we queriedthe book XML to extract the coordinates of theboundary boxes put around each area the OCRprocess defined as an image The resulting deriveddata lists the title author place of publication andreference number for each book For each of the 1million images in these books the derived data liststhe page number it appears on x-position of thetop left corner y-position of the top left corner itswidth its height and its overall size as a percentageof the page We then took two approaches to turnFinleyrsquos research questions into computationalqueries
First we used the data derived from the HPC togenerate a graph of the instances of images by theirsize as a percentage of the page over time (Fig 2)This enabled Finley to observe the dominance of fullpage and very small images (lt15 of the page)between the 1750s and 1810s after which timemdashdriven by novel deployment of woodcuts andlithographs in booksmdashthe range of figure sizes
Fig 2 A search for figures between 1750 and 1850plotted according to the size of each figure in relationto the size of the page
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 5 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
diversified Although the graph is not normalized bythe number of images in the data set for each yearand is therefore dominated by the greater volume ofbooks in the data set after 1800 it has proved auseful reference and a new way into the macro pat-terns of book illustration
Secondly we wrote a script that could be runlocally in R to create graphs based on image datafor single books Plotting the page number on thex-axis and figures as a percentage of the page on they-axis the script generated a visual representation ofthe size of illustrations in a book Finley selectedbooks for analysis to observe patterns in the use ofillustrations in books on history geology and top-ography (the subjects of his doctoral research) Here(Fig 3) we see this for the 1817 A new and completeSystem of Modern Geography a two-volume workpublished in Newcastle upon Tyne by Mackenzie(1817) The discrepancies in use of illustrations be-tween the two volumes took Finley back to thephysical books to assess how the placement andsize of images changed the reading experience be-tween volumes and to compare the findings withsimilar multi-volume works
Subsequent to the project Finley has continuedto use the project data and scripts in his researchFor example he has used the image location data toplot and compare the average position images inbooks This has underscored the value of generatingderived data that can be used locally by a researcheroutside the context of a HPC facility and a fundedproject
5 Infrastructuralrecommendations
From a technical perspective this pilot highlightedvarious sticking points when using infrastructuredeveloped predominantly for scientific researchThe combined data input and output volumeundertaken during our work (less than 300 GB) isonly moderately large by comparison to the scien-tific data sets UCL RITS usually encounters for al-though there are shared assumptions betweenresearch infrastructures (adoption of technicalstandards and the sharing of tools approachesand research outputs (Wynne 2015)) most of theUKrsquos university eScience15 infrastructure has beenconstructed specifically to run scientific and engin-eering simulations not for search and analysis of thetypes of heterogeneous data sets we see emanatingfrom cultural heritage institutions We had a largetextual input (224 GB) a simple calculation and asmall output summary of only a few KB By com-parison the typical engineering simulationaddresses moderately sized numerical input dataruns a long complicated calculation and producesa large output (multiple TBs) The average data sizeof project using the UCL data storage service is44 TB (Hetherington 2017) For example thework of the UCL Centre for ComputationalScience16 on brain blood flow simulations takes aninput file of around 1 GB and for a full productionsimulation recording a snapshot once every 200time steps produces 20 TB of output (Groen etal 2013) Poor uptake in the Arts andHumanities (Atkins et al 2010 Voss et al 2010)has meant that these computational systems havenot been optimized for Arts and Humanities work-loads The file system and network configuration of
0
25
50
75
100
7505002500
Imag
e si
ze (
of p
age)
Page number
Illustration Area by Page for Book ID 002322267
002322267_1 002322267_2
Fig 3 A search for figures in A new and complete Systemof Modern Geography two volumes (Mackenzie 1817)plotted by page (x-axis) percentage of page the eachfigure occupies (y-axis) and separated by volume
M Terras et al
6 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
LegionmdashUCL RITSrsquos centrally funded resource forrunning complex and large computational scientificqueries across a large number of coresmdashdid notmatch the way that the data set in question wasstructured (a large number of small zipped XMLfiles)
The complexities associated with redeployingarchitectures designed to work with scientific data(massive yet very structured) to the processing ofhumanities data (not massive but more unstruc-tured) should not be understated and are a majorfinding of this project Relevant libraries (such as anefficient XML processor) were needed to be installedand optimized for the hardware Also the dataneeded to be transformed to a structure that the par-allel file system (Lustre) could address efficiently(that is fewer larger files) We found that the archi-tecture at UCL which was configured for effectivecompute of scientific data was inputoutput limitedfor our processing requirements rather than compu-tationally limited Understanding the needs of ouruser community has already fed into the procure-ment and development of HPC facilities at UCL toensure that the systemsmdashwhich are available to allresearchersmdashcan deal with the variety and type ofdata that digital humanists wish to analyse in future
Best practice recommendations for similar pro-jects emerged from this work the need to buildmultiple derived data sets (counts of books andwords per year words and pages per book etc) tonormalize results and maintain statistical validitythe necessity of documenting decisions taken whenprocessing data and metadata and the value ofhaving fixed definable data for researchers to ex-plain results in relation to (and in turn the risksassociated with iterating data sets) We also dis-covered that a core set of four or five queries gavemost of the humanities researchers the type of in-formation they required to take a subset of dataaway to process effectively themselves searches forall variants of a word searches that return keywordsin context traced over time NOT searches for aword or phrase that ignored another word orphrase searches for a word when in close proximityto a second word and searches based on imagemetadata It is the subset of the data set that mosthumanities scholars required and were happy to be
presented with for further analysis (with most re-searchers wishing to see their search term in contextpresented with the complete page of the text it wasfound within to allow informed understanding)
A main finding of this pilot was given mosthumanities researchers have a research problemthat can be facilitated by a standard set of queriesacross large-scale textual data that it would be moreefficient to train a focussed group of service pro-viders to be able to generate the results needed byresearchers than providing widespread training ofhumanities academics in this area HigherEducation already employs librarians to assist insearching and training for searching (informationliteracy) and providing this professional groupwith adjustable lsquorecipesrsquo for defined computationalqueries and background training on their use wouldsituate access to infrastructure in the resource towhich humanists already turn for assistancemdashtheirsubject librarianmdashand thereby normalize such com-putational work within the general humanitiesworkflow17 In turn research software engineerscould be invoked as collaborators for their expertisesuch as for developing more complex searchesbeyond the basic recipes rather than having torepeat the defined searches across data for differentresearchers which would allow limited resources tobe used efficiently and to build on existing frame-works of support from both the library and comput-ing services
Given issues in resourcing such facilities at everyUniversity it may be more efficient for multipleHigher Education Institutions to support a specialistservice perhaps under the umbrella of the likes ofJisc Historical Texts (httphistoricaltextsjiscacuk)or national or legal deposit libraries Expertise andapproaches if not the service itself could also befacilitated through the likes of Digital ResearchInfrastructure for the Arts and Humanities(DARIAH) (httpwwwdariaheu) and CommonLanguage Resources and Technology Infrastructure(CLARIN) (httpswwwclarineu) However localsupport for researchers wishing to utilize existingeScience technologies within the Higher Educationsector should be possible Such support enables Artsand Humanities researchers to develop ongoingmutually beneficial relationships with research
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 7 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
computing within their own institution This inturn can encourage other researchers to use theseresources (rather than them being only available as aspecialist service which users have to seek out)Research computing infrastructure across the uni-versity sector will not meet the needs of ArtsHumanities and Social Sciences researchers unlessacademics in these fields becomes active users ofthe systems and their requirements can be takeninto account going forward
6 Conclusion
We successfully mounted large-scale humanitiesdata on HPC University infrastructure in an inter-disciplinary project that required input from manyprofessionals to aid the humanities scholars intheir research tasks The collaborative approachwe undertook in this project is labour-intensiveand does not scale This should not however dis-courage the sector from taking this work forwardWe found that many research questions can beexpressed with similar computational queriesalbeit with parameters adjusted to suit We recom-mend therefore that Higher EducationInstitutions or HEI clusters looking to build cap-acity for enabling complex analysis of large-scaledigital collections by their non-computationallytrained humanities researchers should considerthe following activities
(1) Invest in research software engineer capacityto deploy and maintain openly licensed large-scale digital collections from across the GLAMsector to facilitate research in the arts huma-nities and social and historical sciences
(2) Invest in training library staff to run theseinitial queries in collaboration with huma-nities faculty to support work with subsetsof data that are produced and to documentand manage resulting code and derived data
Our pilot project demonstrates that there are atpresent too many technical hurdles for most indi-viduals in the arts and humanities to consider ana-lysing large-scale open data sets Those hurdles canbe removed with initial help in ingest and
deployment of the data and the provision of spe-cific structured training and support which willallow humanities researchers to get to a subset ofuseful data they can comfortably and more simplyprocess themselves without the need for extensivesupport While we together with our partners haveplans to continue expanding the range and depth ofresearch carried out on our chosen data set thisproject has signposted many of the barriers to en-courage greater uptake of lsquobig datarsquo research acrossthe Arts and Humanities These findings should beof use to researchers wishing to use comparableapproaches and to service providers in researchcomputing aiming to encourage the use of sharedcomputational facilities by the Arts and Humanitiescommunity
AcknowledgementsThis project was funded by a pilot grant from theJisc Research Data Spring (JISC 3548) between Apriland July 2015 See httpswwwjiscacukrdpro-jectsresearch-data-spring for further details Theauthors thank our colleagues in the British Libraryand UCL RITS (particularly Clare Gryce) for theirsupport Geoffrey Rockwell and Stefan Sinclair fordiscussions on the current limitations of text ana-lysis for individual humanities scholars and Scott BWeingart for his assistance
ReferencesAtkins D E Borgman C L Bindhoff N Ellisman
M Felman S Foster I and Heck A (2010)
lsquolsquoRCUK Review of e-Science 2009rsquorsquo Research Councils
UK httpswwwepsrcacuknewseventspubsrcuk-review-of-e-science-2009-building-a-uk-foundation-
for-the-transformative-enhancement-of-research-and-
innovation
Bates M J (1996) The Getty end-user online searching
project in the humanities report no 6 overview and
conclusions College and Research Libraries 57(6)514ndash23
Bedi S and Walde C (2016) Transforming roles
Canadian academic librarians embedded in faculty re-
search projects College and Research Libraries http
M Terras et al
8 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
crlacrlorgcontentearly20160322crl16-871
abstract
Bradigan P S and Mularski C A (1989) End-user
searching in a medical school curriculum an evaluated
modular approach Bulletin of the Medical Library
Association 77(4) 348ndash56
Brettle A Maden-Jenkins M Anderson L McNally
R Pratchett T Tancock J Thornton D and
Webb A (2011) Evaluating clinical librarian services
A systematic review Health Information and Libraries
Journal 28(1) 3ndash22
Burke J and Tumbleson B (2016) LMS embedded li-
brarianship and the educational role of librarians
Library Technology Reports 52(2) 5ndash9
Chadwick E (1842) Report on the sanitary condition of
the labouring population of Great Britain supplemen-
tary report on the results of special inquiry into the
practice of interment in towns (Vol 1) HM
Stationery Office
Donald D (1996) The Age of Caricature Satirical Prints
in the Reign of George III New Haven Published for the
Paul Mellon Centre for Studies in British Art by Yale
University Press
Farber M and Shoham S (2002) Users end-users
and end-user searchers of online information a his-
torical overview Online Information Review 26(2)
92ndash100
Feliu V and Frazer H (2012) Embedded librarians
teaching legal research as a lawyering skill Journal of
Legal Education 61(4) 540ndash59
Groen D Hetherington J Carver H B Nash R
W Bernabeu M O and Coveney P V (2013)
Analysing and modelling the performance of the
HemeLB lattice-Boltzmann simulation environ-
ment Journal of Computational Science 4(5) 412ndash
22
Hetherington J (2017) Question about data capacity
and use at UCL Response to Melissa Terras via email
27 January 2017
Hettrick S (2016) A not-so-brief history of Research
Software Engineers Software Sustainability Institute
httpswwwsoftwareacukblog2016-08-19-not-so-
brief-history-research-software-engineers
Huber M (2007) The Old Bailey Proceedings 1674-1834
Evaluating and annotating a corpus of 18th - and 19th-
century spoken English Studies in Variation Contacts and
Change in English 1 Annotating Variation and Change
httpwwwhelsinkifivariengseriesvolumes01huber
Hughes A (2009) Higher Education in a Web 20 WorldJISC httpwwwwebarchiveorgukwaybackarchive20140614042502httpwwwjiscacukpublicationsgeneralpublications2009heweb2aspx
Ikeda N R and Schwartz D G (1992) Impact ofend-user training on Pharmacy students a four-yearfollow-up study Bulletin of the Medical LibraryAssociation 80(2) 124ndash30
Lancaster F W Elzy C Zeter M J Metzler L andLow Y M (1994) Searching databases on CD-ROMcomparison of the results of end-user searching withresults from two modes of searching by skilled inter-mediaries RQ 33(3) 370ndash86
Leetaru K (2015) History as Big Data 500 Years of BookImages and Mapping Millions of Books Forbes Techhttpwwwforbescomsiteskalevleetaru20150916history-as-big-data-500-years-of-book-images-and-mapping-millions-of-books
Ludwig L Mixter J K and Emanuele M A (1988)User attitudes toward end-user literature searchingBulletin of the Medical Library Association 76(1)7ndash13
Lui A (2015) Data Collections and Datasets Curated byAlan Liu httpdhresourcesforprojectbuildingpbworkscomwpage69244469Data20Collections20and20Datasets
Mackenzie E (1817) A new and complete system ofmodern geography containing an accurate delineationof the world 2 Vols Printed and published byMackenzie and Dent Newcastle-upon-Tyne
Mahony S and Pierazzo E (2012) Teaching skills orteaching methodology In Hirsch B D (ed) DigitalHumanities Pedagogy Practices Principles and PoliticsOpen Book Publishers httpwwwopenbookpubl-isherscomproduct161digital-humanities-pedagogyndashpracticesndashprinciples-and-politics
Maidment B (2013) Comedy Caricature and the SocialOrder 1820-1850 Manchester Manchester UniversityPress
Markey K (2007a) Twenty-five years of end-user search-ing Part 1 research findings Journal of the AmericanSociety for Information Science and Technology 58(8)1071ndash81
Markey K (2007b) Twenty-five years of end-usersearching Part 2 future research directions Journal ofthe American Society for Information Science andTechnology 58(8) 1123ndash30
Murray T (2016)The forecast for special librariesJournal of Library Administration 56(2) 188ndash98
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 9 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
Rethlefsen M Farrell A Osterhaus Trzasko L and
Brigham T (2015) Librarian co-authors correlated
with higher quality reported search strategies in general
internal medicine systematic reviews Journal of Clinical
Epidemiology 68(6) 617ndash26
Rockwell G (2017) Question about maximum capacity
of Voyant tools Response to Melissa Terras via email 27
January 2017
Rockwell G and Sinclair S 2016 Hermeneutica
Computer-Assisted Interpretation in the Humanities
Cambridge MIT Press
Roper T (2015) The impact of the clinical librarian a
review Journal of EAHIL 11(4) 19ndash22
Schmidt B (2014) Shipping maps and how states see
Sapping Attention Blog httpsappingattentionblogspot
couk201403shipping-maps-and-how-states-seehtml
Sinclair S (2017) Question about maximum capacity of
Voyant tools Response to Melissa Terras via email 27
January 2017
Smith D A Cordell R and Dillon E M (2013)
Infectious texts modeling text reuse in nineteenth-
century newspapers In IEEE International Conference
on Big Data October 6-9 2013 Santa Clara
California IEEE pp 86ndash94 101109
BigData20136691675
Smith D Cordell R and Mullen A (2015)
Computational methods for uncovering reprinted
texts in Antebellum newspapers American Literary
History 27(3)
Starr S S and Renford B L (1987) Evaluation of a
program to teach health professionals to search
MEDLINE Bulletin of the Medical Library Association
75(3) 193ndash201
Stijnman A (2012) Engraving and Etching 1400-2000 A
History of the Development of Manual Intaglio
Printmaking Processes London Houten Netherlands
Archetype Publications in association with HES and
DE GRAAF Publishers
Straus S Eisinga A and Sackett D (2016) What
drove the evidence cart Bringing the library to the
bedside Journal of the Royal Society of Medicine
109(6) 241ndash7
Streipe T and Talley M (2013) Embedded librarian-
ship In Kroski E (ed) Law Librarianship in the Digital
Age Plymouth Scarecrow pp 13ndash30
Terras M (2009) Potentials and Problems in Applying High
Performance Computing for Research in the Arts and
Humanities Researching e-Science Analysis of Census
Holdings Digital Humanities Quarterly httpwwwdigi-
talhumanitiesorgdhqvol34000070000070html
Terras M (2015) Opening access to collections the
making and using of open digitised cultural content
Online Information Review 39(5) 733ndash52 httpwww
emeraldinsightcomdoifull101108OIR-06-2015-0193
Thomas J (2004) Pictorial Victorians The Inscription of
Values in Word and Image Athens Ohio University
Press
Voss A Asgari-Targhi M Procter R and
Fergusson D (2010) Adoption of e-Infrastructure
services configurations of practice Philosophical
Transactions Series A Mathematical Physical and
Engineering Sciences 368 4161ndash76 doi 101098
rsta20100162
Wall A J (1893) Asiatic cholera its history pathology
and modern treatment London H K Lewis
Wittek S Sinclair S and Milner M (2015) DREaM
Distant Reading Early Modernity Digital Humanities
2015 Global Digital Humanities Sydney Australia 29
Junendash3 July 2015 httpdh2015orgabstractsxml
WITTEK_Stephen_DREaM__Distant_Reading_Early_
ModerWITTEK_Stephen_DREaM__Distant_Reading_
Early_Modernityhtml
Wynne M (2013) The role of Clarin in digital trans-
formations in the humanities International Journal of
Humanities and Arts Computing 7(1-2) 89
Wynne M (2015) Big Data and Digital Transformations in
the Humanities are we there yet Textual Digital
Humanities and Social Sciences Data Aberdeen 21ndash22
September 2015 httpwwwslidesharenetmartinwynne
big-data-and-digital-transformations-in-the-humanities
Notes1 httpswwwbluksubjectsdigital-scholarship2 httpwwwuclacukdh3 httpwwwbartlettuclacukcasa4 httpwwwuclacukisdservicesresearch-it5 httpopenglamorg an initiative to promote free
and open access to digital cultural heritage data sets6 For example Mining the History of Medicine (http
nactemacukhom) or Language of the State of the
Union (httpwwwtheatlanticcompoliticsarchive
201501the-language-of-the-state-of-the-union
384575)7 For example CLARIN (httpclarineu) and
DARIAH (httpswwwdariaheu)
M Terras et al
10 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
8 The British Library has various digital data sets
including (but not limited to) 7 million pages of his-
toric newspapers 1 million out of copyright book il-
lustrations 100000s of scientific articles text from
over 60000 books 1000s of UK theses and various
digitized medieval manuscripts We chose here just
one of its large-scale data sets to work with in this
pilot phase For the terms under which the British
Library makes collections available see httpwww
blukaboutustermscopyright9 The books cover a wide range of subject areas includ-
ing philosophy history poetry and literature most
are in English For a full list of metadata for this book
collection see doi 1021250DB2110 httpswikircuclacukwikiLegion just one of the
HPC facilities available at UCL for researchers see
httpwwwuclacukisdservicesresearch-itresearch-
computing11 It is difficult to put a figure on what a manageable text
file size would be for a researcher if we take that as
being able to search through it on their own machine
without further technical assistance This is dependent
on access to processing power which varies consider-
ably between desktop and laptop computers and the
complexity of the search the researcher is intending to
carry out as well as the format and granularity of the
source text Most humanities researchers should now
be able to comfortably manipulate text files of around
100 MB if they have access to a modern machine the
amount of text those files will contain is dependent on
how complex the file formats are Programs and tools
available to assist in text analysis include Voyant Tools
(httpsvoyant-toolsorg) and R (httpswwwr-pro-
jectorg) The upper limit to using server-based
Voyant is 20 MB of text which should correlate to
around 1 million words depending on format
(Sinclair 2017) Voyant Server can be downloaded
and used locally with an upper limit of around
100 MB on a standard PCMac (Rockwell 2017)
Indexing engines such as Lucene (httplucene
apacheorg) can search larger amounts of text how-
ever visualization of results will require further pro-
cessing and can be complex For further exploration of
lsquocomputer-assisted interpretation in the humanitiesrsquo
(particularly using Voyant) see Rockwell and
Sinclair (2016) and an example of the approach of
generating manageable smaller corpora from a large-
scale Early Modern Sources can be found in Wittek et
al (2015)12 All code data visualizations and other outputs from
this pilot project are freely available at httpsgithub
comUCL-dataspring13 httpswwwuclacukdispeopleoliverdukewilliams14 httpswwwshefacukhistoryresearchstudentswil-
liam-finley15 For more on the UKrsquos eScience infrastructure see the
work of the eScience Institute httpwwwesiacuk
Plan-Europe is the Platform of National eScience
Centres in Europe (httpplan-europeeu) In the
USA the equivalent of eScience is known as
Cyberinfrastructure see the National Science
Foundationrsquos guides httpwwwnsfgovdivindex
jspdivfrac14ACI
16 httpccschemuclacuk17 This approach is similar to that employed in the 1980s
when computerized searching of specialist databases
was first available (Farber and Shoham 2002
Markey 2007a 2007b) At that time evaluations car-
ried out on programs to cascade train end users via
librarians reported increased user uptake and effi-
ciency as compared to direct end-user training by
database providers (see Starr and Renford 1987
Bradigan and Mularski 1989 Ikeda and Schwartz
1992 on the library-based training of medical staff to
use MEDLINE) Significantly studies showed that
while end users wanted to be able to search databases
themselves they also desired access to searches
mediated by librarians andor a team approach to
search utilizing the end userrsquos specific knowledge of
the language and jargon of their discipline and the
information professionalrsquos understanding of
BOOLEAN and other advanced search techniques
(Ludwig et al 1988 Lancaster et al 1994 Bates
1996) The move towards embedded librarians in
law practices (Feliu and Frazer 2012 Streipe and
Talley 2013 Murray 2016) and clinical librarians in
hospitals (Brettle et al 2011 Roper 2015 Straus et
al 2016) can be seen to be a modern iteration of a
team approach to search In Higher Education subject
librarians are well-positioned to form part of a research
team using computational methods andor provide
information literacy training that upskills humanists
in computational methods (Rethlefsen et al 2015
Bedi and Walde 2016 Burke and Tumbleson 2016)
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 11 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
disease incidence Identifying the range and type oftexts (whether epidemiological reports or worksaimed at a wider audience) may help to informand understand the cultural response to disease
This work opens up possibilities for our under-standing of trends in both fiction and non-fictionand could be linked into further data sets (for ex-ample of digitized historical newspaper data)In the case of our pilot project it demonstratedthat we could graph and visualize searches basedon the corpus to present overviews that wereuseful to our researcher but only in conjunctionwith both our research software engineer and ourinformation visualization expert this service thenmdashas a result of the person hours requiredmdashdoes notscale in practice demonstrating both the potentialin the data set and the current limited opportunitieshistorians epidemiologists and historians of sci-ence have to generate such visualizations fromopen-licenced data sets
42 Case Study 2 the history of imagesFinley is a doctoral candidate on the British Libraryand University of Sheffield Collaborative PhDStudentship lsquoThe Printed Image 1750ndash1850 towardsa Digital History of Printed Book Illustrationrsquo14
Between 1750 and 1850 changes in printing tech-nology enabled several kinds of image to proliferateand for image and text to be brought together innovel and unexpected ways Existing printing tech-nologiesmdashsuch as woodcutsmdashcontinued alongsidenew printing technologies shaping the dissemin-ation reuse and meaning of the designs they con-veyed (Stijnman 2012 Maidment 2013) Tounderstand these changes scholars have so farsampled small hand-crafted collections of imagesan approach repeated in the fields of art and culturalhistory (Donald 1996 Thomas 2004) Yet digitalsources allow us to study these changes with a muchlarger sample to use visual content as well as meta-data to grapple with past phenomena at scale
Finleyrsquos research focuses on the digital imagesfrom the same 60000-book data set our projectuses The research addresses questions such asHow did changes in image techniques and the sizeof images map onto the different genres over timeWhat do quantitative findings reveal about the
changing meanings of images from one genre tothe next How do the findings made possibleusing digital humanities techniques and digitalsources compare to those using traditional methodsand small hand-crafted collections
To support these research questions we queriedthe book XML to extract the coordinates of theboundary boxes put around each area the OCRprocess defined as an image The resulting deriveddata lists the title author place of publication andreference number for each book For each of the 1million images in these books the derived data liststhe page number it appears on x-position of thetop left corner y-position of the top left corner itswidth its height and its overall size as a percentageof the page We then took two approaches to turnFinleyrsquos research questions into computationalqueries
First we used the data derived from the HPC togenerate a graph of the instances of images by theirsize as a percentage of the page over time (Fig 2)This enabled Finley to observe the dominance of fullpage and very small images (lt15 of the page)between the 1750s and 1810s after which timemdashdriven by novel deployment of woodcuts andlithographs in booksmdashthe range of figure sizes
Fig 2 A search for figures between 1750 and 1850plotted according to the size of each figure in relationto the size of the page
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 5 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
diversified Although the graph is not normalized bythe number of images in the data set for each yearand is therefore dominated by the greater volume ofbooks in the data set after 1800 it has proved auseful reference and a new way into the macro pat-terns of book illustration
Secondly we wrote a script that could be runlocally in R to create graphs based on image datafor single books Plotting the page number on thex-axis and figures as a percentage of the page on they-axis the script generated a visual representation ofthe size of illustrations in a book Finley selectedbooks for analysis to observe patterns in the use ofillustrations in books on history geology and top-ography (the subjects of his doctoral research) Here(Fig 3) we see this for the 1817 A new and completeSystem of Modern Geography a two-volume workpublished in Newcastle upon Tyne by Mackenzie(1817) The discrepancies in use of illustrations be-tween the two volumes took Finley back to thephysical books to assess how the placement andsize of images changed the reading experience be-tween volumes and to compare the findings withsimilar multi-volume works
Subsequent to the project Finley has continuedto use the project data and scripts in his researchFor example he has used the image location data toplot and compare the average position images inbooks This has underscored the value of generatingderived data that can be used locally by a researcheroutside the context of a HPC facility and a fundedproject
5 Infrastructuralrecommendations
From a technical perspective this pilot highlightedvarious sticking points when using infrastructuredeveloped predominantly for scientific researchThe combined data input and output volumeundertaken during our work (less than 300 GB) isonly moderately large by comparison to the scien-tific data sets UCL RITS usually encounters for al-though there are shared assumptions betweenresearch infrastructures (adoption of technicalstandards and the sharing of tools approachesand research outputs (Wynne 2015)) most of theUKrsquos university eScience15 infrastructure has beenconstructed specifically to run scientific and engin-eering simulations not for search and analysis of thetypes of heterogeneous data sets we see emanatingfrom cultural heritage institutions We had a largetextual input (224 GB) a simple calculation and asmall output summary of only a few KB By com-parison the typical engineering simulationaddresses moderately sized numerical input dataruns a long complicated calculation and producesa large output (multiple TBs) The average data sizeof project using the UCL data storage service is44 TB (Hetherington 2017) For example thework of the UCL Centre for ComputationalScience16 on brain blood flow simulations takes aninput file of around 1 GB and for a full productionsimulation recording a snapshot once every 200time steps produces 20 TB of output (Groen etal 2013) Poor uptake in the Arts andHumanities (Atkins et al 2010 Voss et al 2010)has meant that these computational systems havenot been optimized for Arts and Humanities work-loads The file system and network configuration of
0
25
50
75
100
7505002500
Imag
e si
ze (
of p
age)
Page number
Illustration Area by Page for Book ID 002322267
002322267_1 002322267_2
Fig 3 A search for figures in A new and complete Systemof Modern Geography two volumes (Mackenzie 1817)plotted by page (x-axis) percentage of page the eachfigure occupies (y-axis) and separated by volume
M Terras et al
6 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
LegionmdashUCL RITSrsquos centrally funded resource forrunning complex and large computational scientificqueries across a large number of coresmdashdid notmatch the way that the data set in question wasstructured (a large number of small zipped XMLfiles)
The complexities associated with redeployingarchitectures designed to work with scientific data(massive yet very structured) to the processing ofhumanities data (not massive but more unstruc-tured) should not be understated and are a majorfinding of this project Relevant libraries (such as anefficient XML processor) were needed to be installedand optimized for the hardware Also the dataneeded to be transformed to a structure that the par-allel file system (Lustre) could address efficiently(that is fewer larger files) We found that the archi-tecture at UCL which was configured for effectivecompute of scientific data was inputoutput limitedfor our processing requirements rather than compu-tationally limited Understanding the needs of ouruser community has already fed into the procure-ment and development of HPC facilities at UCL toensure that the systemsmdashwhich are available to allresearchersmdashcan deal with the variety and type ofdata that digital humanists wish to analyse in future
Best practice recommendations for similar pro-jects emerged from this work the need to buildmultiple derived data sets (counts of books andwords per year words and pages per book etc) tonormalize results and maintain statistical validitythe necessity of documenting decisions taken whenprocessing data and metadata and the value ofhaving fixed definable data for researchers to ex-plain results in relation to (and in turn the risksassociated with iterating data sets) We also dis-covered that a core set of four or five queries gavemost of the humanities researchers the type of in-formation they required to take a subset of dataaway to process effectively themselves searches forall variants of a word searches that return keywordsin context traced over time NOT searches for aword or phrase that ignored another word orphrase searches for a word when in close proximityto a second word and searches based on imagemetadata It is the subset of the data set that mosthumanities scholars required and were happy to be
presented with for further analysis (with most re-searchers wishing to see their search term in contextpresented with the complete page of the text it wasfound within to allow informed understanding)
A main finding of this pilot was given mosthumanities researchers have a research problemthat can be facilitated by a standard set of queriesacross large-scale textual data that it would be moreefficient to train a focussed group of service pro-viders to be able to generate the results needed byresearchers than providing widespread training ofhumanities academics in this area HigherEducation already employs librarians to assist insearching and training for searching (informationliteracy) and providing this professional groupwith adjustable lsquorecipesrsquo for defined computationalqueries and background training on their use wouldsituate access to infrastructure in the resource towhich humanists already turn for assistancemdashtheirsubject librarianmdashand thereby normalize such com-putational work within the general humanitiesworkflow17 In turn research software engineerscould be invoked as collaborators for their expertisesuch as for developing more complex searchesbeyond the basic recipes rather than having torepeat the defined searches across data for differentresearchers which would allow limited resources tobe used efficiently and to build on existing frame-works of support from both the library and comput-ing services
Given issues in resourcing such facilities at everyUniversity it may be more efficient for multipleHigher Education Institutions to support a specialistservice perhaps under the umbrella of the likes ofJisc Historical Texts (httphistoricaltextsjiscacuk)or national or legal deposit libraries Expertise andapproaches if not the service itself could also befacilitated through the likes of Digital ResearchInfrastructure for the Arts and Humanities(DARIAH) (httpwwwdariaheu) and CommonLanguage Resources and Technology Infrastructure(CLARIN) (httpswwwclarineu) However localsupport for researchers wishing to utilize existingeScience technologies within the Higher Educationsector should be possible Such support enables Artsand Humanities researchers to develop ongoingmutually beneficial relationships with research
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 7 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
computing within their own institution This inturn can encourage other researchers to use theseresources (rather than them being only available as aspecialist service which users have to seek out)Research computing infrastructure across the uni-versity sector will not meet the needs of ArtsHumanities and Social Sciences researchers unlessacademics in these fields becomes active users ofthe systems and their requirements can be takeninto account going forward
6 Conclusion
We successfully mounted large-scale humanitiesdata on HPC University infrastructure in an inter-disciplinary project that required input from manyprofessionals to aid the humanities scholars intheir research tasks The collaborative approachwe undertook in this project is labour-intensiveand does not scale This should not however dis-courage the sector from taking this work forwardWe found that many research questions can beexpressed with similar computational queriesalbeit with parameters adjusted to suit We recom-mend therefore that Higher EducationInstitutions or HEI clusters looking to build cap-acity for enabling complex analysis of large-scaledigital collections by their non-computationallytrained humanities researchers should considerthe following activities
(1) Invest in research software engineer capacityto deploy and maintain openly licensed large-scale digital collections from across the GLAMsector to facilitate research in the arts huma-nities and social and historical sciences
(2) Invest in training library staff to run theseinitial queries in collaboration with huma-nities faculty to support work with subsetsof data that are produced and to documentand manage resulting code and derived data
Our pilot project demonstrates that there are atpresent too many technical hurdles for most indi-viduals in the arts and humanities to consider ana-lysing large-scale open data sets Those hurdles canbe removed with initial help in ingest and
deployment of the data and the provision of spe-cific structured training and support which willallow humanities researchers to get to a subset ofuseful data they can comfortably and more simplyprocess themselves without the need for extensivesupport While we together with our partners haveplans to continue expanding the range and depth ofresearch carried out on our chosen data set thisproject has signposted many of the barriers to en-courage greater uptake of lsquobig datarsquo research acrossthe Arts and Humanities These findings should beof use to researchers wishing to use comparableapproaches and to service providers in researchcomputing aiming to encourage the use of sharedcomputational facilities by the Arts and Humanitiescommunity
AcknowledgementsThis project was funded by a pilot grant from theJisc Research Data Spring (JISC 3548) between Apriland July 2015 See httpswwwjiscacukrdpro-jectsresearch-data-spring for further details Theauthors thank our colleagues in the British Libraryand UCL RITS (particularly Clare Gryce) for theirsupport Geoffrey Rockwell and Stefan Sinclair fordiscussions on the current limitations of text ana-lysis for individual humanities scholars and Scott BWeingart for his assistance
ReferencesAtkins D E Borgman C L Bindhoff N Ellisman
M Felman S Foster I and Heck A (2010)
lsquolsquoRCUK Review of e-Science 2009rsquorsquo Research Councils
UK httpswwwepsrcacuknewseventspubsrcuk-review-of-e-science-2009-building-a-uk-foundation-
for-the-transformative-enhancement-of-research-and-
innovation
Bates M J (1996) The Getty end-user online searching
project in the humanities report no 6 overview and
conclusions College and Research Libraries 57(6)514ndash23
Bedi S and Walde C (2016) Transforming roles
Canadian academic librarians embedded in faculty re-
search projects College and Research Libraries http
M Terras et al
8 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
crlacrlorgcontentearly20160322crl16-871
abstract
Bradigan P S and Mularski C A (1989) End-user
searching in a medical school curriculum an evaluated
modular approach Bulletin of the Medical Library
Association 77(4) 348ndash56
Brettle A Maden-Jenkins M Anderson L McNally
R Pratchett T Tancock J Thornton D and
Webb A (2011) Evaluating clinical librarian services
A systematic review Health Information and Libraries
Journal 28(1) 3ndash22
Burke J and Tumbleson B (2016) LMS embedded li-
brarianship and the educational role of librarians
Library Technology Reports 52(2) 5ndash9
Chadwick E (1842) Report on the sanitary condition of
the labouring population of Great Britain supplemen-
tary report on the results of special inquiry into the
practice of interment in towns (Vol 1) HM
Stationery Office
Donald D (1996) The Age of Caricature Satirical Prints
in the Reign of George III New Haven Published for the
Paul Mellon Centre for Studies in British Art by Yale
University Press
Farber M and Shoham S (2002) Users end-users
and end-user searchers of online information a his-
torical overview Online Information Review 26(2)
92ndash100
Feliu V and Frazer H (2012) Embedded librarians
teaching legal research as a lawyering skill Journal of
Legal Education 61(4) 540ndash59
Groen D Hetherington J Carver H B Nash R
W Bernabeu M O and Coveney P V (2013)
Analysing and modelling the performance of the
HemeLB lattice-Boltzmann simulation environ-
ment Journal of Computational Science 4(5) 412ndash
22
Hetherington J (2017) Question about data capacity
and use at UCL Response to Melissa Terras via email
27 January 2017
Hettrick S (2016) A not-so-brief history of Research
Software Engineers Software Sustainability Institute
httpswwwsoftwareacukblog2016-08-19-not-so-
brief-history-research-software-engineers
Huber M (2007) The Old Bailey Proceedings 1674-1834
Evaluating and annotating a corpus of 18th - and 19th-
century spoken English Studies in Variation Contacts and
Change in English 1 Annotating Variation and Change
httpwwwhelsinkifivariengseriesvolumes01huber
Hughes A (2009) Higher Education in a Web 20 WorldJISC httpwwwwebarchiveorgukwaybackarchive20140614042502httpwwwjiscacukpublicationsgeneralpublications2009heweb2aspx
Ikeda N R and Schwartz D G (1992) Impact ofend-user training on Pharmacy students a four-yearfollow-up study Bulletin of the Medical LibraryAssociation 80(2) 124ndash30
Lancaster F W Elzy C Zeter M J Metzler L andLow Y M (1994) Searching databases on CD-ROMcomparison of the results of end-user searching withresults from two modes of searching by skilled inter-mediaries RQ 33(3) 370ndash86
Leetaru K (2015) History as Big Data 500 Years of BookImages and Mapping Millions of Books Forbes Techhttpwwwforbescomsiteskalevleetaru20150916history-as-big-data-500-years-of-book-images-and-mapping-millions-of-books
Ludwig L Mixter J K and Emanuele M A (1988)User attitudes toward end-user literature searchingBulletin of the Medical Library Association 76(1)7ndash13
Lui A (2015) Data Collections and Datasets Curated byAlan Liu httpdhresourcesforprojectbuildingpbworkscomwpage69244469Data20Collections20and20Datasets
Mackenzie E (1817) A new and complete system ofmodern geography containing an accurate delineationof the world 2 Vols Printed and published byMackenzie and Dent Newcastle-upon-Tyne
Mahony S and Pierazzo E (2012) Teaching skills orteaching methodology In Hirsch B D (ed) DigitalHumanities Pedagogy Practices Principles and PoliticsOpen Book Publishers httpwwwopenbookpubl-isherscomproduct161digital-humanities-pedagogyndashpracticesndashprinciples-and-politics
Maidment B (2013) Comedy Caricature and the SocialOrder 1820-1850 Manchester Manchester UniversityPress
Markey K (2007a) Twenty-five years of end-user search-ing Part 1 research findings Journal of the AmericanSociety for Information Science and Technology 58(8)1071ndash81
Markey K (2007b) Twenty-five years of end-usersearching Part 2 future research directions Journal ofthe American Society for Information Science andTechnology 58(8) 1123ndash30
Murray T (2016)The forecast for special librariesJournal of Library Administration 56(2) 188ndash98
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 9 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
Rethlefsen M Farrell A Osterhaus Trzasko L and
Brigham T (2015) Librarian co-authors correlated
with higher quality reported search strategies in general
internal medicine systematic reviews Journal of Clinical
Epidemiology 68(6) 617ndash26
Rockwell G (2017) Question about maximum capacity
of Voyant tools Response to Melissa Terras via email 27
January 2017
Rockwell G and Sinclair S 2016 Hermeneutica
Computer-Assisted Interpretation in the Humanities
Cambridge MIT Press
Roper T (2015) The impact of the clinical librarian a
review Journal of EAHIL 11(4) 19ndash22
Schmidt B (2014) Shipping maps and how states see
Sapping Attention Blog httpsappingattentionblogspot
couk201403shipping-maps-and-how-states-seehtml
Sinclair S (2017) Question about maximum capacity of
Voyant tools Response to Melissa Terras via email 27
January 2017
Smith D A Cordell R and Dillon E M (2013)
Infectious texts modeling text reuse in nineteenth-
century newspapers In IEEE International Conference
on Big Data October 6-9 2013 Santa Clara
California IEEE pp 86ndash94 101109
BigData20136691675
Smith D Cordell R and Mullen A (2015)
Computational methods for uncovering reprinted
texts in Antebellum newspapers American Literary
History 27(3)
Starr S S and Renford B L (1987) Evaluation of a
program to teach health professionals to search
MEDLINE Bulletin of the Medical Library Association
75(3) 193ndash201
Stijnman A (2012) Engraving and Etching 1400-2000 A
History of the Development of Manual Intaglio
Printmaking Processes London Houten Netherlands
Archetype Publications in association with HES and
DE GRAAF Publishers
Straus S Eisinga A and Sackett D (2016) What
drove the evidence cart Bringing the library to the
bedside Journal of the Royal Society of Medicine
109(6) 241ndash7
Streipe T and Talley M (2013) Embedded librarian-
ship In Kroski E (ed) Law Librarianship in the Digital
Age Plymouth Scarecrow pp 13ndash30
Terras M (2009) Potentials and Problems in Applying High
Performance Computing for Research in the Arts and
Humanities Researching e-Science Analysis of Census
Holdings Digital Humanities Quarterly httpwwwdigi-
talhumanitiesorgdhqvol34000070000070html
Terras M (2015) Opening access to collections the
making and using of open digitised cultural content
Online Information Review 39(5) 733ndash52 httpwww
emeraldinsightcomdoifull101108OIR-06-2015-0193
Thomas J (2004) Pictorial Victorians The Inscription of
Values in Word and Image Athens Ohio University
Press
Voss A Asgari-Targhi M Procter R and
Fergusson D (2010) Adoption of e-Infrastructure
services configurations of practice Philosophical
Transactions Series A Mathematical Physical and
Engineering Sciences 368 4161ndash76 doi 101098
rsta20100162
Wall A J (1893) Asiatic cholera its history pathology
and modern treatment London H K Lewis
Wittek S Sinclair S and Milner M (2015) DREaM
Distant Reading Early Modernity Digital Humanities
2015 Global Digital Humanities Sydney Australia 29
Junendash3 July 2015 httpdh2015orgabstractsxml
WITTEK_Stephen_DREaM__Distant_Reading_Early_
ModerWITTEK_Stephen_DREaM__Distant_Reading_
Early_Modernityhtml
Wynne M (2013) The role of Clarin in digital trans-
formations in the humanities International Journal of
Humanities and Arts Computing 7(1-2) 89
Wynne M (2015) Big Data and Digital Transformations in
the Humanities are we there yet Textual Digital
Humanities and Social Sciences Data Aberdeen 21ndash22
September 2015 httpwwwslidesharenetmartinwynne
big-data-and-digital-transformations-in-the-humanities
Notes1 httpswwwbluksubjectsdigital-scholarship2 httpwwwuclacukdh3 httpwwwbartlettuclacukcasa4 httpwwwuclacukisdservicesresearch-it5 httpopenglamorg an initiative to promote free
and open access to digital cultural heritage data sets6 For example Mining the History of Medicine (http
nactemacukhom) or Language of the State of the
Union (httpwwwtheatlanticcompoliticsarchive
201501the-language-of-the-state-of-the-union
384575)7 For example CLARIN (httpclarineu) and
DARIAH (httpswwwdariaheu)
M Terras et al
10 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
8 The British Library has various digital data sets
including (but not limited to) 7 million pages of his-
toric newspapers 1 million out of copyright book il-
lustrations 100000s of scientific articles text from
over 60000 books 1000s of UK theses and various
digitized medieval manuscripts We chose here just
one of its large-scale data sets to work with in this
pilot phase For the terms under which the British
Library makes collections available see httpwww
blukaboutustermscopyright9 The books cover a wide range of subject areas includ-
ing philosophy history poetry and literature most
are in English For a full list of metadata for this book
collection see doi 1021250DB2110 httpswikircuclacukwikiLegion just one of the
HPC facilities available at UCL for researchers see
httpwwwuclacukisdservicesresearch-itresearch-
computing11 It is difficult to put a figure on what a manageable text
file size would be for a researcher if we take that as
being able to search through it on their own machine
without further technical assistance This is dependent
on access to processing power which varies consider-
ably between desktop and laptop computers and the
complexity of the search the researcher is intending to
carry out as well as the format and granularity of the
source text Most humanities researchers should now
be able to comfortably manipulate text files of around
100 MB if they have access to a modern machine the
amount of text those files will contain is dependent on
how complex the file formats are Programs and tools
available to assist in text analysis include Voyant Tools
(httpsvoyant-toolsorg) and R (httpswwwr-pro-
jectorg) The upper limit to using server-based
Voyant is 20 MB of text which should correlate to
around 1 million words depending on format
(Sinclair 2017) Voyant Server can be downloaded
and used locally with an upper limit of around
100 MB on a standard PCMac (Rockwell 2017)
Indexing engines such as Lucene (httplucene
apacheorg) can search larger amounts of text how-
ever visualization of results will require further pro-
cessing and can be complex For further exploration of
lsquocomputer-assisted interpretation in the humanitiesrsquo
(particularly using Voyant) see Rockwell and
Sinclair (2016) and an example of the approach of
generating manageable smaller corpora from a large-
scale Early Modern Sources can be found in Wittek et
al (2015)12 All code data visualizations and other outputs from
this pilot project are freely available at httpsgithub
comUCL-dataspring13 httpswwwuclacukdispeopleoliverdukewilliams14 httpswwwshefacukhistoryresearchstudentswil-
liam-finley15 For more on the UKrsquos eScience infrastructure see the
work of the eScience Institute httpwwwesiacuk
Plan-Europe is the Platform of National eScience
Centres in Europe (httpplan-europeeu) In the
USA the equivalent of eScience is known as
Cyberinfrastructure see the National Science
Foundationrsquos guides httpwwwnsfgovdivindex
jspdivfrac14ACI
16 httpccschemuclacuk17 This approach is similar to that employed in the 1980s
when computerized searching of specialist databases
was first available (Farber and Shoham 2002
Markey 2007a 2007b) At that time evaluations car-
ried out on programs to cascade train end users via
librarians reported increased user uptake and effi-
ciency as compared to direct end-user training by
database providers (see Starr and Renford 1987
Bradigan and Mularski 1989 Ikeda and Schwartz
1992 on the library-based training of medical staff to
use MEDLINE) Significantly studies showed that
while end users wanted to be able to search databases
themselves they also desired access to searches
mediated by librarians andor a team approach to
search utilizing the end userrsquos specific knowledge of
the language and jargon of their discipline and the
information professionalrsquos understanding of
BOOLEAN and other advanced search techniques
(Ludwig et al 1988 Lancaster et al 1994 Bates
1996) The move towards embedded librarians in
law practices (Feliu and Frazer 2012 Streipe and
Talley 2013 Murray 2016) and clinical librarians in
hospitals (Brettle et al 2011 Roper 2015 Straus et
al 2016) can be seen to be a modern iteration of a
team approach to search In Higher Education subject
librarians are well-positioned to form part of a research
team using computational methods andor provide
information literacy training that upskills humanists
in computational methods (Rethlefsen et al 2015
Bedi and Walde 2016 Burke and Tumbleson 2016)
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 11 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
diversified Although the graph is not normalized bythe number of images in the data set for each yearand is therefore dominated by the greater volume ofbooks in the data set after 1800 it has proved auseful reference and a new way into the macro pat-terns of book illustration
Secondly we wrote a script that could be runlocally in R to create graphs based on image datafor single books Plotting the page number on thex-axis and figures as a percentage of the page on they-axis the script generated a visual representation ofthe size of illustrations in a book Finley selectedbooks for analysis to observe patterns in the use ofillustrations in books on history geology and top-ography (the subjects of his doctoral research) Here(Fig 3) we see this for the 1817 A new and completeSystem of Modern Geography a two-volume workpublished in Newcastle upon Tyne by Mackenzie(1817) The discrepancies in use of illustrations be-tween the two volumes took Finley back to thephysical books to assess how the placement andsize of images changed the reading experience be-tween volumes and to compare the findings withsimilar multi-volume works
Subsequent to the project Finley has continuedto use the project data and scripts in his researchFor example he has used the image location data toplot and compare the average position images inbooks This has underscored the value of generatingderived data that can be used locally by a researcheroutside the context of a HPC facility and a fundedproject
5 Infrastructuralrecommendations
From a technical perspective this pilot highlightedvarious sticking points when using infrastructuredeveloped predominantly for scientific researchThe combined data input and output volumeundertaken during our work (less than 300 GB) isonly moderately large by comparison to the scien-tific data sets UCL RITS usually encounters for al-though there are shared assumptions betweenresearch infrastructures (adoption of technicalstandards and the sharing of tools approachesand research outputs (Wynne 2015)) most of theUKrsquos university eScience15 infrastructure has beenconstructed specifically to run scientific and engin-eering simulations not for search and analysis of thetypes of heterogeneous data sets we see emanatingfrom cultural heritage institutions We had a largetextual input (224 GB) a simple calculation and asmall output summary of only a few KB By com-parison the typical engineering simulationaddresses moderately sized numerical input dataruns a long complicated calculation and producesa large output (multiple TBs) The average data sizeof project using the UCL data storage service is44 TB (Hetherington 2017) For example thework of the UCL Centre for ComputationalScience16 on brain blood flow simulations takes aninput file of around 1 GB and for a full productionsimulation recording a snapshot once every 200time steps produces 20 TB of output (Groen etal 2013) Poor uptake in the Arts andHumanities (Atkins et al 2010 Voss et al 2010)has meant that these computational systems havenot been optimized for Arts and Humanities work-loads The file system and network configuration of
0
25
50
75
100
7505002500
Imag
e si
ze (
of p
age)
Page number
Illustration Area by Page for Book ID 002322267
002322267_1 002322267_2
Fig 3 A search for figures in A new and complete Systemof Modern Geography two volumes (Mackenzie 1817)plotted by page (x-axis) percentage of page the eachfigure occupies (y-axis) and separated by volume
M Terras et al
6 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
LegionmdashUCL RITSrsquos centrally funded resource forrunning complex and large computational scientificqueries across a large number of coresmdashdid notmatch the way that the data set in question wasstructured (a large number of small zipped XMLfiles)
The complexities associated with redeployingarchitectures designed to work with scientific data(massive yet very structured) to the processing ofhumanities data (not massive but more unstruc-tured) should not be understated and are a majorfinding of this project Relevant libraries (such as anefficient XML processor) were needed to be installedand optimized for the hardware Also the dataneeded to be transformed to a structure that the par-allel file system (Lustre) could address efficiently(that is fewer larger files) We found that the archi-tecture at UCL which was configured for effectivecompute of scientific data was inputoutput limitedfor our processing requirements rather than compu-tationally limited Understanding the needs of ouruser community has already fed into the procure-ment and development of HPC facilities at UCL toensure that the systemsmdashwhich are available to allresearchersmdashcan deal with the variety and type ofdata that digital humanists wish to analyse in future
Best practice recommendations for similar pro-jects emerged from this work the need to buildmultiple derived data sets (counts of books andwords per year words and pages per book etc) tonormalize results and maintain statistical validitythe necessity of documenting decisions taken whenprocessing data and metadata and the value ofhaving fixed definable data for researchers to ex-plain results in relation to (and in turn the risksassociated with iterating data sets) We also dis-covered that a core set of four or five queries gavemost of the humanities researchers the type of in-formation they required to take a subset of dataaway to process effectively themselves searches forall variants of a word searches that return keywordsin context traced over time NOT searches for aword or phrase that ignored another word orphrase searches for a word when in close proximityto a second word and searches based on imagemetadata It is the subset of the data set that mosthumanities scholars required and were happy to be
presented with for further analysis (with most re-searchers wishing to see their search term in contextpresented with the complete page of the text it wasfound within to allow informed understanding)
A main finding of this pilot was given mosthumanities researchers have a research problemthat can be facilitated by a standard set of queriesacross large-scale textual data that it would be moreefficient to train a focussed group of service pro-viders to be able to generate the results needed byresearchers than providing widespread training ofhumanities academics in this area HigherEducation already employs librarians to assist insearching and training for searching (informationliteracy) and providing this professional groupwith adjustable lsquorecipesrsquo for defined computationalqueries and background training on their use wouldsituate access to infrastructure in the resource towhich humanists already turn for assistancemdashtheirsubject librarianmdashand thereby normalize such com-putational work within the general humanitiesworkflow17 In turn research software engineerscould be invoked as collaborators for their expertisesuch as for developing more complex searchesbeyond the basic recipes rather than having torepeat the defined searches across data for differentresearchers which would allow limited resources tobe used efficiently and to build on existing frame-works of support from both the library and comput-ing services
Given issues in resourcing such facilities at everyUniversity it may be more efficient for multipleHigher Education Institutions to support a specialistservice perhaps under the umbrella of the likes ofJisc Historical Texts (httphistoricaltextsjiscacuk)or national or legal deposit libraries Expertise andapproaches if not the service itself could also befacilitated through the likes of Digital ResearchInfrastructure for the Arts and Humanities(DARIAH) (httpwwwdariaheu) and CommonLanguage Resources and Technology Infrastructure(CLARIN) (httpswwwclarineu) However localsupport for researchers wishing to utilize existingeScience technologies within the Higher Educationsector should be possible Such support enables Artsand Humanities researchers to develop ongoingmutually beneficial relationships with research
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 7 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
computing within their own institution This inturn can encourage other researchers to use theseresources (rather than them being only available as aspecialist service which users have to seek out)Research computing infrastructure across the uni-versity sector will not meet the needs of ArtsHumanities and Social Sciences researchers unlessacademics in these fields becomes active users ofthe systems and their requirements can be takeninto account going forward
6 Conclusion
We successfully mounted large-scale humanitiesdata on HPC University infrastructure in an inter-disciplinary project that required input from manyprofessionals to aid the humanities scholars intheir research tasks The collaborative approachwe undertook in this project is labour-intensiveand does not scale This should not however dis-courage the sector from taking this work forwardWe found that many research questions can beexpressed with similar computational queriesalbeit with parameters adjusted to suit We recom-mend therefore that Higher EducationInstitutions or HEI clusters looking to build cap-acity for enabling complex analysis of large-scaledigital collections by their non-computationallytrained humanities researchers should considerthe following activities
(1) Invest in research software engineer capacityto deploy and maintain openly licensed large-scale digital collections from across the GLAMsector to facilitate research in the arts huma-nities and social and historical sciences
(2) Invest in training library staff to run theseinitial queries in collaboration with huma-nities faculty to support work with subsetsof data that are produced and to documentand manage resulting code and derived data
Our pilot project demonstrates that there are atpresent too many technical hurdles for most indi-viduals in the arts and humanities to consider ana-lysing large-scale open data sets Those hurdles canbe removed with initial help in ingest and
deployment of the data and the provision of spe-cific structured training and support which willallow humanities researchers to get to a subset ofuseful data they can comfortably and more simplyprocess themselves without the need for extensivesupport While we together with our partners haveplans to continue expanding the range and depth ofresearch carried out on our chosen data set thisproject has signposted many of the barriers to en-courage greater uptake of lsquobig datarsquo research acrossthe Arts and Humanities These findings should beof use to researchers wishing to use comparableapproaches and to service providers in researchcomputing aiming to encourage the use of sharedcomputational facilities by the Arts and Humanitiescommunity
AcknowledgementsThis project was funded by a pilot grant from theJisc Research Data Spring (JISC 3548) between Apriland July 2015 See httpswwwjiscacukrdpro-jectsresearch-data-spring for further details Theauthors thank our colleagues in the British Libraryand UCL RITS (particularly Clare Gryce) for theirsupport Geoffrey Rockwell and Stefan Sinclair fordiscussions on the current limitations of text ana-lysis for individual humanities scholars and Scott BWeingart for his assistance
ReferencesAtkins D E Borgman C L Bindhoff N Ellisman
M Felman S Foster I and Heck A (2010)
lsquolsquoRCUK Review of e-Science 2009rsquorsquo Research Councils
UK httpswwwepsrcacuknewseventspubsrcuk-review-of-e-science-2009-building-a-uk-foundation-
for-the-transformative-enhancement-of-research-and-
innovation
Bates M J (1996) The Getty end-user online searching
project in the humanities report no 6 overview and
conclusions College and Research Libraries 57(6)514ndash23
Bedi S and Walde C (2016) Transforming roles
Canadian academic librarians embedded in faculty re-
search projects College and Research Libraries http
M Terras et al
8 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
crlacrlorgcontentearly20160322crl16-871
abstract
Bradigan P S and Mularski C A (1989) End-user
searching in a medical school curriculum an evaluated
modular approach Bulletin of the Medical Library
Association 77(4) 348ndash56
Brettle A Maden-Jenkins M Anderson L McNally
R Pratchett T Tancock J Thornton D and
Webb A (2011) Evaluating clinical librarian services
A systematic review Health Information and Libraries
Journal 28(1) 3ndash22
Burke J and Tumbleson B (2016) LMS embedded li-
brarianship and the educational role of librarians
Library Technology Reports 52(2) 5ndash9
Chadwick E (1842) Report on the sanitary condition of
the labouring population of Great Britain supplemen-
tary report on the results of special inquiry into the
practice of interment in towns (Vol 1) HM
Stationery Office
Donald D (1996) The Age of Caricature Satirical Prints
in the Reign of George III New Haven Published for the
Paul Mellon Centre for Studies in British Art by Yale
University Press
Farber M and Shoham S (2002) Users end-users
and end-user searchers of online information a his-
torical overview Online Information Review 26(2)
92ndash100
Feliu V and Frazer H (2012) Embedded librarians
teaching legal research as a lawyering skill Journal of
Legal Education 61(4) 540ndash59
Groen D Hetherington J Carver H B Nash R
W Bernabeu M O and Coveney P V (2013)
Analysing and modelling the performance of the
HemeLB lattice-Boltzmann simulation environ-
ment Journal of Computational Science 4(5) 412ndash
22
Hetherington J (2017) Question about data capacity
and use at UCL Response to Melissa Terras via email
27 January 2017
Hettrick S (2016) A not-so-brief history of Research
Software Engineers Software Sustainability Institute
httpswwwsoftwareacukblog2016-08-19-not-so-
brief-history-research-software-engineers
Huber M (2007) The Old Bailey Proceedings 1674-1834
Evaluating and annotating a corpus of 18th - and 19th-
century spoken English Studies in Variation Contacts and
Change in English 1 Annotating Variation and Change
httpwwwhelsinkifivariengseriesvolumes01huber
Hughes A (2009) Higher Education in a Web 20 WorldJISC httpwwwwebarchiveorgukwaybackarchive20140614042502httpwwwjiscacukpublicationsgeneralpublications2009heweb2aspx
Ikeda N R and Schwartz D G (1992) Impact ofend-user training on Pharmacy students a four-yearfollow-up study Bulletin of the Medical LibraryAssociation 80(2) 124ndash30
Lancaster F W Elzy C Zeter M J Metzler L andLow Y M (1994) Searching databases on CD-ROMcomparison of the results of end-user searching withresults from two modes of searching by skilled inter-mediaries RQ 33(3) 370ndash86
Leetaru K (2015) History as Big Data 500 Years of BookImages and Mapping Millions of Books Forbes Techhttpwwwforbescomsiteskalevleetaru20150916history-as-big-data-500-years-of-book-images-and-mapping-millions-of-books
Ludwig L Mixter J K and Emanuele M A (1988)User attitudes toward end-user literature searchingBulletin of the Medical Library Association 76(1)7ndash13
Lui A (2015) Data Collections and Datasets Curated byAlan Liu httpdhresourcesforprojectbuildingpbworkscomwpage69244469Data20Collections20and20Datasets
Mackenzie E (1817) A new and complete system ofmodern geography containing an accurate delineationof the world 2 Vols Printed and published byMackenzie and Dent Newcastle-upon-Tyne
Mahony S and Pierazzo E (2012) Teaching skills orteaching methodology In Hirsch B D (ed) DigitalHumanities Pedagogy Practices Principles and PoliticsOpen Book Publishers httpwwwopenbookpubl-isherscomproduct161digital-humanities-pedagogyndashpracticesndashprinciples-and-politics
Maidment B (2013) Comedy Caricature and the SocialOrder 1820-1850 Manchester Manchester UniversityPress
Markey K (2007a) Twenty-five years of end-user search-ing Part 1 research findings Journal of the AmericanSociety for Information Science and Technology 58(8)1071ndash81
Markey K (2007b) Twenty-five years of end-usersearching Part 2 future research directions Journal ofthe American Society for Information Science andTechnology 58(8) 1123ndash30
Murray T (2016)The forecast for special librariesJournal of Library Administration 56(2) 188ndash98
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 9 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
Rethlefsen M Farrell A Osterhaus Trzasko L and
Brigham T (2015) Librarian co-authors correlated
with higher quality reported search strategies in general
internal medicine systematic reviews Journal of Clinical
Epidemiology 68(6) 617ndash26
Rockwell G (2017) Question about maximum capacity
of Voyant tools Response to Melissa Terras via email 27
January 2017
Rockwell G and Sinclair S 2016 Hermeneutica
Computer-Assisted Interpretation in the Humanities
Cambridge MIT Press
Roper T (2015) The impact of the clinical librarian a
review Journal of EAHIL 11(4) 19ndash22
Schmidt B (2014) Shipping maps and how states see
Sapping Attention Blog httpsappingattentionblogspot
couk201403shipping-maps-and-how-states-seehtml
Sinclair S (2017) Question about maximum capacity of
Voyant tools Response to Melissa Terras via email 27
January 2017
Smith D A Cordell R and Dillon E M (2013)
Infectious texts modeling text reuse in nineteenth-
century newspapers In IEEE International Conference
on Big Data October 6-9 2013 Santa Clara
California IEEE pp 86ndash94 101109
BigData20136691675
Smith D Cordell R and Mullen A (2015)
Computational methods for uncovering reprinted
texts in Antebellum newspapers American Literary
History 27(3)
Starr S S and Renford B L (1987) Evaluation of a
program to teach health professionals to search
MEDLINE Bulletin of the Medical Library Association
75(3) 193ndash201
Stijnman A (2012) Engraving and Etching 1400-2000 A
History of the Development of Manual Intaglio
Printmaking Processes London Houten Netherlands
Archetype Publications in association with HES and
DE GRAAF Publishers
Straus S Eisinga A and Sackett D (2016) What
drove the evidence cart Bringing the library to the
bedside Journal of the Royal Society of Medicine
109(6) 241ndash7
Streipe T and Talley M (2013) Embedded librarian-
ship In Kroski E (ed) Law Librarianship in the Digital
Age Plymouth Scarecrow pp 13ndash30
Terras M (2009) Potentials and Problems in Applying High
Performance Computing for Research in the Arts and
Humanities Researching e-Science Analysis of Census
Holdings Digital Humanities Quarterly httpwwwdigi-
talhumanitiesorgdhqvol34000070000070html
Terras M (2015) Opening access to collections the
making and using of open digitised cultural content
Online Information Review 39(5) 733ndash52 httpwww
emeraldinsightcomdoifull101108OIR-06-2015-0193
Thomas J (2004) Pictorial Victorians The Inscription of
Values in Word and Image Athens Ohio University
Press
Voss A Asgari-Targhi M Procter R and
Fergusson D (2010) Adoption of e-Infrastructure
services configurations of practice Philosophical
Transactions Series A Mathematical Physical and
Engineering Sciences 368 4161ndash76 doi 101098
rsta20100162
Wall A J (1893) Asiatic cholera its history pathology
and modern treatment London H K Lewis
Wittek S Sinclair S and Milner M (2015) DREaM
Distant Reading Early Modernity Digital Humanities
2015 Global Digital Humanities Sydney Australia 29
Junendash3 July 2015 httpdh2015orgabstractsxml
WITTEK_Stephen_DREaM__Distant_Reading_Early_
ModerWITTEK_Stephen_DREaM__Distant_Reading_
Early_Modernityhtml
Wynne M (2013) The role of Clarin in digital trans-
formations in the humanities International Journal of
Humanities and Arts Computing 7(1-2) 89
Wynne M (2015) Big Data and Digital Transformations in
the Humanities are we there yet Textual Digital
Humanities and Social Sciences Data Aberdeen 21ndash22
September 2015 httpwwwslidesharenetmartinwynne
big-data-and-digital-transformations-in-the-humanities
Notes1 httpswwwbluksubjectsdigital-scholarship2 httpwwwuclacukdh3 httpwwwbartlettuclacukcasa4 httpwwwuclacukisdservicesresearch-it5 httpopenglamorg an initiative to promote free
and open access to digital cultural heritage data sets6 For example Mining the History of Medicine (http
nactemacukhom) or Language of the State of the
Union (httpwwwtheatlanticcompoliticsarchive
201501the-language-of-the-state-of-the-union
384575)7 For example CLARIN (httpclarineu) and
DARIAH (httpswwwdariaheu)
M Terras et al
10 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
8 The British Library has various digital data sets
including (but not limited to) 7 million pages of his-
toric newspapers 1 million out of copyright book il-
lustrations 100000s of scientific articles text from
over 60000 books 1000s of UK theses and various
digitized medieval manuscripts We chose here just
one of its large-scale data sets to work with in this
pilot phase For the terms under which the British
Library makes collections available see httpwww
blukaboutustermscopyright9 The books cover a wide range of subject areas includ-
ing philosophy history poetry and literature most
are in English For a full list of metadata for this book
collection see doi 1021250DB2110 httpswikircuclacukwikiLegion just one of the
HPC facilities available at UCL for researchers see
httpwwwuclacukisdservicesresearch-itresearch-
computing11 It is difficult to put a figure on what a manageable text
file size would be for a researcher if we take that as
being able to search through it on their own machine
without further technical assistance This is dependent
on access to processing power which varies consider-
ably between desktop and laptop computers and the
complexity of the search the researcher is intending to
carry out as well as the format and granularity of the
source text Most humanities researchers should now
be able to comfortably manipulate text files of around
100 MB if they have access to a modern machine the
amount of text those files will contain is dependent on
how complex the file formats are Programs and tools
available to assist in text analysis include Voyant Tools
(httpsvoyant-toolsorg) and R (httpswwwr-pro-
jectorg) The upper limit to using server-based
Voyant is 20 MB of text which should correlate to
around 1 million words depending on format
(Sinclair 2017) Voyant Server can be downloaded
and used locally with an upper limit of around
100 MB on a standard PCMac (Rockwell 2017)
Indexing engines such as Lucene (httplucene
apacheorg) can search larger amounts of text how-
ever visualization of results will require further pro-
cessing and can be complex For further exploration of
lsquocomputer-assisted interpretation in the humanitiesrsquo
(particularly using Voyant) see Rockwell and
Sinclair (2016) and an example of the approach of
generating manageable smaller corpora from a large-
scale Early Modern Sources can be found in Wittek et
al (2015)12 All code data visualizations and other outputs from
this pilot project are freely available at httpsgithub
comUCL-dataspring13 httpswwwuclacukdispeopleoliverdukewilliams14 httpswwwshefacukhistoryresearchstudentswil-
liam-finley15 For more on the UKrsquos eScience infrastructure see the
work of the eScience Institute httpwwwesiacuk
Plan-Europe is the Platform of National eScience
Centres in Europe (httpplan-europeeu) In the
USA the equivalent of eScience is known as
Cyberinfrastructure see the National Science
Foundationrsquos guides httpwwwnsfgovdivindex
jspdivfrac14ACI
16 httpccschemuclacuk17 This approach is similar to that employed in the 1980s
when computerized searching of specialist databases
was first available (Farber and Shoham 2002
Markey 2007a 2007b) At that time evaluations car-
ried out on programs to cascade train end users via
librarians reported increased user uptake and effi-
ciency as compared to direct end-user training by
database providers (see Starr and Renford 1987
Bradigan and Mularski 1989 Ikeda and Schwartz
1992 on the library-based training of medical staff to
use MEDLINE) Significantly studies showed that
while end users wanted to be able to search databases
themselves they also desired access to searches
mediated by librarians andor a team approach to
search utilizing the end userrsquos specific knowledge of
the language and jargon of their discipline and the
information professionalrsquos understanding of
BOOLEAN and other advanced search techniques
(Ludwig et al 1988 Lancaster et al 1994 Bates
1996) The move towards embedded librarians in
law practices (Feliu and Frazer 2012 Streipe and
Talley 2013 Murray 2016) and clinical librarians in
hospitals (Brettle et al 2011 Roper 2015 Straus et
al 2016) can be seen to be a modern iteration of a
team approach to search In Higher Education subject
librarians are well-positioned to form part of a research
team using computational methods andor provide
information literacy training that upskills humanists
in computational methods (Rethlefsen et al 2015
Bedi and Walde 2016 Burke and Tumbleson 2016)
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 11 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
LegionmdashUCL RITSrsquos centrally funded resource forrunning complex and large computational scientificqueries across a large number of coresmdashdid notmatch the way that the data set in question wasstructured (a large number of small zipped XMLfiles)
The complexities associated with redeployingarchitectures designed to work with scientific data(massive yet very structured) to the processing ofhumanities data (not massive but more unstruc-tured) should not be understated and are a majorfinding of this project Relevant libraries (such as anefficient XML processor) were needed to be installedand optimized for the hardware Also the dataneeded to be transformed to a structure that the par-allel file system (Lustre) could address efficiently(that is fewer larger files) We found that the archi-tecture at UCL which was configured for effectivecompute of scientific data was inputoutput limitedfor our processing requirements rather than compu-tationally limited Understanding the needs of ouruser community has already fed into the procure-ment and development of HPC facilities at UCL toensure that the systemsmdashwhich are available to allresearchersmdashcan deal with the variety and type ofdata that digital humanists wish to analyse in future
Best practice recommendations for similar pro-jects emerged from this work the need to buildmultiple derived data sets (counts of books andwords per year words and pages per book etc) tonormalize results and maintain statistical validitythe necessity of documenting decisions taken whenprocessing data and metadata and the value ofhaving fixed definable data for researchers to ex-plain results in relation to (and in turn the risksassociated with iterating data sets) We also dis-covered that a core set of four or five queries gavemost of the humanities researchers the type of in-formation they required to take a subset of dataaway to process effectively themselves searches forall variants of a word searches that return keywordsin context traced over time NOT searches for aword or phrase that ignored another word orphrase searches for a word when in close proximityto a second word and searches based on imagemetadata It is the subset of the data set that mosthumanities scholars required and were happy to be
presented with for further analysis (with most re-searchers wishing to see their search term in contextpresented with the complete page of the text it wasfound within to allow informed understanding)
A main finding of this pilot was given mosthumanities researchers have a research problemthat can be facilitated by a standard set of queriesacross large-scale textual data that it would be moreefficient to train a focussed group of service pro-viders to be able to generate the results needed byresearchers than providing widespread training ofhumanities academics in this area HigherEducation already employs librarians to assist insearching and training for searching (informationliteracy) and providing this professional groupwith adjustable lsquorecipesrsquo for defined computationalqueries and background training on their use wouldsituate access to infrastructure in the resource towhich humanists already turn for assistancemdashtheirsubject librarianmdashand thereby normalize such com-putational work within the general humanitiesworkflow17 In turn research software engineerscould be invoked as collaborators for their expertisesuch as for developing more complex searchesbeyond the basic recipes rather than having torepeat the defined searches across data for differentresearchers which would allow limited resources tobe used efficiently and to build on existing frame-works of support from both the library and comput-ing services
Given issues in resourcing such facilities at everyUniversity it may be more efficient for multipleHigher Education Institutions to support a specialistservice perhaps under the umbrella of the likes ofJisc Historical Texts (httphistoricaltextsjiscacuk)or national or legal deposit libraries Expertise andapproaches if not the service itself could also befacilitated through the likes of Digital ResearchInfrastructure for the Arts and Humanities(DARIAH) (httpwwwdariaheu) and CommonLanguage Resources and Technology Infrastructure(CLARIN) (httpswwwclarineu) However localsupport for researchers wishing to utilize existingeScience technologies within the Higher Educationsector should be possible Such support enables Artsand Humanities researchers to develop ongoingmutually beneficial relationships with research
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 7 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
computing within their own institution This inturn can encourage other researchers to use theseresources (rather than them being only available as aspecialist service which users have to seek out)Research computing infrastructure across the uni-versity sector will not meet the needs of ArtsHumanities and Social Sciences researchers unlessacademics in these fields becomes active users ofthe systems and their requirements can be takeninto account going forward
6 Conclusion
We successfully mounted large-scale humanitiesdata on HPC University infrastructure in an inter-disciplinary project that required input from manyprofessionals to aid the humanities scholars intheir research tasks The collaborative approachwe undertook in this project is labour-intensiveand does not scale This should not however dis-courage the sector from taking this work forwardWe found that many research questions can beexpressed with similar computational queriesalbeit with parameters adjusted to suit We recom-mend therefore that Higher EducationInstitutions or HEI clusters looking to build cap-acity for enabling complex analysis of large-scaledigital collections by their non-computationallytrained humanities researchers should considerthe following activities
(1) Invest in research software engineer capacityto deploy and maintain openly licensed large-scale digital collections from across the GLAMsector to facilitate research in the arts huma-nities and social and historical sciences
(2) Invest in training library staff to run theseinitial queries in collaboration with huma-nities faculty to support work with subsetsof data that are produced and to documentand manage resulting code and derived data
Our pilot project demonstrates that there are atpresent too many technical hurdles for most indi-viduals in the arts and humanities to consider ana-lysing large-scale open data sets Those hurdles canbe removed with initial help in ingest and
deployment of the data and the provision of spe-cific structured training and support which willallow humanities researchers to get to a subset ofuseful data they can comfortably and more simplyprocess themselves without the need for extensivesupport While we together with our partners haveplans to continue expanding the range and depth ofresearch carried out on our chosen data set thisproject has signposted many of the barriers to en-courage greater uptake of lsquobig datarsquo research acrossthe Arts and Humanities These findings should beof use to researchers wishing to use comparableapproaches and to service providers in researchcomputing aiming to encourage the use of sharedcomputational facilities by the Arts and Humanitiescommunity
AcknowledgementsThis project was funded by a pilot grant from theJisc Research Data Spring (JISC 3548) between Apriland July 2015 See httpswwwjiscacukrdpro-jectsresearch-data-spring for further details Theauthors thank our colleagues in the British Libraryand UCL RITS (particularly Clare Gryce) for theirsupport Geoffrey Rockwell and Stefan Sinclair fordiscussions on the current limitations of text ana-lysis for individual humanities scholars and Scott BWeingart for his assistance
ReferencesAtkins D E Borgman C L Bindhoff N Ellisman
M Felman S Foster I and Heck A (2010)
lsquolsquoRCUK Review of e-Science 2009rsquorsquo Research Councils
UK httpswwwepsrcacuknewseventspubsrcuk-review-of-e-science-2009-building-a-uk-foundation-
for-the-transformative-enhancement-of-research-and-
innovation
Bates M J (1996) The Getty end-user online searching
project in the humanities report no 6 overview and
conclusions College and Research Libraries 57(6)514ndash23
Bedi S and Walde C (2016) Transforming roles
Canadian academic librarians embedded in faculty re-
search projects College and Research Libraries http
M Terras et al
8 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
crlacrlorgcontentearly20160322crl16-871
abstract
Bradigan P S and Mularski C A (1989) End-user
searching in a medical school curriculum an evaluated
modular approach Bulletin of the Medical Library
Association 77(4) 348ndash56
Brettle A Maden-Jenkins M Anderson L McNally
R Pratchett T Tancock J Thornton D and
Webb A (2011) Evaluating clinical librarian services
A systematic review Health Information and Libraries
Journal 28(1) 3ndash22
Burke J and Tumbleson B (2016) LMS embedded li-
brarianship and the educational role of librarians
Library Technology Reports 52(2) 5ndash9
Chadwick E (1842) Report on the sanitary condition of
the labouring population of Great Britain supplemen-
tary report on the results of special inquiry into the
practice of interment in towns (Vol 1) HM
Stationery Office
Donald D (1996) The Age of Caricature Satirical Prints
in the Reign of George III New Haven Published for the
Paul Mellon Centre for Studies in British Art by Yale
University Press
Farber M and Shoham S (2002) Users end-users
and end-user searchers of online information a his-
torical overview Online Information Review 26(2)
92ndash100
Feliu V and Frazer H (2012) Embedded librarians
teaching legal research as a lawyering skill Journal of
Legal Education 61(4) 540ndash59
Groen D Hetherington J Carver H B Nash R
W Bernabeu M O and Coveney P V (2013)
Analysing and modelling the performance of the
HemeLB lattice-Boltzmann simulation environ-
ment Journal of Computational Science 4(5) 412ndash
22
Hetherington J (2017) Question about data capacity
and use at UCL Response to Melissa Terras via email
27 January 2017
Hettrick S (2016) A not-so-brief history of Research
Software Engineers Software Sustainability Institute
httpswwwsoftwareacukblog2016-08-19-not-so-
brief-history-research-software-engineers
Huber M (2007) The Old Bailey Proceedings 1674-1834
Evaluating and annotating a corpus of 18th - and 19th-
century spoken English Studies in Variation Contacts and
Change in English 1 Annotating Variation and Change
httpwwwhelsinkifivariengseriesvolumes01huber
Hughes A (2009) Higher Education in a Web 20 WorldJISC httpwwwwebarchiveorgukwaybackarchive20140614042502httpwwwjiscacukpublicationsgeneralpublications2009heweb2aspx
Ikeda N R and Schwartz D G (1992) Impact ofend-user training on Pharmacy students a four-yearfollow-up study Bulletin of the Medical LibraryAssociation 80(2) 124ndash30
Lancaster F W Elzy C Zeter M J Metzler L andLow Y M (1994) Searching databases on CD-ROMcomparison of the results of end-user searching withresults from two modes of searching by skilled inter-mediaries RQ 33(3) 370ndash86
Leetaru K (2015) History as Big Data 500 Years of BookImages and Mapping Millions of Books Forbes Techhttpwwwforbescomsiteskalevleetaru20150916history-as-big-data-500-years-of-book-images-and-mapping-millions-of-books
Ludwig L Mixter J K and Emanuele M A (1988)User attitudes toward end-user literature searchingBulletin of the Medical Library Association 76(1)7ndash13
Lui A (2015) Data Collections and Datasets Curated byAlan Liu httpdhresourcesforprojectbuildingpbworkscomwpage69244469Data20Collections20and20Datasets
Mackenzie E (1817) A new and complete system ofmodern geography containing an accurate delineationof the world 2 Vols Printed and published byMackenzie and Dent Newcastle-upon-Tyne
Mahony S and Pierazzo E (2012) Teaching skills orteaching methodology In Hirsch B D (ed) DigitalHumanities Pedagogy Practices Principles and PoliticsOpen Book Publishers httpwwwopenbookpubl-isherscomproduct161digital-humanities-pedagogyndashpracticesndashprinciples-and-politics
Maidment B (2013) Comedy Caricature and the SocialOrder 1820-1850 Manchester Manchester UniversityPress
Markey K (2007a) Twenty-five years of end-user search-ing Part 1 research findings Journal of the AmericanSociety for Information Science and Technology 58(8)1071ndash81
Markey K (2007b) Twenty-five years of end-usersearching Part 2 future research directions Journal ofthe American Society for Information Science andTechnology 58(8) 1123ndash30
Murray T (2016)The forecast for special librariesJournal of Library Administration 56(2) 188ndash98
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 9 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
Rethlefsen M Farrell A Osterhaus Trzasko L and
Brigham T (2015) Librarian co-authors correlated
with higher quality reported search strategies in general
internal medicine systematic reviews Journal of Clinical
Epidemiology 68(6) 617ndash26
Rockwell G (2017) Question about maximum capacity
of Voyant tools Response to Melissa Terras via email 27
January 2017
Rockwell G and Sinclair S 2016 Hermeneutica
Computer-Assisted Interpretation in the Humanities
Cambridge MIT Press
Roper T (2015) The impact of the clinical librarian a
review Journal of EAHIL 11(4) 19ndash22
Schmidt B (2014) Shipping maps and how states see
Sapping Attention Blog httpsappingattentionblogspot
couk201403shipping-maps-and-how-states-seehtml
Sinclair S (2017) Question about maximum capacity of
Voyant tools Response to Melissa Terras via email 27
January 2017
Smith D A Cordell R and Dillon E M (2013)
Infectious texts modeling text reuse in nineteenth-
century newspapers In IEEE International Conference
on Big Data October 6-9 2013 Santa Clara
California IEEE pp 86ndash94 101109
BigData20136691675
Smith D Cordell R and Mullen A (2015)
Computational methods for uncovering reprinted
texts in Antebellum newspapers American Literary
History 27(3)
Starr S S and Renford B L (1987) Evaluation of a
program to teach health professionals to search
MEDLINE Bulletin of the Medical Library Association
75(3) 193ndash201
Stijnman A (2012) Engraving and Etching 1400-2000 A
History of the Development of Manual Intaglio
Printmaking Processes London Houten Netherlands
Archetype Publications in association with HES and
DE GRAAF Publishers
Straus S Eisinga A and Sackett D (2016) What
drove the evidence cart Bringing the library to the
bedside Journal of the Royal Society of Medicine
109(6) 241ndash7
Streipe T and Talley M (2013) Embedded librarian-
ship In Kroski E (ed) Law Librarianship in the Digital
Age Plymouth Scarecrow pp 13ndash30
Terras M (2009) Potentials and Problems in Applying High
Performance Computing for Research in the Arts and
Humanities Researching e-Science Analysis of Census
Holdings Digital Humanities Quarterly httpwwwdigi-
talhumanitiesorgdhqvol34000070000070html
Terras M (2015) Opening access to collections the
making and using of open digitised cultural content
Online Information Review 39(5) 733ndash52 httpwww
emeraldinsightcomdoifull101108OIR-06-2015-0193
Thomas J (2004) Pictorial Victorians The Inscription of
Values in Word and Image Athens Ohio University
Press
Voss A Asgari-Targhi M Procter R and
Fergusson D (2010) Adoption of e-Infrastructure
services configurations of practice Philosophical
Transactions Series A Mathematical Physical and
Engineering Sciences 368 4161ndash76 doi 101098
rsta20100162
Wall A J (1893) Asiatic cholera its history pathology
and modern treatment London H K Lewis
Wittek S Sinclair S and Milner M (2015) DREaM
Distant Reading Early Modernity Digital Humanities
2015 Global Digital Humanities Sydney Australia 29
Junendash3 July 2015 httpdh2015orgabstractsxml
WITTEK_Stephen_DREaM__Distant_Reading_Early_
ModerWITTEK_Stephen_DREaM__Distant_Reading_
Early_Modernityhtml
Wynne M (2013) The role of Clarin in digital trans-
formations in the humanities International Journal of
Humanities and Arts Computing 7(1-2) 89
Wynne M (2015) Big Data and Digital Transformations in
the Humanities are we there yet Textual Digital
Humanities and Social Sciences Data Aberdeen 21ndash22
September 2015 httpwwwslidesharenetmartinwynne
big-data-and-digital-transformations-in-the-humanities
Notes1 httpswwwbluksubjectsdigital-scholarship2 httpwwwuclacukdh3 httpwwwbartlettuclacukcasa4 httpwwwuclacukisdservicesresearch-it5 httpopenglamorg an initiative to promote free
and open access to digital cultural heritage data sets6 For example Mining the History of Medicine (http
nactemacukhom) or Language of the State of the
Union (httpwwwtheatlanticcompoliticsarchive
201501the-language-of-the-state-of-the-union
384575)7 For example CLARIN (httpclarineu) and
DARIAH (httpswwwdariaheu)
M Terras et al
10 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
8 The British Library has various digital data sets
including (but not limited to) 7 million pages of his-
toric newspapers 1 million out of copyright book il-
lustrations 100000s of scientific articles text from
over 60000 books 1000s of UK theses and various
digitized medieval manuscripts We chose here just
one of its large-scale data sets to work with in this
pilot phase For the terms under which the British
Library makes collections available see httpwww
blukaboutustermscopyright9 The books cover a wide range of subject areas includ-
ing philosophy history poetry and literature most
are in English For a full list of metadata for this book
collection see doi 1021250DB2110 httpswikircuclacukwikiLegion just one of the
HPC facilities available at UCL for researchers see
httpwwwuclacukisdservicesresearch-itresearch-
computing11 It is difficult to put a figure on what a manageable text
file size would be for a researcher if we take that as
being able to search through it on their own machine
without further technical assistance This is dependent
on access to processing power which varies consider-
ably between desktop and laptop computers and the
complexity of the search the researcher is intending to
carry out as well as the format and granularity of the
source text Most humanities researchers should now
be able to comfortably manipulate text files of around
100 MB if they have access to a modern machine the
amount of text those files will contain is dependent on
how complex the file formats are Programs and tools
available to assist in text analysis include Voyant Tools
(httpsvoyant-toolsorg) and R (httpswwwr-pro-
jectorg) The upper limit to using server-based
Voyant is 20 MB of text which should correlate to
around 1 million words depending on format
(Sinclair 2017) Voyant Server can be downloaded
and used locally with an upper limit of around
100 MB on a standard PCMac (Rockwell 2017)
Indexing engines such as Lucene (httplucene
apacheorg) can search larger amounts of text how-
ever visualization of results will require further pro-
cessing and can be complex For further exploration of
lsquocomputer-assisted interpretation in the humanitiesrsquo
(particularly using Voyant) see Rockwell and
Sinclair (2016) and an example of the approach of
generating manageable smaller corpora from a large-
scale Early Modern Sources can be found in Wittek et
al (2015)12 All code data visualizations and other outputs from
this pilot project are freely available at httpsgithub
comUCL-dataspring13 httpswwwuclacukdispeopleoliverdukewilliams14 httpswwwshefacukhistoryresearchstudentswil-
liam-finley15 For more on the UKrsquos eScience infrastructure see the
work of the eScience Institute httpwwwesiacuk
Plan-Europe is the Platform of National eScience
Centres in Europe (httpplan-europeeu) In the
USA the equivalent of eScience is known as
Cyberinfrastructure see the National Science
Foundationrsquos guides httpwwwnsfgovdivindex
jspdivfrac14ACI
16 httpccschemuclacuk17 This approach is similar to that employed in the 1980s
when computerized searching of specialist databases
was first available (Farber and Shoham 2002
Markey 2007a 2007b) At that time evaluations car-
ried out on programs to cascade train end users via
librarians reported increased user uptake and effi-
ciency as compared to direct end-user training by
database providers (see Starr and Renford 1987
Bradigan and Mularski 1989 Ikeda and Schwartz
1992 on the library-based training of medical staff to
use MEDLINE) Significantly studies showed that
while end users wanted to be able to search databases
themselves they also desired access to searches
mediated by librarians andor a team approach to
search utilizing the end userrsquos specific knowledge of
the language and jargon of their discipline and the
information professionalrsquos understanding of
BOOLEAN and other advanced search techniques
(Ludwig et al 1988 Lancaster et al 1994 Bates
1996) The move towards embedded librarians in
law practices (Feliu and Frazer 2012 Streipe and
Talley 2013 Murray 2016) and clinical librarians in
hospitals (Brettle et al 2011 Roper 2015 Straus et
al 2016) can be seen to be a modern iteration of a
team approach to search In Higher Education subject
librarians are well-positioned to form part of a research
team using computational methods andor provide
information literacy training that upskills humanists
in computational methods (Rethlefsen et al 2015
Bedi and Walde 2016 Burke and Tumbleson 2016)
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 11 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
computing within their own institution This inturn can encourage other researchers to use theseresources (rather than them being only available as aspecialist service which users have to seek out)Research computing infrastructure across the uni-versity sector will not meet the needs of ArtsHumanities and Social Sciences researchers unlessacademics in these fields becomes active users ofthe systems and their requirements can be takeninto account going forward
6 Conclusion
We successfully mounted large-scale humanitiesdata on HPC University infrastructure in an inter-disciplinary project that required input from manyprofessionals to aid the humanities scholars intheir research tasks The collaborative approachwe undertook in this project is labour-intensiveand does not scale This should not however dis-courage the sector from taking this work forwardWe found that many research questions can beexpressed with similar computational queriesalbeit with parameters adjusted to suit We recom-mend therefore that Higher EducationInstitutions or HEI clusters looking to build cap-acity for enabling complex analysis of large-scaledigital collections by their non-computationallytrained humanities researchers should considerthe following activities
(1) Invest in research software engineer capacityto deploy and maintain openly licensed large-scale digital collections from across the GLAMsector to facilitate research in the arts huma-nities and social and historical sciences
(2) Invest in training library staff to run theseinitial queries in collaboration with huma-nities faculty to support work with subsetsof data that are produced and to documentand manage resulting code and derived data
Our pilot project demonstrates that there are atpresent too many technical hurdles for most indi-viduals in the arts and humanities to consider ana-lysing large-scale open data sets Those hurdles canbe removed with initial help in ingest and
deployment of the data and the provision of spe-cific structured training and support which willallow humanities researchers to get to a subset ofuseful data they can comfortably and more simplyprocess themselves without the need for extensivesupport While we together with our partners haveplans to continue expanding the range and depth ofresearch carried out on our chosen data set thisproject has signposted many of the barriers to en-courage greater uptake of lsquobig datarsquo research acrossthe Arts and Humanities These findings should beof use to researchers wishing to use comparableapproaches and to service providers in researchcomputing aiming to encourage the use of sharedcomputational facilities by the Arts and Humanitiescommunity
AcknowledgementsThis project was funded by a pilot grant from theJisc Research Data Spring (JISC 3548) between Apriland July 2015 See httpswwwjiscacukrdpro-jectsresearch-data-spring for further details Theauthors thank our colleagues in the British Libraryand UCL RITS (particularly Clare Gryce) for theirsupport Geoffrey Rockwell and Stefan Sinclair fordiscussions on the current limitations of text ana-lysis for individual humanities scholars and Scott BWeingart for his assistance
ReferencesAtkins D E Borgman C L Bindhoff N Ellisman
M Felman S Foster I and Heck A (2010)
lsquolsquoRCUK Review of e-Science 2009rsquorsquo Research Councils
UK httpswwwepsrcacuknewseventspubsrcuk-review-of-e-science-2009-building-a-uk-foundation-
for-the-transformative-enhancement-of-research-and-
innovation
Bates M J (1996) The Getty end-user online searching
project in the humanities report no 6 overview and
conclusions College and Research Libraries 57(6)514ndash23
Bedi S and Walde C (2016) Transforming roles
Canadian academic librarians embedded in faculty re-
search projects College and Research Libraries http
M Terras et al
8 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
crlacrlorgcontentearly20160322crl16-871
abstract
Bradigan P S and Mularski C A (1989) End-user
searching in a medical school curriculum an evaluated
modular approach Bulletin of the Medical Library
Association 77(4) 348ndash56
Brettle A Maden-Jenkins M Anderson L McNally
R Pratchett T Tancock J Thornton D and
Webb A (2011) Evaluating clinical librarian services
A systematic review Health Information and Libraries
Journal 28(1) 3ndash22
Burke J and Tumbleson B (2016) LMS embedded li-
brarianship and the educational role of librarians
Library Technology Reports 52(2) 5ndash9
Chadwick E (1842) Report on the sanitary condition of
the labouring population of Great Britain supplemen-
tary report on the results of special inquiry into the
practice of interment in towns (Vol 1) HM
Stationery Office
Donald D (1996) The Age of Caricature Satirical Prints
in the Reign of George III New Haven Published for the
Paul Mellon Centre for Studies in British Art by Yale
University Press
Farber M and Shoham S (2002) Users end-users
and end-user searchers of online information a his-
torical overview Online Information Review 26(2)
92ndash100
Feliu V and Frazer H (2012) Embedded librarians
teaching legal research as a lawyering skill Journal of
Legal Education 61(4) 540ndash59
Groen D Hetherington J Carver H B Nash R
W Bernabeu M O and Coveney P V (2013)
Analysing and modelling the performance of the
HemeLB lattice-Boltzmann simulation environ-
ment Journal of Computational Science 4(5) 412ndash
22
Hetherington J (2017) Question about data capacity
and use at UCL Response to Melissa Terras via email
27 January 2017
Hettrick S (2016) A not-so-brief history of Research
Software Engineers Software Sustainability Institute
httpswwwsoftwareacukblog2016-08-19-not-so-
brief-history-research-software-engineers
Huber M (2007) The Old Bailey Proceedings 1674-1834
Evaluating and annotating a corpus of 18th - and 19th-
century spoken English Studies in Variation Contacts and
Change in English 1 Annotating Variation and Change
httpwwwhelsinkifivariengseriesvolumes01huber
Hughes A (2009) Higher Education in a Web 20 WorldJISC httpwwwwebarchiveorgukwaybackarchive20140614042502httpwwwjiscacukpublicationsgeneralpublications2009heweb2aspx
Ikeda N R and Schwartz D G (1992) Impact ofend-user training on Pharmacy students a four-yearfollow-up study Bulletin of the Medical LibraryAssociation 80(2) 124ndash30
Lancaster F W Elzy C Zeter M J Metzler L andLow Y M (1994) Searching databases on CD-ROMcomparison of the results of end-user searching withresults from two modes of searching by skilled inter-mediaries RQ 33(3) 370ndash86
Leetaru K (2015) History as Big Data 500 Years of BookImages and Mapping Millions of Books Forbes Techhttpwwwforbescomsiteskalevleetaru20150916history-as-big-data-500-years-of-book-images-and-mapping-millions-of-books
Ludwig L Mixter J K and Emanuele M A (1988)User attitudes toward end-user literature searchingBulletin of the Medical Library Association 76(1)7ndash13
Lui A (2015) Data Collections and Datasets Curated byAlan Liu httpdhresourcesforprojectbuildingpbworkscomwpage69244469Data20Collections20and20Datasets
Mackenzie E (1817) A new and complete system ofmodern geography containing an accurate delineationof the world 2 Vols Printed and published byMackenzie and Dent Newcastle-upon-Tyne
Mahony S and Pierazzo E (2012) Teaching skills orteaching methodology In Hirsch B D (ed) DigitalHumanities Pedagogy Practices Principles and PoliticsOpen Book Publishers httpwwwopenbookpubl-isherscomproduct161digital-humanities-pedagogyndashpracticesndashprinciples-and-politics
Maidment B (2013) Comedy Caricature and the SocialOrder 1820-1850 Manchester Manchester UniversityPress
Markey K (2007a) Twenty-five years of end-user search-ing Part 1 research findings Journal of the AmericanSociety for Information Science and Technology 58(8)1071ndash81
Markey K (2007b) Twenty-five years of end-usersearching Part 2 future research directions Journal ofthe American Society for Information Science andTechnology 58(8) 1123ndash30
Murray T (2016)The forecast for special librariesJournal of Library Administration 56(2) 188ndash98
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 9 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
Rethlefsen M Farrell A Osterhaus Trzasko L and
Brigham T (2015) Librarian co-authors correlated
with higher quality reported search strategies in general
internal medicine systematic reviews Journal of Clinical
Epidemiology 68(6) 617ndash26
Rockwell G (2017) Question about maximum capacity
of Voyant tools Response to Melissa Terras via email 27
January 2017
Rockwell G and Sinclair S 2016 Hermeneutica
Computer-Assisted Interpretation in the Humanities
Cambridge MIT Press
Roper T (2015) The impact of the clinical librarian a
review Journal of EAHIL 11(4) 19ndash22
Schmidt B (2014) Shipping maps and how states see
Sapping Attention Blog httpsappingattentionblogspot
couk201403shipping-maps-and-how-states-seehtml
Sinclair S (2017) Question about maximum capacity of
Voyant tools Response to Melissa Terras via email 27
January 2017
Smith D A Cordell R and Dillon E M (2013)
Infectious texts modeling text reuse in nineteenth-
century newspapers In IEEE International Conference
on Big Data October 6-9 2013 Santa Clara
California IEEE pp 86ndash94 101109
BigData20136691675
Smith D Cordell R and Mullen A (2015)
Computational methods for uncovering reprinted
texts in Antebellum newspapers American Literary
History 27(3)
Starr S S and Renford B L (1987) Evaluation of a
program to teach health professionals to search
MEDLINE Bulletin of the Medical Library Association
75(3) 193ndash201
Stijnman A (2012) Engraving and Etching 1400-2000 A
History of the Development of Manual Intaglio
Printmaking Processes London Houten Netherlands
Archetype Publications in association with HES and
DE GRAAF Publishers
Straus S Eisinga A and Sackett D (2016) What
drove the evidence cart Bringing the library to the
bedside Journal of the Royal Society of Medicine
109(6) 241ndash7
Streipe T and Talley M (2013) Embedded librarian-
ship In Kroski E (ed) Law Librarianship in the Digital
Age Plymouth Scarecrow pp 13ndash30
Terras M (2009) Potentials and Problems in Applying High
Performance Computing for Research in the Arts and
Humanities Researching e-Science Analysis of Census
Holdings Digital Humanities Quarterly httpwwwdigi-
talhumanitiesorgdhqvol34000070000070html
Terras M (2015) Opening access to collections the
making and using of open digitised cultural content
Online Information Review 39(5) 733ndash52 httpwww
emeraldinsightcomdoifull101108OIR-06-2015-0193
Thomas J (2004) Pictorial Victorians The Inscription of
Values in Word and Image Athens Ohio University
Press
Voss A Asgari-Targhi M Procter R and
Fergusson D (2010) Adoption of e-Infrastructure
services configurations of practice Philosophical
Transactions Series A Mathematical Physical and
Engineering Sciences 368 4161ndash76 doi 101098
rsta20100162
Wall A J (1893) Asiatic cholera its history pathology
and modern treatment London H K Lewis
Wittek S Sinclair S and Milner M (2015) DREaM
Distant Reading Early Modernity Digital Humanities
2015 Global Digital Humanities Sydney Australia 29
Junendash3 July 2015 httpdh2015orgabstractsxml
WITTEK_Stephen_DREaM__Distant_Reading_Early_
ModerWITTEK_Stephen_DREaM__Distant_Reading_
Early_Modernityhtml
Wynne M (2013) The role of Clarin in digital trans-
formations in the humanities International Journal of
Humanities and Arts Computing 7(1-2) 89
Wynne M (2015) Big Data and Digital Transformations in
the Humanities are we there yet Textual Digital
Humanities and Social Sciences Data Aberdeen 21ndash22
September 2015 httpwwwslidesharenetmartinwynne
big-data-and-digital-transformations-in-the-humanities
Notes1 httpswwwbluksubjectsdigital-scholarship2 httpwwwuclacukdh3 httpwwwbartlettuclacukcasa4 httpwwwuclacukisdservicesresearch-it5 httpopenglamorg an initiative to promote free
and open access to digital cultural heritage data sets6 For example Mining the History of Medicine (http
nactemacukhom) or Language of the State of the
Union (httpwwwtheatlanticcompoliticsarchive
201501the-language-of-the-state-of-the-union
384575)7 For example CLARIN (httpclarineu) and
DARIAH (httpswwwdariaheu)
M Terras et al
10 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
8 The British Library has various digital data sets
including (but not limited to) 7 million pages of his-
toric newspapers 1 million out of copyright book il-
lustrations 100000s of scientific articles text from
over 60000 books 1000s of UK theses and various
digitized medieval manuscripts We chose here just
one of its large-scale data sets to work with in this
pilot phase For the terms under which the British
Library makes collections available see httpwww
blukaboutustermscopyright9 The books cover a wide range of subject areas includ-
ing philosophy history poetry and literature most
are in English For a full list of metadata for this book
collection see doi 1021250DB2110 httpswikircuclacukwikiLegion just one of the
HPC facilities available at UCL for researchers see
httpwwwuclacukisdservicesresearch-itresearch-
computing11 It is difficult to put a figure on what a manageable text
file size would be for a researcher if we take that as
being able to search through it on their own machine
without further technical assistance This is dependent
on access to processing power which varies consider-
ably between desktop and laptop computers and the
complexity of the search the researcher is intending to
carry out as well as the format and granularity of the
source text Most humanities researchers should now
be able to comfortably manipulate text files of around
100 MB if they have access to a modern machine the
amount of text those files will contain is dependent on
how complex the file formats are Programs and tools
available to assist in text analysis include Voyant Tools
(httpsvoyant-toolsorg) and R (httpswwwr-pro-
jectorg) The upper limit to using server-based
Voyant is 20 MB of text which should correlate to
around 1 million words depending on format
(Sinclair 2017) Voyant Server can be downloaded
and used locally with an upper limit of around
100 MB on a standard PCMac (Rockwell 2017)
Indexing engines such as Lucene (httplucene
apacheorg) can search larger amounts of text how-
ever visualization of results will require further pro-
cessing and can be complex For further exploration of
lsquocomputer-assisted interpretation in the humanitiesrsquo
(particularly using Voyant) see Rockwell and
Sinclair (2016) and an example of the approach of
generating manageable smaller corpora from a large-
scale Early Modern Sources can be found in Wittek et
al (2015)12 All code data visualizations and other outputs from
this pilot project are freely available at httpsgithub
comUCL-dataspring13 httpswwwuclacukdispeopleoliverdukewilliams14 httpswwwshefacukhistoryresearchstudentswil-
liam-finley15 For more on the UKrsquos eScience infrastructure see the
work of the eScience Institute httpwwwesiacuk
Plan-Europe is the Platform of National eScience
Centres in Europe (httpplan-europeeu) In the
USA the equivalent of eScience is known as
Cyberinfrastructure see the National Science
Foundationrsquos guides httpwwwnsfgovdivindex
jspdivfrac14ACI
16 httpccschemuclacuk17 This approach is similar to that employed in the 1980s
when computerized searching of specialist databases
was first available (Farber and Shoham 2002
Markey 2007a 2007b) At that time evaluations car-
ried out on programs to cascade train end users via
librarians reported increased user uptake and effi-
ciency as compared to direct end-user training by
database providers (see Starr and Renford 1987
Bradigan and Mularski 1989 Ikeda and Schwartz
1992 on the library-based training of medical staff to
use MEDLINE) Significantly studies showed that
while end users wanted to be able to search databases
themselves they also desired access to searches
mediated by librarians andor a team approach to
search utilizing the end userrsquos specific knowledge of
the language and jargon of their discipline and the
information professionalrsquos understanding of
BOOLEAN and other advanced search techniques
(Ludwig et al 1988 Lancaster et al 1994 Bates
1996) The move towards embedded librarians in
law practices (Feliu and Frazer 2012 Streipe and
Talley 2013 Murray 2016) and clinical librarians in
hospitals (Brettle et al 2011 Roper 2015 Straus et
al 2016) can be seen to be a modern iteration of a
team approach to search In Higher Education subject
librarians are well-positioned to form part of a research
team using computational methods andor provide
information literacy training that upskills humanists
in computational methods (Rethlefsen et al 2015
Bedi and Walde 2016 Burke and Tumbleson 2016)
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 11 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
crlacrlorgcontentearly20160322crl16-871
abstract
Bradigan P S and Mularski C A (1989) End-user
searching in a medical school curriculum an evaluated
modular approach Bulletin of the Medical Library
Association 77(4) 348ndash56
Brettle A Maden-Jenkins M Anderson L McNally
R Pratchett T Tancock J Thornton D and
Webb A (2011) Evaluating clinical librarian services
A systematic review Health Information and Libraries
Journal 28(1) 3ndash22
Burke J and Tumbleson B (2016) LMS embedded li-
brarianship and the educational role of librarians
Library Technology Reports 52(2) 5ndash9
Chadwick E (1842) Report on the sanitary condition of
the labouring population of Great Britain supplemen-
tary report on the results of special inquiry into the
practice of interment in towns (Vol 1) HM
Stationery Office
Donald D (1996) The Age of Caricature Satirical Prints
in the Reign of George III New Haven Published for the
Paul Mellon Centre for Studies in British Art by Yale
University Press
Farber M and Shoham S (2002) Users end-users
and end-user searchers of online information a his-
torical overview Online Information Review 26(2)
92ndash100
Feliu V and Frazer H (2012) Embedded librarians
teaching legal research as a lawyering skill Journal of
Legal Education 61(4) 540ndash59
Groen D Hetherington J Carver H B Nash R
W Bernabeu M O and Coveney P V (2013)
Analysing and modelling the performance of the
HemeLB lattice-Boltzmann simulation environ-
ment Journal of Computational Science 4(5) 412ndash
22
Hetherington J (2017) Question about data capacity
and use at UCL Response to Melissa Terras via email
27 January 2017
Hettrick S (2016) A not-so-brief history of Research
Software Engineers Software Sustainability Institute
httpswwwsoftwareacukblog2016-08-19-not-so-
brief-history-research-software-engineers
Huber M (2007) The Old Bailey Proceedings 1674-1834
Evaluating and annotating a corpus of 18th - and 19th-
century spoken English Studies in Variation Contacts and
Change in English 1 Annotating Variation and Change
httpwwwhelsinkifivariengseriesvolumes01huber
Hughes A (2009) Higher Education in a Web 20 WorldJISC httpwwwwebarchiveorgukwaybackarchive20140614042502httpwwwjiscacukpublicationsgeneralpublications2009heweb2aspx
Ikeda N R and Schwartz D G (1992) Impact ofend-user training on Pharmacy students a four-yearfollow-up study Bulletin of the Medical LibraryAssociation 80(2) 124ndash30
Lancaster F W Elzy C Zeter M J Metzler L andLow Y M (1994) Searching databases on CD-ROMcomparison of the results of end-user searching withresults from two modes of searching by skilled inter-mediaries RQ 33(3) 370ndash86
Leetaru K (2015) History as Big Data 500 Years of BookImages and Mapping Millions of Books Forbes Techhttpwwwforbescomsiteskalevleetaru20150916history-as-big-data-500-years-of-book-images-and-mapping-millions-of-books
Ludwig L Mixter J K and Emanuele M A (1988)User attitudes toward end-user literature searchingBulletin of the Medical Library Association 76(1)7ndash13
Lui A (2015) Data Collections and Datasets Curated byAlan Liu httpdhresourcesforprojectbuildingpbworkscomwpage69244469Data20Collections20and20Datasets
Mackenzie E (1817) A new and complete system ofmodern geography containing an accurate delineationof the world 2 Vols Printed and published byMackenzie and Dent Newcastle-upon-Tyne
Mahony S and Pierazzo E (2012) Teaching skills orteaching methodology In Hirsch B D (ed) DigitalHumanities Pedagogy Practices Principles and PoliticsOpen Book Publishers httpwwwopenbookpubl-isherscomproduct161digital-humanities-pedagogyndashpracticesndashprinciples-and-politics
Maidment B (2013) Comedy Caricature and the SocialOrder 1820-1850 Manchester Manchester UniversityPress
Markey K (2007a) Twenty-five years of end-user search-ing Part 1 research findings Journal of the AmericanSociety for Information Science and Technology 58(8)1071ndash81
Markey K (2007b) Twenty-five years of end-usersearching Part 2 future research directions Journal ofthe American Society for Information Science andTechnology 58(8) 1123ndash30
Murray T (2016)The forecast for special librariesJournal of Library Administration 56(2) 188ndash98
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 9 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
Rethlefsen M Farrell A Osterhaus Trzasko L and
Brigham T (2015) Librarian co-authors correlated
with higher quality reported search strategies in general
internal medicine systematic reviews Journal of Clinical
Epidemiology 68(6) 617ndash26
Rockwell G (2017) Question about maximum capacity
of Voyant tools Response to Melissa Terras via email 27
January 2017
Rockwell G and Sinclair S 2016 Hermeneutica
Computer-Assisted Interpretation in the Humanities
Cambridge MIT Press
Roper T (2015) The impact of the clinical librarian a
review Journal of EAHIL 11(4) 19ndash22
Schmidt B (2014) Shipping maps and how states see
Sapping Attention Blog httpsappingattentionblogspot
couk201403shipping-maps-and-how-states-seehtml
Sinclair S (2017) Question about maximum capacity of
Voyant tools Response to Melissa Terras via email 27
January 2017
Smith D A Cordell R and Dillon E M (2013)
Infectious texts modeling text reuse in nineteenth-
century newspapers In IEEE International Conference
on Big Data October 6-9 2013 Santa Clara
California IEEE pp 86ndash94 101109
BigData20136691675
Smith D Cordell R and Mullen A (2015)
Computational methods for uncovering reprinted
texts in Antebellum newspapers American Literary
History 27(3)
Starr S S and Renford B L (1987) Evaluation of a
program to teach health professionals to search
MEDLINE Bulletin of the Medical Library Association
75(3) 193ndash201
Stijnman A (2012) Engraving and Etching 1400-2000 A
History of the Development of Manual Intaglio
Printmaking Processes London Houten Netherlands
Archetype Publications in association with HES and
DE GRAAF Publishers
Straus S Eisinga A and Sackett D (2016) What
drove the evidence cart Bringing the library to the
bedside Journal of the Royal Society of Medicine
109(6) 241ndash7
Streipe T and Talley M (2013) Embedded librarian-
ship In Kroski E (ed) Law Librarianship in the Digital
Age Plymouth Scarecrow pp 13ndash30
Terras M (2009) Potentials and Problems in Applying High
Performance Computing for Research in the Arts and
Humanities Researching e-Science Analysis of Census
Holdings Digital Humanities Quarterly httpwwwdigi-
talhumanitiesorgdhqvol34000070000070html
Terras M (2015) Opening access to collections the
making and using of open digitised cultural content
Online Information Review 39(5) 733ndash52 httpwww
emeraldinsightcomdoifull101108OIR-06-2015-0193
Thomas J (2004) Pictorial Victorians The Inscription of
Values in Word and Image Athens Ohio University
Press
Voss A Asgari-Targhi M Procter R and
Fergusson D (2010) Adoption of e-Infrastructure
services configurations of practice Philosophical
Transactions Series A Mathematical Physical and
Engineering Sciences 368 4161ndash76 doi 101098
rsta20100162
Wall A J (1893) Asiatic cholera its history pathology
and modern treatment London H K Lewis
Wittek S Sinclair S and Milner M (2015) DREaM
Distant Reading Early Modernity Digital Humanities
2015 Global Digital Humanities Sydney Australia 29
Junendash3 July 2015 httpdh2015orgabstractsxml
WITTEK_Stephen_DREaM__Distant_Reading_Early_
ModerWITTEK_Stephen_DREaM__Distant_Reading_
Early_Modernityhtml
Wynne M (2013) The role of Clarin in digital trans-
formations in the humanities International Journal of
Humanities and Arts Computing 7(1-2) 89
Wynne M (2015) Big Data and Digital Transformations in
the Humanities are we there yet Textual Digital
Humanities and Social Sciences Data Aberdeen 21ndash22
September 2015 httpwwwslidesharenetmartinwynne
big-data-and-digital-transformations-in-the-humanities
Notes1 httpswwwbluksubjectsdigital-scholarship2 httpwwwuclacukdh3 httpwwwbartlettuclacukcasa4 httpwwwuclacukisdservicesresearch-it5 httpopenglamorg an initiative to promote free
and open access to digital cultural heritage data sets6 For example Mining the History of Medicine (http
nactemacukhom) or Language of the State of the
Union (httpwwwtheatlanticcompoliticsarchive
201501the-language-of-the-state-of-the-union
384575)7 For example CLARIN (httpclarineu) and
DARIAH (httpswwwdariaheu)
M Terras et al
10 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
8 The British Library has various digital data sets
including (but not limited to) 7 million pages of his-
toric newspapers 1 million out of copyright book il-
lustrations 100000s of scientific articles text from
over 60000 books 1000s of UK theses and various
digitized medieval manuscripts We chose here just
one of its large-scale data sets to work with in this
pilot phase For the terms under which the British
Library makes collections available see httpwww
blukaboutustermscopyright9 The books cover a wide range of subject areas includ-
ing philosophy history poetry and literature most
are in English For a full list of metadata for this book
collection see doi 1021250DB2110 httpswikircuclacukwikiLegion just one of the
HPC facilities available at UCL for researchers see
httpwwwuclacukisdservicesresearch-itresearch-
computing11 It is difficult to put a figure on what a manageable text
file size would be for a researcher if we take that as
being able to search through it on their own machine
without further technical assistance This is dependent
on access to processing power which varies consider-
ably between desktop and laptop computers and the
complexity of the search the researcher is intending to
carry out as well as the format and granularity of the
source text Most humanities researchers should now
be able to comfortably manipulate text files of around
100 MB if they have access to a modern machine the
amount of text those files will contain is dependent on
how complex the file formats are Programs and tools
available to assist in text analysis include Voyant Tools
(httpsvoyant-toolsorg) and R (httpswwwr-pro-
jectorg) The upper limit to using server-based
Voyant is 20 MB of text which should correlate to
around 1 million words depending on format
(Sinclair 2017) Voyant Server can be downloaded
and used locally with an upper limit of around
100 MB on a standard PCMac (Rockwell 2017)
Indexing engines such as Lucene (httplucene
apacheorg) can search larger amounts of text how-
ever visualization of results will require further pro-
cessing and can be complex For further exploration of
lsquocomputer-assisted interpretation in the humanitiesrsquo
(particularly using Voyant) see Rockwell and
Sinclair (2016) and an example of the approach of
generating manageable smaller corpora from a large-
scale Early Modern Sources can be found in Wittek et
al (2015)12 All code data visualizations and other outputs from
this pilot project are freely available at httpsgithub
comUCL-dataspring13 httpswwwuclacukdispeopleoliverdukewilliams14 httpswwwshefacukhistoryresearchstudentswil-
liam-finley15 For more on the UKrsquos eScience infrastructure see the
work of the eScience Institute httpwwwesiacuk
Plan-Europe is the Platform of National eScience
Centres in Europe (httpplan-europeeu) In the
USA the equivalent of eScience is known as
Cyberinfrastructure see the National Science
Foundationrsquos guides httpwwwnsfgovdivindex
jspdivfrac14ACI
16 httpccschemuclacuk17 This approach is similar to that employed in the 1980s
when computerized searching of specialist databases
was first available (Farber and Shoham 2002
Markey 2007a 2007b) At that time evaluations car-
ried out on programs to cascade train end users via
librarians reported increased user uptake and effi-
ciency as compared to direct end-user training by
database providers (see Starr and Renford 1987
Bradigan and Mularski 1989 Ikeda and Schwartz
1992 on the library-based training of medical staff to
use MEDLINE) Significantly studies showed that
while end users wanted to be able to search databases
themselves they also desired access to searches
mediated by librarians andor a team approach to
search utilizing the end userrsquos specific knowledge of
the language and jargon of their discipline and the
information professionalrsquos understanding of
BOOLEAN and other advanced search techniques
(Ludwig et al 1988 Lancaster et al 1994 Bates
1996) The move towards embedded librarians in
law practices (Feliu and Frazer 2012 Streipe and
Talley 2013 Murray 2016) and clinical librarians in
hospitals (Brettle et al 2011 Roper 2015 Straus et
al 2016) can be seen to be a modern iteration of a
team approach to search In Higher Education subject
librarians are well-positioned to form part of a research
team using computational methods andor provide
information literacy training that upskills humanists
in computational methods (Rethlefsen et al 2015
Bedi and Walde 2016 Burke and Tumbleson 2016)
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 11 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
Rethlefsen M Farrell A Osterhaus Trzasko L and
Brigham T (2015) Librarian co-authors correlated
with higher quality reported search strategies in general
internal medicine systematic reviews Journal of Clinical
Epidemiology 68(6) 617ndash26
Rockwell G (2017) Question about maximum capacity
of Voyant tools Response to Melissa Terras via email 27
January 2017
Rockwell G and Sinclair S 2016 Hermeneutica
Computer-Assisted Interpretation in the Humanities
Cambridge MIT Press
Roper T (2015) The impact of the clinical librarian a
review Journal of EAHIL 11(4) 19ndash22
Schmidt B (2014) Shipping maps and how states see
Sapping Attention Blog httpsappingattentionblogspot
couk201403shipping-maps-and-how-states-seehtml
Sinclair S (2017) Question about maximum capacity of
Voyant tools Response to Melissa Terras via email 27
January 2017
Smith D A Cordell R and Dillon E M (2013)
Infectious texts modeling text reuse in nineteenth-
century newspapers In IEEE International Conference
on Big Data October 6-9 2013 Santa Clara
California IEEE pp 86ndash94 101109
BigData20136691675
Smith D Cordell R and Mullen A (2015)
Computational methods for uncovering reprinted
texts in Antebellum newspapers American Literary
History 27(3)
Starr S S and Renford B L (1987) Evaluation of a
program to teach health professionals to search
MEDLINE Bulletin of the Medical Library Association
75(3) 193ndash201
Stijnman A (2012) Engraving and Etching 1400-2000 A
History of the Development of Manual Intaglio
Printmaking Processes London Houten Netherlands
Archetype Publications in association with HES and
DE GRAAF Publishers
Straus S Eisinga A and Sackett D (2016) What
drove the evidence cart Bringing the library to the
bedside Journal of the Royal Society of Medicine
109(6) 241ndash7
Streipe T and Talley M (2013) Embedded librarian-
ship In Kroski E (ed) Law Librarianship in the Digital
Age Plymouth Scarecrow pp 13ndash30
Terras M (2009) Potentials and Problems in Applying High
Performance Computing for Research in the Arts and
Humanities Researching e-Science Analysis of Census
Holdings Digital Humanities Quarterly httpwwwdigi-
talhumanitiesorgdhqvol34000070000070html
Terras M (2015) Opening access to collections the
making and using of open digitised cultural content
Online Information Review 39(5) 733ndash52 httpwww
emeraldinsightcomdoifull101108OIR-06-2015-0193
Thomas J (2004) Pictorial Victorians The Inscription of
Values in Word and Image Athens Ohio University
Press
Voss A Asgari-Targhi M Procter R and
Fergusson D (2010) Adoption of e-Infrastructure
services configurations of practice Philosophical
Transactions Series A Mathematical Physical and
Engineering Sciences 368 4161ndash76 doi 101098
rsta20100162
Wall A J (1893) Asiatic cholera its history pathology
and modern treatment London H K Lewis
Wittek S Sinclair S and Milner M (2015) DREaM
Distant Reading Early Modernity Digital Humanities
2015 Global Digital Humanities Sydney Australia 29
Junendash3 July 2015 httpdh2015orgabstractsxml
WITTEK_Stephen_DREaM__Distant_Reading_Early_
ModerWITTEK_Stephen_DREaM__Distant_Reading_
Early_Modernityhtml
Wynne M (2013) The role of Clarin in digital trans-
formations in the humanities International Journal of
Humanities and Arts Computing 7(1-2) 89
Wynne M (2015) Big Data and Digital Transformations in
the Humanities are we there yet Textual Digital
Humanities and Social Sciences Data Aberdeen 21ndash22
September 2015 httpwwwslidesharenetmartinwynne
big-data-and-digital-transformations-in-the-humanities
Notes1 httpswwwbluksubjectsdigital-scholarship2 httpwwwuclacukdh3 httpwwwbartlettuclacukcasa4 httpwwwuclacukisdservicesresearch-it5 httpopenglamorg an initiative to promote free
and open access to digital cultural heritage data sets6 For example Mining the History of Medicine (http
nactemacukhom) or Language of the State of the
Union (httpwwwtheatlanticcompoliticsarchive
201501the-language-of-the-state-of-the-union
384575)7 For example CLARIN (httpclarineu) and
DARIAH (httpswwwdariaheu)
M Terras et al
10 of 11 Digital Scholarship in the Humanities 2017
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
8 The British Library has various digital data sets
including (but not limited to) 7 million pages of his-
toric newspapers 1 million out of copyright book il-
lustrations 100000s of scientific articles text from
over 60000 books 1000s of UK theses and various
digitized medieval manuscripts We chose here just
one of its large-scale data sets to work with in this
pilot phase For the terms under which the British
Library makes collections available see httpwww
blukaboutustermscopyright9 The books cover a wide range of subject areas includ-
ing philosophy history poetry and literature most
are in English For a full list of metadata for this book
collection see doi 1021250DB2110 httpswikircuclacukwikiLegion just one of the
HPC facilities available at UCL for researchers see
httpwwwuclacukisdservicesresearch-itresearch-
computing11 It is difficult to put a figure on what a manageable text
file size would be for a researcher if we take that as
being able to search through it on their own machine
without further technical assistance This is dependent
on access to processing power which varies consider-
ably between desktop and laptop computers and the
complexity of the search the researcher is intending to
carry out as well as the format and granularity of the
source text Most humanities researchers should now
be able to comfortably manipulate text files of around
100 MB if they have access to a modern machine the
amount of text those files will contain is dependent on
how complex the file formats are Programs and tools
available to assist in text analysis include Voyant Tools
(httpsvoyant-toolsorg) and R (httpswwwr-pro-
jectorg) The upper limit to using server-based
Voyant is 20 MB of text which should correlate to
around 1 million words depending on format
(Sinclair 2017) Voyant Server can be downloaded
and used locally with an upper limit of around
100 MB on a standard PCMac (Rockwell 2017)
Indexing engines such as Lucene (httplucene
apacheorg) can search larger amounts of text how-
ever visualization of results will require further pro-
cessing and can be complex For further exploration of
lsquocomputer-assisted interpretation in the humanitiesrsquo
(particularly using Voyant) see Rockwell and
Sinclair (2016) and an example of the approach of
generating manageable smaller corpora from a large-
scale Early Modern Sources can be found in Wittek et
al (2015)12 All code data visualizations and other outputs from
this pilot project are freely available at httpsgithub
comUCL-dataspring13 httpswwwuclacukdispeopleoliverdukewilliams14 httpswwwshefacukhistoryresearchstudentswil-
liam-finley15 For more on the UKrsquos eScience infrastructure see the
work of the eScience Institute httpwwwesiacuk
Plan-Europe is the Platform of National eScience
Centres in Europe (httpplan-europeeu) In the
USA the equivalent of eScience is known as
Cyberinfrastructure see the National Science
Foundationrsquos guides httpwwwnsfgovdivindex
jspdivfrac14ACI
16 httpccschemuclacuk17 This approach is similar to that employed in the 1980s
when computerized searching of specialist databases
was first available (Farber and Shoham 2002
Markey 2007a 2007b) At that time evaluations car-
ried out on programs to cascade train end users via
librarians reported increased user uptake and effi-
ciency as compared to direct end-user training by
database providers (see Starr and Renford 1987
Bradigan and Mularski 1989 Ikeda and Schwartz
1992 on the library-based training of medical staff to
use MEDLINE) Significantly studies showed that
while end users wanted to be able to search databases
themselves they also desired access to searches
mediated by librarians andor a team approach to
search utilizing the end userrsquos specific knowledge of
the language and jargon of their discipline and the
information professionalrsquos understanding of
BOOLEAN and other advanced search techniques
(Ludwig et al 1988 Lancaster et al 1994 Bates
1996) The move towards embedded librarians in
law practices (Feliu and Frazer 2012 Streipe and
Talley 2013 Murray 2016) and clinical librarians in
hospitals (Brettle et al 2011 Roper 2015 Straus et
al 2016) can be seen to be a modern iteration of a
team approach to search In Higher Education subject
librarians are well-positioned to form part of a research
team using computational methods andor provide
information literacy training that upskills humanists
in computational methods (Rethlefsen et al 2015
Bedi and Walde 2016 Burke and Tumbleson 2016)
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 11 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017
8 The British Library has various digital data sets
including (but not limited to) 7 million pages of his-
toric newspapers 1 million out of copyright book il-
lustrations 100000s of scientific articles text from
over 60000 books 1000s of UK theses and various
digitized medieval manuscripts We chose here just
one of its large-scale data sets to work with in this
pilot phase For the terms under which the British
Library makes collections available see httpwww
blukaboutustermscopyright9 The books cover a wide range of subject areas includ-
ing philosophy history poetry and literature most
are in English For a full list of metadata for this book
collection see doi 1021250DB2110 httpswikircuclacukwikiLegion just one of the
HPC facilities available at UCL for researchers see
httpwwwuclacukisdservicesresearch-itresearch-
computing11 It is difficult to put a figure on what a manageable text
file size would be for a researcher if we take that as
being able to search through it on their own machine
without further technical assistance This is dependent
on access to processing power which varies consider-
ably between desktop and laptop computers and the
complexity of the search the researcher is intending to
carry out as well as the format and granularity of the
source text Most humanities researchers should now
be able to comfortably manipulate text files of around
100 MB if they have access to a modern machine the
amount of text those files will contain is dependent on
how complex the file formats are Programs and tools
available to assist in text analysis include Voyant Tools
(httpsvoyant-toolsorg) and R (httpswwwr-pro-
jectorg) The upper limit to using server-based
Voyant is 20 MB of text which should correlate to
around 1 million words depending on format
(Sinclair 2017) Voyant Server can be downloaded
and used locally with an upper limit of around
100 MB on a standard PCMac (Rockwell 2017)
Indexing engines such as Lucene (httplucene
apacheorg) can search larger amounts of text how-
ever visualization of results will require further pro-
cessing and can be complex For further exploration of
lsquocomputer-assisted interpretation in the humanitiesrsquo
(particularly using Voyant) see Rockwell and
Sinclair (2016) and an example of the approach of
generating manageable smaller corpora from a large-
scale Early Modern Sources can be found in Wittek et
al (2015)12 All code data visualizations and other outputs from
this pilot project are freely available at httpsgithub
comUCL-dataspring13 httpswwwuclacukdispeopleoliverdukewilliams14 httpswwwshefacukhistoryresearchstudentswil-
liam-finley15 For more on the UKrsquos eScience infrastructure see the
work of the eScience Institute httpwwwesiacuk
Plan-Europe is the Platform of National eScience
Centres in Europe (httpplan-europeeu) In the
USA the equivalent of eScience is known as
Cyberinfrastructure see the National Science
Foundationrsquos guides httpwwwnsfgovdivindex
jspdivfrac14ACI
16 httpccschemuclacuk17 This approach is similar to that employed in the 1980s
when computerized searching of specialist databases
was first available (Farber and Shoham 2002
Markey 2007a 2007b) At that time evaluations car-
ried out on programs to cascade train end users via
librarians reported increased user uptake and effi-
ciency as compared to direct end-user training by
database providers (see Starr and Renford 1987
Bradigan and Mularski 1989 Ikeda and Schwartz
1992 on the library-based training of medical staff to
use MEDLINE) Significantly studies showed that
while end users wanted to be able to search databases
themselves they also desired access to searches
mediated by librarians andor a team approach to
search utilizing the end userrsquos specific knowledge of
the language and jargon of their discipline and the
information professionalrsquos understanding of
BOOLEAN and other advanced search techniques
(Ludwig et al 1988 Lancaster et al 1994 Bates
1996) The move towards embedded librarians in
law practices (Feliu and Frazer 2012 Streipe and
Talley 2013 Murray 2016) and clinical librarians in
hospitals (Brettle et al 2011 Roper 2015 Straus et
al 2016) can be seen to be a modern iteration of a
team approach to search In Higher Education subject
librarians are well-positioned to form part of a research
team using computational methods andor provide
information literacy training that upskills humanists
in computational methods (Rethlefsen et al 2015
Bedi and Walde 2016 Burke and Tumbleson 2016)
Analysis of large-scale digital collections
Digital Scholarship in the Humanities 2017 11 of 11
Downloaded from httpsacademicoupcomdsharticle-abstractdoi101093llcfqx0203789810by The University of Edinburgh useron 03 November 2017