
LG-71-17-0094-17 Indiana University Data To Insight Center

Abstract

Over the last decade, society has seen a nearly exponential increase in the volume of digital content. In response, researchers and educators see the potential that Big Data techniques bring to the computational exploration of cultural and scholarly digital collections for organizing, accessing, and analyzing content. Libraries have long made a mission of provisioning access services to digital content to enrich and improve the lives of all Americans. When digital collections have access restrictions, however, provisioning services becomes a challenge.

We respond to this challenge with the Data Capsule service, developed in the HathiTrust Research Center, which enables remote access to restricted digital data in the HathiTrust Digital Library. Data Capsule is architected to be modular and uses application programming interfaces (APIs) for communication; this best practice in systems design, plus the proposed effort in packaging, will allow for faster integration into a new environment and ready contributions by third parties.

In this project, we intend to partner with 8 academic libraries across the country in a multi-method research project that draws from human-computer interaction and experimental computer science to:

• Understand current library needs and practices in provisioning library services for computational access to special collections that have constraints due to sensitivity or restrictions

• Extend the Data Capsule service to the broader needs of provisioning analytical access to restricted collections across a range of collections and uses

• Study extensions of Data Capsule to cloud computing environments for broader uses

• Identify gaps in the skills librarians need to enable secure data analytics, and provide resources that can address those gaps

This project proposal, responsive to the IMLS National Leadership Grants for Libraries program, is planned as a 2-year effort. If funded, it will be carried out under the encompassing framework of Participatory Design and will involve funded partners at Indiana University, University of Illinois, University of California at Berkeley, and University of Virginia, and engaged partners at Indiana University, Lafayette College, MIT, Rutgers University, Swarthmore College, and UCLA.

In response to reviewer feedback, we increased the number of library partners in the project from 3-5 to 8 and introduced a two-tiered partner model. Level 1 partners (2) receive direct funding through the grant. Level 2 partners (6) receive travel funds built into the Indiana University grant to participate in a regional community-building event. The change resulted in an increase in the overall budget of about 15% from the pre-proposal.

Sustainability is planned through utilizing an existing operational service, growing its adopter community (libraries), and extending it to broader collections and use cases. The service itself is grounded in the HathiTrust Research Center (HTRC), which continues to support and endorse the Data Capsule service as its primary service for computational analysis on the nearly 15 million volumes of the HathiTrust Digital Library. HTRC deeply welcomes this initiative to involve more partners as users and sustainers of the software code base.

Data Capsule Appliance for Research Analysis of Restricted and Sensitive Data in Academic Libraries

1. Statement of National Need

Over the last decade, society has seen a nearly exponential increase in the volume of digital content [1]. On the cultural side, the new content is coming into existence through massive digitization efforts [2] or because content is increasingly born digital. Libraries have long made a mission of provisioning access services to digital content to enrich and improve the lives of all Americans [3]. When digitized collections (of letters, government papers, video clips, institutional records, annotated volumes) have access restrictions, however, provisioning services becomes a challenge. Collections can have access restrictions for a number of reasons: a set of papers that have not been properly accessioned; a collection of videos with mixed in-copyright and public domain content; material donated by a prominent researcher that contains sensitive information from ethnographic studies on aboriginal peoples. The data-side push for new services to meet the challenge of restricted and sensitive collections is being met with a corresponding end-user pull, as researchers and educators discover the potential that Big Data techniques bring to the humanities [4] and other areas, and begin to envision opportunities in their own research spheres to explore both small and large collections of materials computationally for organizing, accessing, and analyzing content.

Traditional library services often inadequately address end-user needs when a collection of materials is restricted or deemed to contain sensitive data. Secure data enclave pilots allow researchers to work with this unique type of data [5]–[9]. Yet such enclaves are often limited to analysis of microdata through common statistical packages, which makes them less suited to other uses: there are hundreds of different computational content mining tools, and the text analysis portal TAPoR, for example, lists 493 of them [10]. Additionally, enclaves are frequently custom-built for a collection, or for a small set of centrally located collections, so the solution is not easily portable to new institutions or collections.

Drawing on the most pressing themes of trust, access, infrastructure, and skills in providing data services [11], the overarching goal of this project is threefold: understand current library needs and practices in provisioning services for computational access to special collections; extend an existing service to enable intuitive yet secure computational access to restricted data in libraries; and identify gaps in the skills librarians need to enable secure data analytics and provide resources that can address those gaps. We aim to build upon a service developed in the HathiTrust Research Center (HTRC) that enables end users to remotely access the HathiTrust Digital Library for computational use. We propose, as part of this grant, to package the service as an appliance so that it can be easily installed in a library’s technological environment, and to extend the service to satisfy scenarios of different collections and end-user needs driven by our library partners. The service is called Data Capsule [12], [13], and it derives from theoretical work on a concept called “storage capsules” [14]. Through a grant from the Alfred P. Sloan Foundation (2011-2015), the author of storage capsules, Atul Prakash, along with Plale and McDonald (the latter two are leads on this proposal), developed the storage capsule concept into the working Data Capsule service, which became available in HTRC in 2015. The service in HTRC utilizes a tool called the Workset [15], which maintains an end user’s context.

Building on the earlier work on the Data Capsule (DC) service, we propose to extend and evaluate the system under the encompassing framework of Participatory Design with partners from eight libraries across the country. These partners have committed to serving either as Level 1 testing partners, eager to engage in hands-on evaluation, or as Level 2 partners, ready to participate in discussions and studies. We propose, through this Participatory Design framework, to extend the service to:

• Be packaged as an appliance that can be run and managed locally at partner institutions

• Generalize the Data Capsule service to connect to broader types of restricted collections

• Deliver extensions to the Data Capsule service and Workset model that reflect partner needs obtained through intense partner engagement

• Deliver a design of Data Capsule that utilizes high-performance and cloud computing resources and accommodates both the large-scale needs of some partners and partners with lighter technology resources available to them

As the Data Capsule service is architected using principles of well-defined APIs and software component modularity, it is highly suited to extension and generalization for broader use.

The conceptual framework guiding the architecture of Data Capsule (DC) in its current form can be explained in the context of fair use. Legal judgments of fair use have repeatedly returned to two key analytical questions [16]. First, “did the use ‘transform’ the material taken from the copyrighted work by using it for a broadly beneficial purpose different from that of the original or did it just repeat the work for the same intent and value as the original?” And second, “Was the material taken appropriate in kind and amount, considering the nature of the copyrighted work and of the use?” In DC, the transforming work is carried out by an end user within a Capsule that they have at their disposal for an extended period of weeks to months. The service then enforces both questions as follows:

• Use is appropriate: the DC service assesses appropriateness of the content exported from a Capsule:
    o Unintentional exportation, such as through malware, is stopped
    o Intentional exportation is reviewed through manual (or, in future, automatic) results review

• Amount of data used is appropriate: the amount of data used in the creation of all exported data products is below a threshold of appropriateness of use

• Data types: the type of data used in the creation of new content is allowable for the need

• Intent is reasonable and identity is proven: through structures of policy and institutional infrastructure

• When a Capsule is used for analytical purposes, acceptable activities include but are not limited to a) image analysis and text extraction, b) textual analysis and information extraction, c) linguistic analysis, d) automated translation and language translation, and e) indexing and search.

Data Capsule thus enables transformative use of restricted and sensitive collections through a service that will be packaged as an appliance, will have options for hooking to a new collection with relative ease, and will provide the needed assurances that the actions allowable by the service will protect the collection.
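The export checks listed above can be pictured with a small sketch. It is an illustration only, not the Data Capsule implementation: the proposal does not say how the amount of data used is measured, so the `source_items_used` count, the `max_source_items` threshold, and all class and field names are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ExportRequest:
    user: str
    files: list                 # results the user placed in the export directory
    source_items_used: int      # hypothetical measure of restricted data consumed
    status: str = "pending"     # pending -> approved | rejected


@dataclass
class ExportReviewer:
    max_source_items: int = 1000                      # assumed policy threshold
    used_so_far: dict = field(default_factory=dict)   # user -> items used in approved exports
    queue: list = field(default_factory=list)         # requests awaiting manual review

    def submit(self, request):
        """Reject requests that would exceed the threshold; queue the rest."""
        total = self.used_so_far.get(request.user, 0) + request.source_items_used
        if total > self.max_source_items:
            request.status = "rejected"
        else:
            self.queue.append(request)
        return request.status

    def decide(self, request, approve):
        """Record a human reviewer's decision on a queued request."""
        self.queue.remove(request)
        if approve:
            request.status = "approved"
            self.used_so_far[request.user] = (
                self.used_so_far.get(request.user, 0) + request.source_items_used
            )
        else:
            request.status = "rejected"
        return request.status
```

The threshold is applied cumulatively per user because the text speaks of the data used in the creation of all exported data products; a different accounting rule would change only `submit` and `decide`.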

2. Project Design

The project is structured to bring together three distinct, complementary bodies of expertise: human-computer interaction expertise in community engagement, participatory design, and social-technical interactions (Kouper); computer science and technology expertise in data-driven architectures, data models, and trust (Plale, McDonald, and Downie); and library partners with expertise in technology services for special collections (Mitchell and Unsworth). The multidisciplinary team is critical to bringing about a project of this nature.

The library partnership is designed at two levels. Level 1 Testing Partners identify a collection and an end-user need, and work with the Data Capsule team to implement a proof-of-concept demonstration for the collection. Level 1 Testing Partners also participate in the assessment, user study, and participatory activities. They include the libraries of University of California Berkeley and University of Virginia. Level 2 Partners engage in the assessment and user study, and contribute to participatory activities. Level 2 partners include the libraries of Lafayette College, Indiana University, MIT, Rutgers, Swarthmore, and UCLA.

2.1 Goals, methods, assumptions, and risks

The broad goal of this project will be accomplished through synergistic and mutually reinforcing activity in its two major foci of expertise: participatory, design-oriented partner engagement, and software architecture and evaluation. The nature of the project is iterative within and between the two foci of expertise: “explore, approximate, and refine” [17].

Research methodologies: The project will employ research methodologies from two domains: human-computer interaction, to accomplish the goals of assessment, partner engagement, and evaluation; and experimental computer science, to advance the Data Capsule design and Workset. This multi-method approach to research is increasingly important in successful technology adoption: active all-stakeholder engagement at the early stages ensures a good fit on the human capital side, and the experimental computer science ensures a good fit on the technological side. The methodologies of each are described in more detail below.

Project risks: Low library partner participation is a potential project risk. We addressed this risk during development of the full proposal by devoting substantially more resources to the library partners. We increased the number of library partners in the project from 3-5 to 8 and introduced a two-tiered partner model. Level 1 partners (2) receive funding through a subcontract that they use for engagement of technical or collections expertise. We additionally built funding into the Indiana University budget to fund travel for Level 2 partners (6) to participate in a regional community-building event. The change resulted in an increase in the overall budget of about 15% from the pre-proposal. We thought this action a necessary risk mitigation strategy. Our project has already built into it a program for constant support and interaction with the library partners on both levels to ensure the highest possible participation.

Assumptions: Our project has several assumptions, all of which we think are reasonable expectations in the environments of major academic libraries, though further study will be carried out for less well-equipped libraries. Data Capsule is an environment (a set of software services plus policies) that utilizes a cluster of computers located within a secure network. The codebase is modular and utilizes Application Programming Interfaces (APIs) for extensibility and interoperability. A Data Capsule Controller runs on one of the cluster nodes. From there, it allocates to an end user a Capsule, a virtual computer (virtual machine) that runs on one of the other nodes in the cluster. The Data Capsule service implementation will be extended in this project to utilize the virtualization library Libvirt¹, which allows DC to configure an end user’s Capsule for secure access. Our implementation plan thus assumes that i) Level 1 testing partners support a library such as Libvirt running on their testing servers, ii) programmatic access to a collection is available through an API, and iii) a trusted service exists in the library environment through which user authentication can be carried out.
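The allocation step described above can be pictured with a small sketch. It is not the HTRC implementation: the class name, the node names, and the least-loaded placement rule are assumptions made for illustration, and an in-memory dictionary stands in for whatever store the real service uses.

```python
class CapsuleController:
    """Toy model of a controller that places Capsules on cluster nodes."""

    def __init__(self, controller_node, cluster_nodes):
        self.controller_node = controller_node
        # Capsules run on the other nodes, never on the controller's own node.
        self.load = {n: 0 for n in cluster_nodes if n != controller_node}
        self.capsules = {}      # capsule id -> record (hypothetical schema)
        self._next_id = 1

    def allocate(self, user):
        """Assign a new Capsule for `user` to the least-loaded hosting node."""
        if not self.load:
            raise RuntimeError("no hosting nodes available")
        node = min(self.load, key=self.load.get)   # assumed placement policy
        capsule_id = f"capsule-{self._next_id}"
        self._next_id += 1
        self.load[node] += 1
        # A real controller would ask the virtualization layer on `node`
        # to define and start the virtual machine at this point.
        self.capsules[capsule_id] = {"user": user, "node": node}
        return capsule_id


controller = CapsuleController("node0", ["node0", "node1", "node2"])
first = controller.allocate("alice")
second = controller.allocate("bob")
print(controller.capsules[first]["node"], controller.capsules[second]["node"])
```

Keeping placement in one method means a partner site with a single hosting server, or one that bursts to a cloud provider, would only need a different rule inside `allocate`.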

Research Framework 1: The framework of Participatory Design (PD) informs the research questions and methodologies of the human-computer interaction research. A theoretical framework and a set of practices, PD explores conditions for deep user engagement in the design and implementation of computer-based systems at work [18]. User empowerment and democratic decision-making are crucial for successful PD, as one of the main assumptions is that technology is being designed to facilitate skilled work and to enhance rather than completely replace human labor [19]. Libraries recognize the need to engage their end users in the design of library spaces and technologies [20], [21]. We raise the question of how librarians themselves can be involved in the co-design of tools that use and enhance their skill sets while, at the same time, enabling library end users.

Research Framework 2: Experimental computer science as a discipline and methodology forms the framework for assessing and advancing the technological aspects of the project. Through iterative design and prototyping, we reflect user needs in the software development process. Through carefully controlled comparative evaluation studies that are designed to include performance evaluation, we accurately assess different technological tradeoffs. These studies, which are of a quality so as to be published in archival venues, contribute to the diffusion of the project results more broadly through libraries and through time.

Data Capsule is an environment that utilizes a cluster of computers located within a secure network. Capsules have two modes of running. In the first, an open mode, a user can upload tools, data, and software of their choice; access to the restricted or sensitive data is blocked. In the second, a closed mode, all access to the Internet is blocked and the channels to the restricted data are opened. This is where the tools that need to work with the sensitive data can be started up. Upon completion of a task, the user stores the results they wish to export in a special directory, where they are queued for manual review; upon successful review, the user is sent a URL from which the results can be downloaded.
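The two running modes amount to a small state machine in which exactly one of the two channels is open at a time. The sketch below illustrates that rule only; the class, method names, and placeholder return value are hypothetical, and a real Capsule enforces the rule at the network and virtual machine level rather than in application code.

```python
from enum import Enum


class Mode(Enum):
    OPEN = "open"        # Internet reachable, restricted data blocked
    CLOSED = "closed"    # Internet blocked, restricted data reachable


class Capsule:
    def __init__(self):
        self.mode = Mode.OPEN
        self.tools = []
        self.export_queue = []   # results awaiting manual review

    @property
    def internet_allowed(self):
        return self.mode is Mode.OPEN

    @property
    def restricted_data_allowed(self):
        return self.mode is Mode.CLOSED

    def enter_closed_mode(self):
        self.mode = Mode.CLOSED

    def enter_open_mode(self):
        self.mode = Mode.OPEN

    def install_tool(self, name):
        """Uploading tools needs the outside world, so it is open-mode only."""
        if not self.internet_allowed:
            raise PermissionError("uploads are blocked in closed mode")
        self.tools.append(name)

    def read_restricted(self, item_id):
        """Reading the collection is closed-mode only."""
        if not self.restricted_data_allowed:
            raise PermissionError("restricted data is blocked in open mode")
        return f"contents of {item_id}"   # placeholder for a collection API call

    def queue_export(self, path):
        """Results leave the Capsule only through the review queue."""
        self.export_queue.append(path)
```

A typical session under these assumptions installs tools in open mode, switches to closed mode to read the collection, and calls `queue_export` for anything the user wants reviewed.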

The existing Data Capsule system will be migrated to utilize the Libvirt virtualization toolkit. The Data Capsule Controller is delivered as either a virtual machine image or multiple Docker containers, together with a set of configuration files for partners to customize for their particular environment. The Data Capsule Controller expects two communication endpoints from the partner site: APIs and a corresponding SDK/toolkit that can securely access the data collection to be used from capsules; and a trusted relay of user authentication/authorization information to the Data Capsule Controller. Libvirt daemons are required to be running on all Data Capsule hosting servers. The Data Capsule Controller will provide RESTful APIs and a basic administration dashboard so that a partner site can build a customized front-end user interface. A separate database is needed to store the status of Capsules and their activities, as well as user computation results for the whole system. The Data Capsule Controller is expected to be lightweight enough to run as a single VM. The Docker container approach could provide further flexibility in packaging components and lower system resource consumption, albeit at the cost of a more complicated deployment [22], [23].

¹ The virtualization API: https://libvirt.org; runs on Linux, Windows, OS X, FreeBSD

One of the important tools in the Data Capsule environment is the Workset. As restricted collections cannot be moved outside of their secure storage and processing environment, users need a mechanism to save a persistent context of their sources that holds information about the state of their activities. HTRC uses the notion of the Workset: a machine-actionable personal research collection, described using the Resource Description Framework (RDF), that consists of references to digital objects (e.g., volumes, pages, and so on) and metadata [18]. The Workset model combines pointers to, and metadata about, the generated resources and their selection procedures, as well as metadata about the bibliographic resources that went into its creation. It provides context and continuity through the research lifecycle, from its conception and creation to archiving, citation, and use by other researchers.
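To make the idea of a collection of references concrete, here is a minimal sketch that stores member URIs plus a little metadata and writes them out as RDF in N-Triples syntax. The HTRC Workset vocabulary is not given in this proposal, so the sketch borrows Dublin Core terms as stand-in predicates; the class, the example.org URIs, and the field names are all hypothetical.

```python
from dataclasses import dataclass, field

DCTERMS = "http://purl.org/dc/terms/"   # stand-in vocabulary, not the HTRC one


def _literal(text):
    """Quote a string as an N-Triples literal (escapes backslash and quote only)."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


@dataclass
class Workset:
    uri: str
    title: str
    creator: str
    members: list = field(default_factory=list)   # URIs only, never the content

    def add(self, member_uri):
        """Add a reference to a volume or page, ignoring duplicates."""
        if member_uri not in self.members:
            self.members.append(member_uri)

    def to_ntriples(self):
        """Serialize the Workset as one triple per line."""
        lines = [
            f"<{self.uri}> <{DCTERMS}title> {_literal(self.title)} .",
            f"<{self.uri}> <{DCTERMS}creator> {_literal(self.creator)} .",
        ]
        lines += [f"<{self.uri}> <{DCTERMS}hasPart> <{m}> ." for m in self.members]
        return "\n".join(lines)


ws = Workset("http://example.org/workset/1", "Donated letters", "alice")
ws.add("http://example.org/volume/a")
print(ws.to_ntriples())
```

Because the Workset holds only pointers, it can be saved, cited, and shared outside the secure environment while the restricted material itself stays where it is.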

The research questions/issues that we propose to investigate are:

● What are the uses of restricted collections in the context of delivering computational analytical services? How do collection providers and users construct their needs for transformative uses of the collection?

● How do collection-specific services, policies, and uses affect the design of DC, and how can the DC appliance fit within the library and its technological and organizational models? How do differently positioned actors within an organization influence that?

● Quantify the performance implications of certain design tradeoffs in extending and generalizing the Data Capsule system to meet the needs of a broad set of library uses and environments.

    ○ Include in the study an assessment of tradeoffs when considering libraries with less well-equipped technical infrastructures

● Evaluate the tradeoffs of extending the Data Capsule system to allow user Capsules to utilize high-performance compute resources inside or external to an institution, and to run large analysis tasks.

● Evaluate the different models for Workset use in the Capsule for different use and collection needs.

2.2 Specific activities

Element 1: Assessment

Work with partners to map out collection specifics and the contexts of their use; prioritize needs in co-design and implementation; organize events to bring participants together as a community. Employ parallel theoretical reflection and continuous exchange of knowledge.

Tasks

• The research team interviews partners to gather information about collections and the contexts of their use, and identifies collection-specific characteristics as well as work practices that may impact development and implementation of DC. Access restrictions, storage, security, and analytical needs, as well as the relationships between collection users, stewards, and technical support, will be included. User needs as seen by librarians or taken from previous feedback of actual users (e.g., types of data analysis, tools used) will also be identified.

• Examine policies and other factors that affect the use of restricted data and DC. Collect and analyze documents that govern access to and use of the restricted collections.

• Organize community-building events, possibly co-located with regional HTRC UnCamp events to increase participation; organize regular information-sharing sessions.

Outcomes

• Effective coordination, sharing, and networking with all partners

• Taxonomic knowledge about restricted collections and their policies and contexts of use

• Emerging sense of community

• Community-building meetings

Element 2: Partner Engagement

Engage the technical team, Level 1 testing partners, and Level 2 partners in close cooperation. Level 1 testing partners each have an installation of Data Capsule on an experimental set of machines of their choice.

Tasks

• The technical team and Level 1 testing partners engage in mutual exchange about collection constraints, infrastructure constraints, technology options, and solutions for prototype demonstrations with partner collections. Carry out continuous installation, evaluation, and feedback cycles to refine the prototypes.

• Engage library partners in Participatory Design. Participatory activities and evaluation of the appliance will include a demo of the Data Capsule prototype and Workset reflecting co-designed functionality; installation of Data Capsule at Level 1 partners; continuous installation of extensions at Level 1 partners; and evaluation of improvements for all partners.

• Visit workplaces of Level 1 and Level 2 partners for purposes of information exchange, assessment, and learning.

Outcomes

• Shared knowledge and understanding

• Participant-influenced design of technologies

• Better fit of technology to needs

• Lowered barriers to adoption for partners

Element 3: Data Capsule

Extend the existing Data Capsule service to enable intuitive yet secure computational access to restricted data in libraries. Evaluate extensions through demos, prototyped functionality, and evaluative studies.

Tasks

• Design and develop an architecture for packaging Data Capsule as an appliance

• Extend the Data Capsule system’s architecture to
    i) Enforce proper access to restricted and sensitive collections,
    ii) Support access to multiple collections having diverse formats and types,
    iii) Support the range of use models needed by partners. Implement selective changes in the form of a prototype demo for feedback.

• Design an evaluative study of DC as capable of utilizing high-performance or cloud computing resources to serve institutions with various resources, including less equipped institutions. Carry out performance experiments to evaluate different design tradeoffs

Outcomes

• Extended codebase of Data Capsule, packaged as an appliance, with support for new collection types and use cases. Codebase released with appropriate user and developer documentation.

• Published proof-of-concept study of how Data Capsule can be scaled to use large-scale compute resources at an institution or at a cloud provider such as Amazon Web Services

• Published study of design tradeoffs in enhancements to support new use cases and access modes to restricted and sensitive collections

Element 4: Workset

Evaluate Worksets within the context of the project’s new uses and users to improve the utility and impact of Worksets in the scholarly research process.

Tasks

• Participate in assessment and participatory activities to gather information about the applicability of the current Workset model to specific collections.

• Design and carry out a study that evaluates tradeoffs in extending the Workset model to accommodate the new uses of Data Capsules for computational access to restricted and sensitive collections.

• Bring the Workset to a state in which it can participate in demos showcasing new Data Capsule functionality

• Actively engage library partners in exploring how best to educate users on optimal practices for Workset use and reuse.

Outcomes

• Educational materials for a researcher’s best utilization of the Workset notion in the distant analysis that this project enables

• Publishable study of design tradeoffs for extending the Workset to additional collections and uses

2.3 Project management

The project will be led by Beth A. Plale, with direct oversight and responsibility for project success. Dr. Kouper and Robert McDonald will serve as co-Directors. The leadership team, including J. Stephen Downie at University of Illinois, will meet weekly and be joined once a month by the Level 1 Testing Library partners. Decision making is carried out through consensus building, with the final decision resting with the Project Director.

Dr. Plale also brings technical expertise, and in this capacity will work closely with Dr. Yu (Marie) Ma, Dev/Ops manager of the HathiTrust Research Center, to ensure that the technical staff members are tasked appropriately for the project needs and timelines. Dr. Inna Kouper will lead the project assessment and community-building activities, using Participatory Design methods and in collaboration with partner libraries. Robert H. McDonald will coordinate the partner libraries. Level 1 partner libraries will supervise prototyping and testing of digital collections. J. Stephen Downie will coordinate expertise on the Workset.

Bi-weekly videoconferencing meetings for community building will be held using the Zoom.us conferencing system that IU provides free to its research groups. Technical communication with Level 1 (and Level 2, as interested) partners, which tends to be frequent and short during joint efforts, will utilize a Slack.com channel. Stakeholder interactions will be via regular teleconferences and phone calls. User studies will be conducted online using screen-sharing and recording tools such as Zoom, in addition to in-person visits.

Library partners with issues needing the immediate attention of the Data Capsule and Workset technical team can utilize the HathiTrust Research Center service desk, built on the Atlassian Jira Service Desk and bug tracking system. Software development and project management computers, grants management staff, and office space needed for the effort at Indiana University are provided by the Data To Insight Center. The other funded universities will provide similar resources needed for accomplishing tasks. We will utilize computer resources such as Amazon Web Services as needed for testing.

As this is a research grant, evaluation and performance measurements are built into the outcomes. That is, published results are amongst the planned outcomes. The findings from assessment and Participatory Design will be shared and discussed with developer and librarian teams during regular meetings. Ongoing feedback will be incorporated into the findings.

2.4 Project dissemination and sustainability

Recommendations from this project can be adopted in diverse library settings; the surveys and community-building efforts can bring together many stakeholders in data, including researchers, librarians, university administrators, and funding agencies. Results of the project will be disseminated through multiple professional, academic, and social media channels.

Community building is a key part of the project. Community-building user meetings from this project will be considered for inclusion in the regular HTRC UnCamps, hybrid conference-workshop events that are already a part of HTRC’s community engagement plan. Changes to the Data Capsule codebase undertaken during this project will be committed back to a new project branch of the existing Data Capsule code repository (https://github.com/htrc/HTRC-DataCapsules). As an intended outcome of the Participatory Design framework of this project, library partners, especially Level 1 partners, will be actively contributing to the code branch by the end of the project. This will create a broader community around the codebase, thus giving a strong foundation for its sustainability. The changes to the Data Capsules system, including the Workset, are anticipated to also benefit the instance running in the HathiTrust Research Center, creating another pillar in the foundation of sustainability for the framework.

3. National Impact

The proposed project will have national impact through i) provision of a portable solution for accessing restricted and sensitive collections, ii) fostering a community and increased collaboration around the technical, organizational, and policy challenges of providing computational access to restricted collections, and iii) amplifying project outcomes through the connection to the HathiTrust Consortium and its hundreds of member libraries. Our portable solution, once in shareable form, can be reused by other libraries around the country, where experts can improve the code and documentation as well as digital curation activities, and work with their users to develop new requirements and materials to use restricted digital collections in research and teaching. An emerging community will become part of the larger HathiTrust community and will continue stimulating libraries and research and non-profit organizations to join forces in further development and mutual learning and support. A strong sense of contribution and collaboration around community-sustained software will help to have a long-lasting impact.

Addressed needs: Through its development and participatory activities, this project will broaden access to digital collections that exist in libraries, including papers, letters, video materials and many others. It will not only establish a community dedicated to working on solutions for restricted collections, but also develop a strong foundation for motivating and engaging future generations of library experts in developing innovative software and services. Project outcomes will address the library needs of providing scalable tools for working with digital collections, while respecting privacy, copyright, and confidentiality restrictions, and contribute to building the National Digital Platform as a distributed set of software applications and professional expertise that provide library content and services to all users in the US [24].

In addition to providing a strong prototype, we will help train librarians and professionals involved in developing technology via support from and collaborations with our technical team and via targeted community events. We will support communities of practice and strengthen libraries as partners in addressing the research and scholarship needs of computational research.

Resulting products: This project will result in tangible products: extensions to the existing codebase for Data Capsule, guidelines and educational materials, and publications. The intangible product is community buy-in towards adoption and community involvement in ongoing contributions to the DC codebase. The tangible products spread experience and findings beyond the immediate library partners and so support increased adoption. Publications, for instance, are a tangible outcome that facilitates trust in technology and human work. Research provides the grounding for assessments of use.

Sustaining the benefit: The sustainability of the benefits of the proposed activity extends well beyond the period of funding. Importantly, this activity will vault an existing and successful service into broader use through study and extension, and will do so in a way that builds its adopters (libraries) into the process, thus growing the sustaining community through the grant duration.

Growing adopters and a sustaining community around the software codebase can take time, likely more time than the short grant duration. This risk is mitigated because the service itself is grounded in the HathiTrust Research Center, which stands behind the Data Capsule service as its primary service for computational analysis on the nearly 15 million volumes of the HathiTrust Digital Library. HTRC deeply welcomes this initiative to involve more partners. Because an expected outcome of this project is to have partners outside the HTRC technical team contribute to the codebase, HTRC commits to incorporating those changes back into the main branch of the Data Capsule codebase and to using the extensions in future releases of Data Capsule for its own and broader use.


Schedule of Completion

Timeline (by quarter): 2017: Apr-Jun, Jul-Sep, Oct-Dec; 2018: Jan-Mar, Apr-Jun, Jul-Sep, Oct-Dec; 2019: Jan-Mar, Apr-Jun

Award - May 2017

Task/element I: Assessment
Preparation for assessment
Assessment of collections, policies and contexts of use
Preparation for community building events
Community building events
Carry out publishable analyses of collected assessment and participatory design data
Support stakeholder/community interactions
Conduct online user studies
Publish training materials
Publish results

Task/element II: Partner engagement and evaluation
Plan DC install
First install in test environment
Partner campus visits
Guided hands-on experience and cross-institution learning
Co-design and evaluation of appliance
Demo DC and workset reflecting participatory design functionality
Continuous install, evaluation of improvements
Integrate project developments into DC codebase and release

Task/element III: Data capsule development
Design for appliance architecture
Development: code changes to package as appliance
Using feedback from assessment, refine design plans
Carry out publishable study that evaluates different design tradeoffs
Design evaluative study for DC as thin client to HPC resources
Carry out development study of DC as thin client
Evaluate and integrate changes in main DC branch
Publish results
Develop and release user and developer guides

Task/element IV: Workset study and development
Develop study of workset in this setting
Conduct study of workset
Using feedback from assessment, refine design plans
Carry out publishable study that evaluates different design tradeoffs
Evaluate and integrate changes in main workset/workset builder branch
Publish results


OMB Control #: 3137-0092, Expiration Date: 7/31/2018 IMLS-CLR-F-0032


DIGITAL PRODUCT FORM

Introduction

The Institute of Museum and Library Services (IMLS) is committed to expanding public access to federally funded digital products (i.e., digital content, resources, assets, software, and datasets). The products you create with IMLS funding require careful stewardship to protect and enhance their value, and they should be freely and readily available for use and re-use by libraries, archives, museums, and the public. However, applying these principles to the development and management of digital products can be challenging. Because technology is dynamic and because we do not want to inhibit innovation, we do not want to prescribe set standards and practices that could become quickly outdated. Instead, we ask that you answer questions that address specific aspects of creating and managing digital products. Like all components of your IMLS application, your answers will be used by IMLS staff and by expert peer reviewers to evaluate your application, and they will be important in determining whether your project will be funded.

PART I: Intellectual Property Rights and Permissions

A.1 What will be the intellectual property status of the digital products (content, resources, assets, software, or datasets) you intend to create? Who will hold the copyright(s)? How will you explain property rights and permissions to potential users (for example, by assigning a non-restrictive license such as BSD, GNU, MIT, or Creative Commons to the product)? Explain and justify your licensing selections.

The formal products produced as outcomes of our proposed effort are software, training materials, user and developer documentation, and studies. We also anticipate intermediate products in the form of datasets derived from testing of the connections to restricted and sensitive collections. The formal materials and software products resulting from this effort will be licensed using open and free licensing, e.g., Creative Commons and Apache 2.0-style licenses, following the best practice established by the HathiTrust Research Center (HTRC). Intermediate products emerging as a result of testing and experimentation will be discarded by the end of the project life. While operational use of a Data Capsule service at a partner institution is not anticipated over the course of the project, should it occur, or should HTRC’s operational Data Capsule service be used for training, then the data products emerging from end-user use of a Capsule will follow the HTRC policy of not imposing licensing restrictions on the products, assuming that the Data Capsule service that the end user is using is fully operational and the data products pass the review process (run by HTRC). If these conditions are not met, the data products are considered intermediate products and will be destroyed by the end of the project life.

A.2 What ownership rights will your organization assert over the new digital products and what conditions will you impose on access and use? Explain and justify any terms of access and conditions of use and detail how you will notify potential users about relevant terms or conditions.

Software products developed in this project will be openly shared and accessible via an open software repository (GitHub). As to access to the Data Capsule service, during the course of the project there will be test instances of Data Capsule service running at the Level 1 library testing partner institutions, and an operational instance running at Indiana University as part of HTRC. We anticipate the test instances of Data Capsule service having no end-user uses during the course of the project as they will be under development. Training will be carried out on the operational HTRC instance of the Data Capsule service.

A.3 If you will create any products that may involve privacy concerns, require obtaining permissions or rights, or raise any cultural sensitivities, describe the issues and how you plan to address them.

As part of this project, we will be conducting interviews and taking notes during ethnographic observations. The data collected via interactions with human subjects will be stored securely and accessed by project investigators only. Such data will be shared only after appropriate anonymization or with explicit consent from participants. Additionally, restricted collections that will be used during testing in computational analysis in Data Capsules may raise copyright, privacy or other concerns. These concerns will be addressed through policy discussions with library partners; these discussions may be guided by HTRC’s policy developed to address similar concerns.

Part II: Projects Creating or Collecting Digital Content, Resources, or Assets

A. Creating or Collecting New Digital Content, Resources, or Assets

A.1 Describe the digital content, resources, or assets you will create or collect, the quantities of each type, and format you will use.

In the course of this project the following digital content will be created:

1. Extensions to Data Capsule service. The extensions will start from the existing HTRC codebase, which is organized in approx. 50 modules. It is expected that modifications will touch 10-20% of the code for partner customization.
2. Enhancements to the Workset model. This resource is an Ontology that can be expressed in RDF and/or XML formats. Enhancements will comprise about 10% of the resource.
3. Interview recordings and transcripts and field notes. See Part IV Datasets for more details.
4. Online manuals and training materials. Installation, testing and use of Data Capsule will be documented in online manuals and training materials, which will be openly accessible via the web.
5. Publications and presentations. Findings from the project will be disseminated via journals, conferences, and other venues. PDF documents and slides will be openly shared with the community, unless publishing restrictions apply.

A.2 List the equipment, software, and supplies that you will use to create the content, resources, or assets, or the name of the service provider that will perform the work.

The project activity will be to develop software extensions to existing codebases and conduct human-computer interaction studies. Activity does not extend to the creation of digital collections. We intend to use computers at Indiana University, University of Illinois, University of Virginia, UC Berkeley, and UCLA for testing and development. We expect Level 1 partners to have test servers available on which we will install the software (Data Capsule).

A.3 List all the digital file formats (e.g., XML, TIFF, MPEG) you plan to use, along with the relevant information about the appropriate quality standards (e.g., resolution, sampling rate, or pixel dimensions).

Software will exist in development formats, predominantly Java files, Python scripts, and XML configuration files. Partner libraries that use the operational Data Capsule service at HTRC to analyze their restricted collections may have derived products in other formats that are appropriate in their respective user disciplines, such as tabular files or images. Quality standards for those derived products as well as quality challenges will be discussed during participatory design activities. Software quality will be monitored and evaluated by using "fitness for purpose" and structural analysis techniques.

B. Workflow and Asset Maintenance/Preservation

B.1 Describe your quality control plan (i.e., how you will monitor and evaluate your workflow and products).

For details on software quality control, see Part III.

The assessment will be carried out by a PhD research faculty member who is highly trained in carrying out quality processes. Dr. Kouper has a strong record of publication-quality research in this area. Software development will use HTRC’s software development processes, including oversight by a DevOps Manager, help desk, and bug tracking. Studies of Data Capsule and Workset will be under the supervision of Plale and Downie, both full professors and accomplished scholars in this type of work.

B.2 Describe your plan for preserving and maintaining digital assets during and after the award period of performance. Your plan may address storage systems, shared repositories, technical documentation, migration planning, and commitment of organizational funding for these purposes. Please note: You may charge the federal award before closeout for the costs of publication or sharing of research results if the costs are not incurred during the period of performance of the federal award (see 2 C.F.R. § 200.461).

Software products will be shared, preserved and maintained using the open software repository GitHub. Technical documentation will be stored on GitHub as well as on the open HTRC wiki pages. We will encourage the HathiTrust community and the emerging Data Capsule community to further contribute to curation and preservation of the software. Products of research (publications, datasets, and presentations) will be preserved in the Indiana University institutional repository IUScholarWorks, which will serve as an additional preservation layer to traditional publication venues.

C. Metadata

C.1 Describe how you will produce any and all technical, descriptive, administrative, or preservation metadata. Specify which standards you will use for the metadata structure (e.g., MARC, Dublin Core, Encoded Archival Description, PBCore, PREMIS) and metadata content (e.g., thesauri).

README files and user and developer guides are the forms of documentation used to preserve software metadata. For datasets we will use Dublin Core to record descriptive, administrative, and preservation metadata.
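
As a rough illustration of what a Dublin Core description of a dataset might look like, the following Python sketch builds a small XML record from the Dublin Core element set. The `build_dc_record` helper, the `record` wrapper element, and all field values are hypothetical and are not taken from the project; only the Dublin Core namespace URI is standard.

```python
import xml.etree.ElementTree as ET

# Standard namespace of the Dublin Core Metadata Element Set 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_dc_record(fields):
    """Return a Dublin Core record as an XML string.

    `fields` maps Dublin Core element names (e.g. "title") to a string
    or a list of strings. The <record> wrapper element is an assumption
    made for this sketch, not a project convention.
    """
    root = ET.Element("record")
    for name, values in fields.items():
        if isinstance(values, str):
            values = [values]
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical example values for an anonymized interview dataset.
example = build_dc_record({
    "title": "Interview transcripts (anonymized)",
    "creator": "Project investigators",
    "type": "Dataset",
    "format": "text/plain",
    "rights": "Shared after anonymization or with participant consent",
})
print(example)
```

A record of this kind could travel with the dataset files, which is one way to keep the metadata stored and migrated together with the digital products.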

C.2 Explain your strategy for preserving and maintaining metadata created or collected during and after the award period of performance.

Metadata will be maintained as part of the software and data maintenance, i.e., it will be stored and migrated along with the digital products.

C.3 Explain what metadata sharing and/or other strategies you will use to facilitate widespread discovery and use of the digital content, resources, or assets created during your project (e.g., an API [Application Programming Interface], contributions to a digital platform, or other ways you might enable batch queries and retrieval of metadata).

As the project is not concerned with creating a digital collection, we will rely on other larger resources for widespread discovery and use, including HathiTrust Research Center networks, academic publishing databases, and software and institutional repositories.

D. Access and Use

D.1 Describe how you will make the digital content, resources, or assets available to the public. Include details such as the delivery strategy (e.g., openly available online, available to specified audiences) and underlying hardware/software platforms and infrastructure (e.g., specific digital repository software or leased services, accessibility via standard web browsers, requirements for special software tools in order to use the content).

Software and study products will be openly available online, unless the latter are restricted by the publishers.

D.2 Provide the name(s) and URL(s) (Uniform Resource Locator) for any examples of previous digital content, resources, or assets your organization has created.

The Data to Insight Center has its own group repository on GitHub where all software products are made available to the public (https://github.com/Data-to-Insight-Center). Most recent examples include Data MatchMaker (https://github.com/Data-to-Insight-Center/Data-MatchMaker) and PRAGMA Data (https://github.com/Data-to-Insight-Center/PRAGMA-Data-Repository).

Additionally, D2I contributions to HTRC code are made available via a separate repository, https://github.com/htrc, where the existing Data Capsule codebase can be found (https://github.com/htrc/HTRC-DataCapsules).

Part III. Projects Developing Software

A. General Information

A.1 Describe the software you intend to create, including a summary of the major functions it will perform and the intended primary audience(s) it will serve.

To accomplish the goals of this project, we will extend the Data Capsules service codebase. HTRC Data Capsule works by giving a researcher a virtual machine (VM) that runs within the HTRC domain. The researcher can configure the VM as they would their own desktop with their own tools. After they are done, the VM switches into a “secure mode”, where network and other data channels are restricted in exchange for access to the data being protected. Currently, Data Capsule works only with the HathiTrust Digital Library and within HTRC architecture. We will generalize the architecture to work with other collections and evaluate design, secure access and scalability options to work in specific library environments.
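
The trade-off described above, open channels while configuring versus data access once channels are restricted, can be sketched as a two-state model. This is an illustrative toy under stated assumptions, not the Data Capsule implementation: the class and method names are hypothetical, and "maintenance" is an assumed label for the configuration phase, which the text above does not name.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    MAINTENANCE = "maintenance"  # assumed name for the configuration phase
    SECURE = "secure"            # the "secure mode" described in the text


@dataclass
class Capsule:
    """Toy model of a capsule VM; all names here are hypothetical."""
    mode: Mode = Mode.MAINTENANCE

    def network_allowed(self) -> bool:
        # Network and other data channels are open only while configuring.
        return self.mode is Mode.MAINTENANCE

    def protected_data_allowed(self) -> bool:
        # Protected data becomes reachable only once channels are restricted.
        return self.mode is Mode.SECURE

    def switch_to_secure(self) -> None:
        self.mode = Mode.SECURE


capsule = Capsule()
print(capsule.network_allowed(), capsule.protected_data_allowed())
capsule.switch_to_secure()
print(capsule.network_allowed(), capsule.protected_data_allowed())
```

The point of the sketch is that the two permissions are never granted at the same time, which is the property that lets a researcher bring arbitrary tools into the VM without gaining a channel to move protected data out.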

A.2 List other existing software that wholly or partially performs the same functions, and explain how the software you intend to create is different, and justify why those differences are significant and necessary.

Comparable conceptual frameworks that intend to perform similar functions include Data Enclaves and Storage Capsules. Data Enclaves rely on customized virtualization software and a pre-defined set of tools to enable access. To the best of our knowledge, no working software exists that addresses the need to perform computational analysis on documents and resources using a researcher-defined set of tools. As the need for computational research on restricted collections using a large variety of tools grows, the development of such software is undoubtedly significant and necessary.

B. Technical Information

B.1 List the programming languages, platforms, software, or other applications you will use to create your software and explain why you chose them.

Data Capsule software is written in Java, Python, and shell scripts.

B.2 Describe how the software you intend to create will extend or interoperate with relevant existing software.

The software extends the Data Capsule service.

B.3 Describe any underlying additional software or system dependencies necessary to run the software you intend to create.

Data Capsule uses open source virtualization infrastructure (QEMU and KVM), which needs to be installed for the capsule to work.

The MySQL relational database system is used to store capsule metadata and results.

Data Capsule is provided for the Ubuntu (Linux) environment.
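
Before installing on a partner test server, it may help to confirm that these dependencies are present. The following is a minimal sketch, assuming the QEMU and MySQL executables are named `qemu-system-x86_64` and `mysql` and that KVM is exposed as `/dev/kvm`; these names are common on Ubuntu but are assumptions here, not requirements stated by the project.

```python
import os
import shutil

# Executable names are assumptions; they can differ by distribution and release.
REQUIRED_COMMANDS = ["qemu-system-x86_64", "mysql"]


def check_dependencies(commands=REQUIRED_COMMANDS, kvm_device="/dev/kvm"):
    """Return a dict mapping each dependency name to True (found) or False."""
    status = {cmd: shutil.which(cmd) is not None for cmd in commands}
    # On Linux, the KVM kernel module exposes this device node when available.
    status["kvm"] = os.path.exists(kvm_device)
    return status


if __name__ == "__main__":
    for name, ok in check_dependencies().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```

A check like this only reports presence; it says nothing about versions or configuration, which would need to follow the project's installation documentation.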

B.4 Describe the processes you will use for development, documentation, and for maintaining and updating documentation for users of the software.

The code will be forked in the GitHub repository, creating a new branch. Contributing developers will write code in their own environments and then commit it back to GitHub. We will use HTRC documentation and bug-tracking services (Atlassian Confluence and Jira) for maintaining and updating documentation for users of the software. At the end of the project, online manuals will also be written.

B.5 Provide the name(s) and URL(s) for examples of any previous software your organization has created.

The Data to Insight Center has its own group repository on GitHub where all software products are made available to the public (https://github.com/Data-to-Insight-Center). Most recent examples include Data MatchMaker (https://github.com/Data-to-Insight-Center/Data-MatchMaker) and PRAGMA Data (https://github.com/Data-to-Insight-Center/PRAGMA-Data-Repository).

Additionally, D2I contributions to HTRC code are made available via a separate repository, https://github.com/htrc, where the existing Data Capsule codebase can be found (https://github.com/htrc/HTRC-DataCapsules).

C. Access and Use

C.1 We expect applicants seeking federal funds for software to develop and release these products under open-source licenses to maximize access and promote reuse. What ownership rights will your organization assert over the software you intend to create, and what conditions will you impose on its access and use? Identify and explain the license under which you will release source code for the software you develop (e.g., BSD, GNU, or MIT software licenses). Explain and justify any prohibitive terms or conditions of use or access and detail how you will notify potential users about relevant terms and conditions.

We will use the Apache 2.0 license to release Data Capsule. The license allows users to reproduce and distribute copies of the software and its derivatives with or without modifications. The license is applied by adding its notice to the header of each software file (see https://www.apache.org/licenses/LICENSE-2.0 for a copy of the license).
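
Adding the notice to file headers can be automated. The sketch below prepends the standard Apache 2.0 boilerplate notice (from the appendix of the license) to the text of a source file, using `#` comments by default. The `add_license_header` helper and the year and owner values are hypothetical and chosen only for illustration.

```python
# Boilerplate notice from the appendix of the Apache License, Version 2.0.
APACHE_NOTICE = """\
Copyright {year} {owner}

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""

MARKER = "Licensed under the Apache License, Version 2.0"


def add_license_header(source, year, owner, prefix="#"):
    """Return `source` with the notice prepended as comment lines.

    Text that already contains the notice is returned unchanged, so the
    function can be applied repeatedly without duplicating the header.
    """
    if MARKER in source:
        return source
    notice = APACHE_NOTICE.format(year=year, owner=owner)
    lines = [f"{prefix} {line}".rstrip() for line in notice.splitlines()]
    return "\n".join(lines) + "\n\n" + source


# Hypothetical year and owner, for illustration only.
print(add_license_header("print('hello')\n", 2017, "Example Owner"))
```

For Java files the same notice would normally sit inside a block comment; passing `prefix="//"` is a simple approximation in this sketch.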

C.2 Describe how you will make the software and source code available to the public and/or its intended users.

The source code extensions to the Data Capsule will be made available via GitHub (https://github.com/htrc) in a branch separate from the primary branch.

C.3 Identify where you will deposit the source code for the software you intend to develop:

Name of publicly accessible source code repository: GitHub

URL: https://github.com/htrc

Part IV: Projects Creating Datasets

A.1 Identify the type of data you plan to collect or generate, and the purpose or intended use to which you expect it to be put. Describe the method(s) you will use and the approximate dates or intervals at which you will collect or generate it.

Data will be collected via phone interviews and ethnographic observations, which involve note-taking, recording, and photographs. Phone interviews will be conducted at the beginning of the project. Follow-up interviews and additional recordings of conversations and note-taking will take place throughout the project as the need to document participant interactions arises.

A.2 Does the proposed data collection or research activity require approval by any internal review panel or institutional review board (IRB)? If so, has the proposed research activity been approved? If not, what is your plan for securing approval?

Data collection involves human subjects and requires IRB approval. An IRB application will be prepared and submitted if and when the project is approved for funding.

A.3 Will you collect any personally identifiable information (PII), confidential information (e.g., trade secrets), or proprietary information? If so, detail the specific steps you will take to protect such information while you prepare the data files for public release (e.g., data anonymization, data suppression of PII, or synthetic data).

Participants can be identified in phone interviews, notes, and recordings. Personally identifiable information will be stored securely and only the PI and co-PIs will have access to it. Before public release of the dataset all PII will be removed (participants will be assigned coded numbers and any information that may identify them individually will be obscured in the interviews, notes, and transcripts).
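
The coded-number step described above can be sketched as follows. This is a minimal illustration, assuming participant names are known in advance and appear verbatim in the text; the function names and the `P001` code format are hypothetical. Replacing names does not by itself remove other identifying details, which would still need manual review.

```python
import re


def assign_codes(names):
    """Map each participant name to a coded identifier such as P001.

    Codes follow the order in which names are first supplied; duplicates
    are ignored. The mapping itself is sensitive and would need to be
    stored securely, separately from the released data.
    """
    unique = list(dict.fromkeys(names))
    return {name: f"P{idx:03d}" for idx, name in enumerate(unique, start=1)}


def pseudonymize(text, codes):
    """Replace every whole-word occurrence of a known name with its code."""
    # Longer names first, so "Ann Lee Smith" is handled before "Ann Lee".
    for name in sorted(codes, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(name)}\b", codes[name], text)
    return text


# Hypothetical participants and transcript line.
codes = assign_codes(["Alice Smith", "Bob Lee", "Alice Smith"])
print(pseudonymize("Alice Smith met Bob Lee on campus.", codes))
```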

A.4 If you will collect additional documentation, such as consent agreements, along with the data, describe plans for preserving the documentation and ensuring that its relationship to the collected data is maintained.

Participants will be provided with informed consent forms, which they will sign. The forms will be stored securely and separately, and the relationship to the collected data will be maintained via a study ID that will be recorded in the informed consent forms and in the data files.

A.5 What methods will you use to collect or generate the data? Provide details about any technical requirements or dependencies that would be necessary for understanding, retrieving, displaying, or processing the dataset(s).

The data will be collected via interviews and observations and will consist of text files, audio and video files, and photographs. Common word processing software and multimedia players may be used to display the data. Processed data may consist of additional spreadsheets and visualizations, which will be stored in non-proprietary formats (e.g., CSV or PNG).

A.6 What documentation (e.g., data documentation, codebooks) will you capture or create along with the dataset(s)? Where will the documentation be stored and in what format(s)? How will you permanently associate and manage the documentation with the dataset(s) it describes?

Codebooks will be created as part of the analysis of qualitative data (e.g., in the thematic coding procedures codes will be developed in the inductive manner, after close iterative reading of the interviews). Codes, their descriptions and other documentation that describes when and where the interviews and observations took place will be stored in text formats along with the data. The documentation will be associated with the datasets through consistent file naming and through identifiers that refer to each data collection effort separately.
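
One possible way to realize consistent file naming is to encode a study ID, the data collection effort, the kind of file, and the date in every name, so that a codebook and the data it describes share a common prefix. The naming pattern and helper names below are hypothetical, not a project convention; the sketch assumes the study ID and file kind contain no underscores.

```python
import re
from datetime import date


def dataset_filename(study_id, collection, kind, when, ext):
    """Build a name such as S01_partner-visit-1_fieldnotes_2018-03-05.txt."""
    # Reduce the collection label to lowercase words joined by hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", collection.lower()).strip("-")
    return f"{study_id}_{slug}_{kind}_{when.isoformat()}.{ext}"


def parse_filename(name):
    """Recover the components of a name produced by dataset_filename."""
    stem, ext = name.rsplit(".", 1)
    study_id, slug, kind, when = stem.split("_")
    return {"study_id": study_id, "collection": slug, "kind": kind,
            "date": when, "ext": ext}


# Hypothetical data file and its matching codebook.
notes = dataset_filename("S01", "Partner Visit 1", "fieldnotes", date(2018, 3, 5), "txt")
codebook = dataset_filename("S01", "Partner Visit 1", "codebook", date(2018, 3, 5), "txt")
print(notes)
print(codebook)
```

Because both names share the study ID and collection slug, documentation and data can be matched by parsing the names, without relying on folder layout.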

A.7 What is your plan for archiving, managing, and disseminating data after the completion of the award-funded project?

The data will be managed and archived using the Scholarly Data Archive (backed-up storage for long-term archiving) and institutional Google Drive at Indiana University (for active work with data). Folders with appropriate permissions for data, processing scripts, IRB documentation, and publications will be created. For dissemination, we will use the IUScholarWorks repository and one of the publicly available repositories, such as Figshare or Mendeley Data.

A.8 Identify where you will deposit the dataset(s):

Name of repository: IUScholarWorks; Figshare; Mendeley Data

URL: scholarworks.iu.edu/dspace/; figshare.com; data.mendeley.com

A.9 When and how frequently will you review this data management plan? How will the implementation be monitored?

PIs will monitor the implementation of this data management plan. The plan will be reviewed every 6 months and adjusted according to the amounts and types of data generated.