building a public research center for the hathitrust digital library
Post on 18-Nov-2014
June 14, 2011
JCDL 2011: Big Data! Big Deal? Panel
Building a Public Research Center for the HathiTrust Digital Library
Robert H. McDonald
Associate Dean for Library Technologies and Digital Libraries
Associate Director, Data to Insight Center, Pervasive Technology Institute
Indiana University
@hathitresearch | @hathitrust
http://www.hathitrust-research.org
HathiTrust Research Center (HTRC) Team
Indiana University
- Beth Plale, Director
- Robert McDonald, Executive Committee
University of Illinois
- Scott Poole, Co-Director
- John Unsworth, Executive Committee
HathiTrust Digital Library History
• To contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge.
• Launched in October 2008
- University of Michigan
- Indiana University
• Used Google Books Repository at Michigan as model
• Expanded to include content from
- CIC Member Libraries
- UC System Libraries
- University of Virginia
• Now includes more than 50 partner institutions and more than 8 million volumes
Towards a HathiTrust Research Center
• Started in response to the proposed Google Settlement (June 2009)
• Specific funding set aside by Google to build a public research center
• Worked to identify key stakeholders from HT institutions to collaborate and write the RFP
• The Google Settlement decision in early 2011 did not stop the center
• Developed a specific RFP for HathiTrust to solicit proposals (Summer/Fall 2009, HTRC RFP Working Group)
• RFP released (Winter 2010)
Our Collaboration
HTRC is founded as a joint venture between Indiana University and the University of Illinois Urbana-Champaign, aimed at solving the difficult challenges of increasing computational access to the public domain and copyrighted material in HathiTrust.
Our Mission
• Phase I: starting Apr 2011 and going for 18 mos.
• Phase II: starting Fall 2012 and going for …
• Goal: enable strong computational research and education on a collection that has not been amenable to computational exploration EVER before!
Our Goals
• Maintain a repository of text mining algorithms and retrieval tools available online for human and programmatic discovery. Also register derived data sets, indexes, and versions in a registry repository.
• Be a user-driven resource, with an active advisory board and a community model that allows users to share algorithms and tools.
• Support interoperability across collections and institutions through use of InCommon SAML identity.
Our Future
• Support innovation in cyberinfrastructure to deliver optimal access and use of the HathiTrust corpus.
• Implement “Non-consumptive” research: a technical and intellectual challenge.
• Identify and host existing data analysis, text mining, and retrieval tools that are of interest to the community.
• Stimulate development of new analytical methods and tools. We hope that the scale of the HTRC will promote new levels of collaboration in tool development.
HathiTrust Research Center Today
HTRC is dedicated to providing access to a comprehensive body of published works for computational research in scholarship and education.
Lightweight Organization
Executive Committee
- Beth Plale, Indiana
- Scott Poole, Illinois
- Robert H. McDonald, Indiana
- John Unsworth, Illinois
Advisory Board
- TBD
HathiTrust Executive Committee Liaison
- Laine Farley, California Digital Library
HathiTrust Research Center Today
• $250K in funding for initial 18-month startup
• Creating themed collections for early use cases
- Astronomy
- Victorian Literature
- Influenza
• Ingest and replication mechanisms between HT and HTRC
• Full-text SOLR indexes
• Data Capsule integration
• Karma integration
• Integration with SEASR/MEANDRE SOA services at NCSA
• Alignment with Bamboo Technology Project
• Alignment with international Google Books Research Centers
• Establishing long-term non-consumptive research methodologies
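The slide lists full-text SOLR indexes without showing how they might be queried. As a minimal sketch, assuming a Solr core reachable over HTTP, the snippet below builds a standard `/select` query URL. The host, the core name `htrc-demo`, and the field names `ocr`, `id`, `title`, and `author` are hypothetical and do not come from the slides.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, core, phrase, rows=10):
    """Build a Solr /select URL for a phrase query (no network access here)."""
    # Escape backslashes and double quotes so the phrase stays one quoted term.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "q": f'ocr:"{escaped}"',    # 'ocr' is an assumed full-text field name
        "fl": "id,title,author",    # assumed stored fields to return
        "rows": rows,               # maximum number of hits
        "wt": "json",               # ask Solr for a JSON response
    }
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical local Solr instance and core name.
    print(build_solr_query_url("http://localhost:8983/solr", "htrc-demo",
                               "influenza epidemic"))
```

The function only constructs the URL; sending it (for example with `urllib.request.urlopen`) and parsing the JSON response is left out so the sketch runs without a Solr server.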
HTRC Proposed Technical Architecture
Courtesy IU Data to Insight Center: Beth Plale/Yiming Sun
[Diagram: Tag Cloud Viewer Data Flow]
Components shown:
• Book Search Interface by Author or Title
• JS/PHP Auto-completer
• Sample Collection Bibliography Database
• Sample Public Domain Collection
• Public-domain OCR Web Access Servlet
• Meandre Workbench
• SEASR Infrastructure
Annotations:
• A persistent RESTful Web Service
• Organized as pairtree for demo only
• Administrator creates tag cloud viewer in advance through SEASR
• Converted from MARC to RIS
Steps:
1. User enters author name or volume title
2. Query RIS for author name or volume title
3. Volume ID
4. Invoke Tag Cloud service with URL
5. Use URL to retrieve volume
6. OCR for volume
7. Tag cloud returned to user
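The diagram notes that the demo collection is organized as a pairtree. As a rough sketch of what that means, the code below maps an identifier to a pairtree path following the identifier-cleaning rules of the Pairtree specification (hex-encode a small set of characters, swap `/`, `:`, `.` for `=`, `+`, `,`, then split into two-character directory names). Placing the object in a final directory named after the cleaned identifier is a common convention and an assumption here; the slides do not describe the demo's actual layout.

```python
# Characters that Pairtree hex-encodes as ^hh (plus any non-visible ASCII).
HEX_ENCODE = set('"*+,<=>?\\^|')
# Single-character substitutions applied to everything else.
SINGLE_CHAR_MAP = {"/": "=", ":": "+", ".": ","}


def clean_id(identifier):
    """Apply Pairtree identifier cleaning, byte by byte over UTF-8."""
    out = []
    for byte in identifier.encode("utf-8"):
        ch = chr(byte)
        if byte < 0x21 or byte > 0x7E or ch in HEX_ENCODE:
            out.append(f"^{byte:02x}")               # hex-encode unsafe bytes
        else:
            out.append(SINGLE_CHAR_MAP.get(ch, ch))  # swap or keep as-is
    return "".join(out)


def pairtree_path(identifier, root="pairtree_root"):
    """Return root/xx/yy/.../cleaned_id for the given identifier."""
    cleaned = clean_id(identifier)
    # Split into two-character segments; the last one may be a single char.
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join([root, *pairs, cleaned])


if __name__ == "__main__":
    print(pairtree_path("ark:/13030/xt12t3"))
```

The identifier `ark:/13030/xt12t3` is the worked example used in the Pairtree specification itself, not a volume from the demo collection.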
Current SEASR Integration Demo
Courtesy IU Data to Insight Center – Felix Terkhorn/Yiming Sun
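The numbered tag cloud data flow (search by author or title, resolve a volume ID, build a URL, retrieve OCR, return a tag cloud) can be sketched end to end in a few lines. This is an illustration only: the in-memory dictionaries stand in for the bibliography database and the OCR web service, the base URL and volume ID are hypothetical, and a plain word-frequency count stands in for the SEASR/Meandre tag cloud component.

```python
import re
from collections import Counter

# Hypothetical stand-in for the bibliography database (steps 2 and 3).
BIBLIOGRAPHY = {
    "demo.vol1": {"author": "Example Author", "title": "A Sample Volume"},
}
# Hypothetical stand-in for the OCR text the web service would return (step 6).
OCR_TEXT = {
    "demo.vol1": "The comet and the star. The comet returned.",
}
BASE_URL = "http://localhost:8080/ocr"  # hypothetical service address
STOPWORDS = {"the", "and", "a", "of"}


def find_volume_ids(query):
    """Steps 1-3: match the query against author and title, return volume IDs."""
    q = query.lower()
    return [vid for vid, rec in BIBLIOGRAPHY.items()
            if q in rec["author"].lower() or q in rec["title"].lower()]


def volume_url(volume_id):
    """Step 4: build the URL that is handed to the tag cloud service."""
    return f"{BASE_URL}/{volume_id}"


def fetch_ocr(url):
    """Steps 5-6: a real client would issue an HTTP GET; here we look it up."""
    return OCR_TEXT[url.rsplit("/", 1)[-1]]


def tag_cloud(text, top_n=5):
    """Step 7: word frequencies as (word, count) pairs, most frequent first."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS).most_common(top_n)


def tag_cloud_for(query):
    """Run the whole flow for every volume that matches the query."""
    return {vid: tag_cloud(fetch_ocr(volume_url(vid)))
            for vid in find_volume_ids(query)}


if __name__ == "__main__":
    print(tag_cloud_for("sample"))
```

Keeping each step as its own function mirrors the diagram's separation between the search interface, the bibliography lookup, the OCR service, and the tag cloud service, so any one stand-in could be swapped for a real client.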
Non-Consumptive Research Track
No action or set of actions on the part of HathiTrust Research Center users, either acting alone or in cooperation with other users over the duration of one or multiple sessions, can result in sufficient information gathered from the HathiTrust collection to reassemble pages from the collection.
Beth Plale (Indiana University)
Atul Prakash (University of Michigan)
Geoffrey Fox (Indiana University)
Robert H. McDonald (Indiana University)
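One way to picture the non-consumptive constraint is a derived data set that keeps word counts but discards word order. The sketch below is an illustration under that assumption, not a method described in the slides, and counts alone do not guarantee that pages cannot be reassembled across many queries; the function name `page_features` is hypothetical.

```python
import re
from collections import Counter


def page_features(page_text):
    """Return alphabetically sorted token counts for one page.

    Sorting by token removes the original word order, so two pages with
    the same words in a different order yield identical output.
    """
    tokens = re.findall(r"[a-z']+", page_text.lower())
    return dict(sorted(Counter(tokens).items()))


if __name__ == "__main__":
    print(page_features("The cat saw the dog"))
    print(page_features("The dog saw the cat"))  # same counts, order lost
```

Both calls print the same dictionary, which shows that the original sentence cannot be told apart from its reordering using this output alone.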
Secure Data Capsule: Researcher Access
Provision access to copyrighted content for research purposes, giving researchers flexible computing resources in a controlled environment.
HathiTrust Digital Library Content
• Access to HT open content indices
• Access to HT copyrighted indices
• Auditable secure mechanisms for legally mandated, MOU-based, and fair-use compliance
Researcher-Driven Applications for Use as Services within the Data Capsule
• Can HTRC provide a services framework for researcher applications to run within the secure data capsule compute resources?
HTRC Managed Data-Intensive Compute Resources
HathiTrust Research Center Events
• HTRC Kickoff Event at Digital Humanities Conference 2011, Stanford University, June 20, 2011
• Working on models for collaborative research
- AHRC/ESRC/IMLS/JISC/NEH/NSF/NWO/SSHRC Digging into Data Round 2: http://www.diggingintodata.org/
• Working on early advanced user case studies for the HathiTrust Corpus
Support and Acknowledgements
• IU UITS Research Technologies
• National Center for Supercomputing Applications
• IU Data to Insight Center
• iCHASS
• Illinois Informatics Institute
• Lilly Endowment, Inc.
• The Alfred P. Sloan Foundation
For More on HathiTrust Research Center
See http://www.hathitrust-research.org
Follow us @hathitresearch on Twitter
Robert H. McDonald
@mcdonald on Twitter
robert@indiana.edu