Accommodating Diverse Search Requirements over a Fedora Repository
Michael Durbin and Jon W. Dunn
Fedora User Group – Open Repositories 2008
April 3, 2008
April 19, 2023Fedora Users Group - Open Repositories 2008
Background
o Indiana University Digital Library Program
  • Started in 1997
o Diversity of formats and collections
  • Text, image, musical scores, audio, video, …
o Diversity of search systems
  • DLXS, XTF, Lucene, DB2 NSE, Oracle Text
o Current project to unify architecture for storage, discovery, and delivery around Fedora
Search System Development
o Phase one: create a search architecture and template for an image-based search and discovery application
o Phase two: extend the template and architecture to support more advanced search and discovery applications over different object types
PHASE I: CREATING A BASIC IMAGE SEARCH
Phase One: Simple Image Search
o Slocum puzzle collection: ideal test case
o Small number of objects
o Simple content model
  • Each object represents a single physical puzzle
  • Basic metadata: METS, MODS, DC
  • RELS-EXT isMemberOf relationship with a collection object
  • Pre-scaled derivative images
Requirements: Identifier Resolution
o External Identifiers rather than Fedora PIDs
  • Seamless migration to Fedora
  • No commitment to any underlying repository architecture
o Requirement: Quickly resolve our identifier (PURL) to the Fedora PID
Requirements: PURL Identifier Resolution
[Diagram labels: Hypothetical ID Resolution Service; OCLC PURL Resolver]
PURL: http://purl.dlib.indiana.edu/iudl/lilly/slocum/thumbnail/LL-SLO-004696
Fedora URL: http://fedora.dlib.indiana.edu:8080/fedora/get/iudl:19794/THUMBNAIL
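The mapping between the two URLs above can be sketched in a few lines of Python. This is only an illustration: the in-memory lookup table, the `resolve_purl` name, and the assumption that the last two PURL path segments are a view name and an external identifier are not taken from the slides, and a real resolver would more likely query an index.

```python
from urllib.parse import urlparse

# Hypothetical lookup table from external identifier to Fedora PID.
# A real service would look this up in an index rather than in memory.
PID_BY_IDENTIFIER = {"LL-SLO-004696": "iudl:19794"}

FEDORA_BASE = "http://fedora.dlib.indiana.edu:8080/fedora/get"


def resolve_purl(purl: str) -> str:
    """Map a PURL to a Fedora datastream URL (assumed PURL layout)."""
    parts = urlparse(purl).path.strip("/").split("/")
    # Assumption: the last two path segments are <view>/<identifier>.
    view, identifier = parts[-2], parts[-1]
    pid = PID_BY_IDENTIFIER[identifier]  # raises KeyError for unknown identifiers
    # Assumption: the datastream ID is the upper-cased view name.
    return f"{FEDORA_BASE}/{pid}/{view.upper()}"
```

Because callers only ever see the PURL, the PID table (or the repository behind it) can change without breaking published links.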
Requirements: Keyword and Fielded Search
o Very basic search requirements for any discovery and delivery web application
  • Keyword search should maximize discovery
  • MODS fields should be searchable to maximize accuracy of matches
  • Search results paging
  • Support for simple Boolean operators
  • Wildcard searches are a requirement
  • Full metadata record (MODS) returned
Remaining Requirements
o User interface
  • Extensible, Reusable, Customizable
o Service oriented approach
  • Centralize core search system
  • Standards-based access for integration with other services and end-user tools
Requirements: Search System
[Diagram: UI Layer (Slocum Webapp, GenericSearch Webapp); Search Layer (PURL Resolution, Fielded Search, Fedora Integration)]
Solutions: Search Protocol
o Search and Retrieve via URL (SRU)
  • One of very few standard search protocols
  • Extremely powerful and flexible query language (CQL)
  • Can return records of any type
  • Most commonly used with DC, MODS, MARCXML
  • Has mechanisms for extension in case special needs arise
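As a rough illustration of what an SRU request looks like, the sketch below builds a searchRetrieve URL with paging. The parameter names (operation, version, query, startRecord, maximumRecords, recordSchema) are standard SRU 1.1 request parameters; the base URL, the `sru_search_url` helper, and the "mods" schema short name are assumptions, since schema names are defined by each server.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, page=1, page_size=20, schema="mods"):
    """Build an SRU 1.1 searchRetrieve URL for one page of results."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,  # a CQL query string
        # SRU record positions are 1-based, so page 1 starts at record 1.
        "startRecord": (page - 1) * page_size + 1,
        "maximumRecords": page_size,
        # Assumption: the server exposes MODS under the short name "mods".
        "recordSchema": schema,
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint; the CQL combines a wildcard term and a Boolean operator.
url = sru_search_url("http://example.org/sru", "dc.title = puzzle* and dc.subject = wood", page=3, page_size=10)
```

Paging, Boolean operators, wildcards, and full-record retrieval from the requirements list all map onto these request parameters and the CQL query itself.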
Search System Solutions: SRU
[Diagram: the UI Layer (Slocum Webapp, GenericSearch Webapp) connects over SRU to the Search Layer (PURL Resolution, Fielded Search, Fedora Integration)]
Solutions: Existing Products
o Fedora Search
  • Good for finding items based on basic Fedora metadata, but not for more sophisticated searching
o Fedora Resource Index Search
  • Also limited to searching basic metadata, not the content of datastreams
Solutions: Existing Products
o Fedora Generic Search Service (GSearch)
  • Hooks into Fedora
  • Works with Lucene
  • Easy to customize search fields through XSLT transformation of existing metadata
o OCLC SRU/W Implementation
  • Relatively complete implementation in Java, with ongoing development
  • Others have had success using it with Lucene
Search System
[Diagram: the Fedora Generic Search Service updates the Lucene index; the OCLC SRU Implementation reads it through a Lucene Database extension and exposes it over SRU]
Phase 1 Solution: General Applicability
o Pieces of this solution have been used for other image collections
o SRU is used to expose these collections to OneSearch@IU, our federated search service
o The XSLT that assigned metadata to Lucene index fields was a solid base for the indexing needs of other collections.
Phase 1 Solution: Lingering Problems
o Our XSLT for the Generic Search Service wasn’t perfect
o Some complications prevented full automation
o We punted on getting the perfect Lucene analyzer configuration
PHASE II: EXTENDING FOR DIFFERENT COLLECTIONS
EVIA Digital Archive
Requirement: EVIADA Video Annotation Collection
[Diagram: a Field Collection made up of a Field Collection Object and several Video Objects, alongside the Custom Annotation Software]
Requirement: EVIADA Video Annotation Collection
o Complex Data model
  • One Fedora object which is addressable and discoverable in parts
o New features
  • Faceted Search and Browse
  • Extensive custom fields
Requirements: IN Harmony Sheet Music Collection
Requirements: IN Harmony Sheet Music Collection
o Complex Content model
  • Three types of objects below the collection
    • Sheet music
    • Individual Score
    • Page Image
[Image: Chariot Race March]
Requirements: IN Harmony Sheet Music Collection
o New Features
  • Faceted Search and Browse
  • Exact match searches
  • Date range searches
  • Dozens of very specific fields
  • Sorting by date or title
Options:
o Extend our existing implementation
  • All too appealing because of familiarity and “sunk costs”
  • Major conflicts between existing model and desired model could result in unmaintainable “hackish” implementations
o Switch to a new infrastructure
  • Would be great, if something existed that met our needs without having to rework everything
o Some combination
  • Best of both worlds?
Options: Faceted Search and Browse
o Use Solr
  • Built-in support for facets
  • Is a service layer with an XML response
  • But do we really want to abandon SRU, or maintain two search service protocols?
Options: Faceted Search and Browse
o Extend SRU Implementation
  • Prevents the need for yet another service layer
  • Has wide reuse potential
  • Could be backed by Solr without substantially more effort.
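Whichever protocol carries them, facets come down to counting field values across the result set. The sketch below shows only that counting step; the `facet_counts` name and the record layout (a dict of field name to list of values) are assumptions, and it says nothing about how the counts were encoded in the SRU response.

```python
from collections import Counter


def facet_counts(records, facet_fields):
    """Count how often each value of each facet field occurs in a result set."""
    counts = {field: Counter() for field in facet_fields}
    for record in records:
        for field in facet_fields:
            # Assumption: each record maps a field name to a list of values.
            for value in record.get(field, []):
                counts[field][value] += 1
    return counts


# Made-up result set, for illustration only.
results = [
    {"genre": ["march"], "date": ["1900"]},
    {"genre": ["march", "waltz"]},
]
facets = facet_counts(results, ["genre", "date"])
```

A browse interface can then list each value with its count, for example via `facets["genre"].most_common()`.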
Solution: Faceted Search over SRU
[Diagram: SRU Service (now with facet support)]
Solution: Other SRU Improvements
o More complete CQL support
  • Easy Improvements
    • Operators (and, or, not, any, all)
    • Application-specific fields
Solutions: Other SRU Improvements
o More complete CQL support
  • Difficult Improvements
    • “cql.exact” relation
    • facet implementation
    • sort support
Example query: dc.subject exact “United Kingdom”
[Diagram: the dc.subject value is indexed as dc.subject, dc.subject.exact, and dc.subject.sort]
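One way to read the field names above is that a single metadata value is written to the index several times, once per kind of use. The sketch below follows that reading; the tokenization and normalization rules are assumptions, since the slides do not say how each variant was analyzed.

```python
def index_variants(field, value):
    """Index one metadata value three ways: tokenized, exact, and sortable."""
    tokens = value.lower().split()
    return {
        field: tokens,                      # tokenized, for keyword-style matching
        f"{field}.exact": value,            # whole value kept intact for exact matches
        f"{field}.sort": " ".join(tokens),  # normalized key for sorting
    }


def exact_match(doc, field, term):
    """Evaluate an 'exact' relation against the untokenized variant."""
    return doc.get(f"{field}.exact") == term


doc = index_variants("dc.subject", "United Kingdom")
```

With this layout, a query such as dc.subject exact “United Kingdom” is answered from dc.subject.exact, so a record whose subject is only "Kingdom" does not match.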
Options: Index Generation
Fedora Generic Search Service
Homegrown Solution
Reconsideration: GSearch
o Limited by the one-to-one relationship between Lucene documents and Fedora objects
o Embedding valid XML in CDATA for storage in Lucene is messy and becomes error-prone as the metadata becomes more diverse
o We really only use it to generate a Lucene index
Consideration: Solr
o Robust wrapper for Lucene
  • Exposes service to update index
  • Exposes search features as a service
  • Abstracts away much of the complexity of Lucene
o Migrating existing search indexes would be prohibitively time consuming, but it might be the best tool to bring up new collections
Solution: Custom index service
o A service whose initial functionality is simply to create and maintain Lucene Index directories that are served by SRU.
  • Can easily be extended/configured to use different search engines or to delegate the process entirely (perhaps to Solr)
o Support for existing GSearch style XSLT
o Simple Java interface to allow for easy index implementations.
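The slides describe a Java interface; the sketch below is a Python analogue of what such a plug-in point might look like, not the actual interface. All class and method names are hypothetical. It also shows how a compound writer can emit several index documents for one repository object, which removes the one-to-one limit noted for GSearch.

```python
from abc import ABC, abstractmethod


class IndexWriter(ABC):
    """Hypothetical plug-in point: turns one repository object into index documents."""

    @abstractmethod
    def documents_for(self, pid: str, metadata: dict) -> list:
        ...


class BasicIndexWriter(IndexWriter):
    """One index document per repository object."""

    def documents_for(self, pid, metadata):
        return [{"id": pid, **metadata}]


class CompoundIndexWriter(IndexWriter):
    """One index document per addressable part of a repository object."""

    def documents_for(self, pid, metadata):
        parts = metadata.get("parts", [])
        return [
            {"id": f"{pid}#{i}", "parent": pid, **part}
            for i, part in enumerate(parts, start=1)
        ]


def build_index(writer, objects):
    """Run every object through the chosen writer and collect the documents."""
    index = []
    for pid, metadata in objects.items():
        index.extend(writer.documents_for(pid, metadata))
    return index


# Made-up object with two parts, for illustration only.
objects = {"demo:1": {"title": "Field collection", "parts": [{"title": "Scene A"}, {"title": "Scene B"}]}}
```

Swapping `BasicIndexWriter()` for `CompoundIndexWriter()` in `build_index` changes the index shape without touching the service that drives it, which is the kind of flexibility the interface is meant to provide.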
Search Service
[Diagram: the Custom Index Service drives four index writers (Basic Index Writer, GSearch Style XSLT Index Writer, New Style XSLT Index Writer, Compound Model Java Index Writer). They index into Lucene Databases of the OCLC SRU Implementation, configured respectively for quick id resolution, basic search, advanced search, and compound model searches.]
Search Service
[Diagram: the Custom Index Service drives the Basic Index Writer, GSearch Style XSLT Index Writer, New Style XSLT Index Writer, Compound Model Java Index Writer, and a Solr Wrapping Index. The OCLC SRU Implementation sits over Lucene Databases configured for quick id resolution, basic search, advanced search, and compound model searches, plus a Solr Database configured to interface with Solr.]
Future Plans
o Full Text searching
  • Search text of entire books or journals
  • Determine where in the hierarchy the match occurred
  • Provide snippets with highlighted matches in context for the search results listing
o Solutions
  • XTF, Solr through our custom index service
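To make "snippets with highlighted matches in context" concrete, here is a minimal sketch of the idea. It is not how XTF or Solr implement highlighting; the `snippet` function, the context width, and the `<b>` markers are all assumptions chosen for illustration.

```python
import re


def snippet(text, term, context=30, mark=("<b>", "</b>")):
    """Return the first match of term with surrounding context, or None."""
    match = re.search(re.escape(term), text, flags=re.IGNORECASE)
    if match is None:
        return None
    start = max(match.start() - context, 0)
    end = min(match.end() + context, len(text))
    highlighted = f"{mark[0]}{match.group(0)}{mark[1]}"
    # Ellipses signal that the snippet was cut out of a longer text.
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return f"{prefix}{text[start:match.start()]}{highlighted}{text[match.end():end]}{suffix}"
```

For full-text collections, the same call would run per page or per chapter so that the result listing can also say where in the hierarchy the match occurred.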
Conclusion
o Most of the work is configuring the index, which is a requirement that cannot be avoided.
o Migration doesn’t have to be difficult or disruptive
o Always be willing and able to consider new products and technologies
Thanks! Any Questions?
o www.dlib.indiana.edu
o wiki.dlib.indiana.edu/confluence/x/AQI