XTF in Depth

Powerful Search and Display for Electronic Text

Martin Haye, California Digital Library

January 2009 presentation at University of Sydney

XTF in Depth

Part 1:

What is XTF and how does it compare?
Who is using it?
What needs does it address?
New features in 2.1
Design and data flow
Adapting Lucene and Saxon
Planned improvements

Part 2: Interactive demos

XTF in 5 minutes

eXtensible Text Framework
Search and display technology from CDL
Open-source Java framework
Powerful and highly configurable
All about rapid prototyping, fast deployment, and incremental improvement
XML + full-text search
Also indexes PDF, HTML, Word
• Excel and PowerPoint coming soon

XTF in 5 minutes

Search: query power and speed of Lucene, plus:
• search results shown in context
• keyword search, facets, spelling, lots more
View: processing power of Saxon, plus:
• large file optimizations, hit markup
Configure and customize exclusively in XSLT
Flexible, overlapping collections
Mature, tightly integrated, well documented
In use at CDL and many other places

What XTF is not

It is not a content management system
• Creation (conversion, scanning, manual)
• Ingest / administration
• Editing
• Preservation
Not built for remote administration
Not a true XML database
• but close
Not Google
• Google: one interface to vast grab-bag of data
• XTF: crafted interfaces to high-quality data sets

How does XTF compare?

[Chart: systems positioned along two axes, "Turn-key / easy" and "Customizable / Powerful". Systems shown: Greenstone, XTF 2.0, XTF 2.1, Solr.]

* caveat: based on my limited experience with Greenstone and Solr

Online Archive of California


eScholarship Editions



calisphere



Mark Twain Project Online



UC Berkeley



University of Sydney



Encyclopedia of Chicago



Indiana University: Newton



Indiana University: Swinburne



Sweden



Brazil



Italy



Needs

Let’s look at four needs that XTF was created to address:
1. Diverse data
2. Open software
3. Rapid deployment
4. Community involvement

Needs: 1. Diverse data

Our collections: many and diverse

eScholarship (TEI, PDF)
• UC Press monographs (a text may be > 10 MB)
• 25,000 scholarly articles in PDF

Mark Twain
• Hand-crafted critical edition (TEI + MODS)

OAC: finding aids, images, books, manuscripts
• Japanese American Relocation Digital Archives
• TEI, EAD, MODS

Book scanning projects (Google, Internet Archive)
• Thousands of scanned books (PDF + DC)
• Millions of Melvyl catalog records (MARC)

Needs: 2. Open software

Digital publishing products:
• “Black box” (no control over fixes & features)
• Often not standards-based
• Tech companies have short lifespans
• Support often spotty
• Data can be held hostage, or even lost
• $$$$$

Needs: 3. Rapid deployment

New collections arriving
Users don't want to wait a year for access
Many “what if” and “wouldn't it be cool” requests from our staff
Java programmers are expensive
Look & feel goes stale quickly
Barrage of feature requests

Needs: 4. Community involvement

We want to share the load
For XTF 2.1, we asked the XTF community to vote for features they wanted
At CDL we try to align our development to the needs of the community
Result: everybody benefits

New and improved in 2.1

Faceted browse
Search flexibility
Bookbag
Spelling correction
Similar items
OAI-PMH

Faceted browse

Previously, implementing faceted browse required lots of XSLT programming.
Hierarchical facets were even harder.
This required us to deeply refactor the stylesheets, but now it’s simple to add new facets.
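
To make the idea of hierarchical facets concrete, here is a minimal Python sketch of how counts roll up through the levels of a facet value. It is not XTF code: the `facet_counts` function, the `subject` field, the `::` separator, and the sample documents are all assumptions chosen for illustration.

```python
from collections import Counter

def facet_counts(docs, field, sep="::"):
    """Count documents under every level of a hierarchical facet value."""
    counts = Counter()
    for doc in docs:
        seen = set()
        for value in doc.get(field, []):
            parts = value.split(sep)
            # A value such as "A::B::C" contributes to "A", "A::B", and "A::B::C".
            for depth in range(1, len(parts) + 1):
                seen.add(sep.join(parts[:depth]))
        # Using a set means each document counts once per facet group.
        counts.update(seen)
    return counts

# Hypothetical sample documents, for illustration only.
docs = [
    {"title": "Doc 1", "subject": ["History::California::Gold Rush"]},
    {"title": "Doc 2", "subject": ["History::California"]},
    {"title": "Doc 3", "subject": ["Literature::American"]},
]
counts = facet_counts(docs, "subject")
```

With these sample documents, "History" and "History::California" each group two documents, which is the kind of drill-down count a hierarchical facet display needs.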

Faceted browse

Hierarchical facets

Search flexibility

Keyword search: single box (now default). Internally, searches multiple fields.

Advanced search: explicitly fill in constraints for various fields

Freeform search (new): text-based field specifiers, AND, OR, parentheses, etc.
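
As a rough illustration of how a single keyword box can search multiple fields internally, the Python sketch below expands each keyword into an OR across fields and ANDs the keywords together, written as a Lucene-style query string. The `expand_keywords` function and the field names are assumptions made for this example; XTF's own field list and internal query representation may differ.

```python
def expand_keywords(query, fields=("title", "creator", "subject", "text")):
    """Expand a plain keyword query into a fielded boolean query string."""
    clauses = []
    for term in query.split():
        # Each keyword may match in any of the configured fields.
        alternatives = " OR ".join(f"{field}:{term}" for field in fields)
        clauses.append(f"({alternatives})")
    # Every keyword must match somewhere.
    return " AND ".join(clauses)

# Hypothetical two-field configuration, for illustration only.
expanded = expand_keywords("mark twain", fields=("title", "text"))
```

The resulting string uses the same building blocks the freeform search exposes directly to the user: field specifiers, AND, OR, and parentheses.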

Keyword search

Advanced search

Freeform search

OAI-PMH

This fit nicely into XTF’s architecture
Simple but conforming implementation
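
For readers new to OAI-PMH, a harvester talks to a repository through plain HTTP requests that carry a `verb` parameter. The Python sketch below builds two such request URLs using the standard `Identify` and `ListRecords` verbs and the `oai_dc` metadata prefix. The base URL and the `oai_request` helper are hypothetical; the actual endpoint path of an XTF installation is not given in this presentation.

```python
from urllib.parse import urlencode

def oai_request(base_url, verb, **args):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    params = {"verb": verb}
    params.update(args)
    return f"{base_url}?{urlencode(params)}"

# Hypothetical endpoint, for illustration only.
BASE = "http://example.org/xtf/oai"

identify = oai_request(BASE, "Identify")
list_records = oai_request(BASE, "ListRecords", metadataPrefix="oai_dc")
```

A harvester would fetch these URLs and parse the XML responses; that step is omitted here.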

Bookbag

Refactored the AJAX to use YUI (Yahoo User Interface widgets)
Still session based
Now supports emailing the bookbag

Bookbag

Spelling correction

Unicode bug fixes
On by default and fully integrated
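
The general idea behind index-based spelling correction can be sketched in a few lines of Python: when a query word is not in the index, suggest the indexed term closest to it by edit distance, preferring more frequent terms on a tie. This is a generic technique shown under stated assumptions, not XTF's actual algorithm; `edit_distance`, `suggest`, and the term frequencies are made up for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def suggest(word, term_freqs, max_distance=2):
    """Return the closest indexed term, or None if the word is known or nothing is close."""
    if word in term_freqs:
        return None
    candidates = [(edit_distance(word, term), -freq, term)
                  for term, freq in term_freqs.items()]
    candidates = [c for c in candidates if c[0] <= max_distance]
    # Smallest distance wins; higher frequency breaks ties.
    return min(candidates)[2] if candidates else None

# Hypothetical index terms and frequencies, for illustration only.
term_freqs = {"twain": 120, "train": 40, "huckleberry": 15}
suggestion = suggest("twian", term_freqs)
```

Drawing suggestions from the index itself means a correction always points at a term that actually occurs in the collection.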

Spelling correction

Similar items

Allows user to see “more like this”
Improved AJAX integration
On by default - no configuration needed
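
One simple way to think about “more like this” is to rank other documents by how many terms they share with the current one. The Python sketch below does this with Jaccard similarity over term sets. It is only an illustration of the concept, not the method XTF uses; the `similar_items` function, document identifiers, and term sets are hypothetical.

```python
def similar_items(target_id, docs, top_n=3):
    """Rank other documents by Jaccard similarity of their term sets."""
    target = docs[target_id]
    scored = []
    for doc_id, terms in docs.items():
        if doc_id == target_id:
            continue
        union = target | terms
        score = len(target & terms) / len(union) if union else 0.0
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; identifier order breaks ties.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_n]]

# Hypothetical documents reduced to sets of index terms.
docs = {
    "tom-sawyer": {"twain", "mississippi", "boyhood", "adventure"},
    "huck-finn": {"twain", "mississippi", "river", "adventure"},
    "gold-rush": {"california", "mining", "adventure"},
    "finding-aid": {"archive", "manuscripts"},
}
related = similar_items("tom-sawyer", docs)
```

A production system would normally weight terms by how rare they are rather than treating all shared terms equally.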
