XTF in Depth
Powerful Search and Display for Electronic Text
Martin Haye, California Digital Library
January 2009 presentation at University of Sydney
XTF in Depth
Part 1:
  What is XTF and how does it compare?
  Who is using it?
  What needs does it address?
  New features in 2.1
  Design and data flow
  Adapting Lucene and Saxon
  Planned improvements
Part 2: Interactive demos
XTF in 5 minutes
  eXtensible Text Framework
  Search and display technology from CDL
  Open-source Java framework
  Powerful and highly configurable
  All about rapid prototyping, fast deployment, and incremental improvement
  XML + full-text search
  Also indexes PDF, HTML, Word
    Excel and PowerPoint coming soon
XTF in 5 minutes
  Search: query power and speed of Lucene, plus:
    search results shown in context
    keyword search, facets, spelling, lots more
  View: processing power of Saxon, plus:
    large file optimizations, hit markup
  Configure and customize exclusively in XSLT
  Flexible, overlapping collections
  Mature, tightly integrated, well documented
  In use at CDL and many other places
What XTF is not
  It is not a content management system
    Creation (conversion, scanning, manual)
    Ingest / administration
    Editing
    Preservation
  Not built for remote administration
  Not a true XML database
    but close
  Not Google
    Google: one interface to vast grab-bag of data
    XTF: crafted interfaces to high-quality data sets
How does XTF compare?
[Chart: Greenstone, XTF 2.0, XTF 2.1 and Solr placed on two axes, “Turn-key / easy” (vertical) and “Customizable / Powerful” (horizontal)]
* caveat: based on my limited experience with Greenstone and Solr
Online Archive of California
eScholarship Editions
calisphere
Mark Twain Project Online
UC Berkeley
University of Sydney
Encyclopedia of Chicago
Indiana University: Newton
Indiana University: Swinburne
Sweden
Brazil
Italy
Needs
Let’s look at four needs that XTF was created to address:
  Diverse data
  Open software
  Rapid deployment
  Community involvement
Needs: 1. Diverse data
Our collections: many and diverse
  eScholarship (TEI, PDF)
    • UC Press monographs (a text may be > 10 MB)
    • 25,000 scholarly articles in PDF
  Mark Twain
    • Hand-crafted critical edition (TEI + MODS)
  OAC: finding aids, images, books, manuscripts
    • Japanese American Relocation Digital Archives
    • TEI, EAD, MODS
  Book scanning projects (Google, Internet Archive)
    • Thousands of scanned books (PDF + DC)
    • Millions of Melvyl catalog records (MARC)
Needs: 2. Open software
Digital Publishing Products
  “Black box” (no control over fixes & features)
  Often not standards-based
  Tech companies have short lifespans
  Support often spotty
  Data can be held hostage, or even lost
  $$$$$
Needs: 3. Rapid deployment
  New collections arriving
  Users don't want to wait a year for access
  Many “what if” and “wouldn't it be cool” requests from our staff
  Java programmers are expensive
  Look & feel goes stale quickly
  Barrage of feature requests
Needs: 4. Community involvement
  We want to share the load
  For XTF 2.1, we asked the XTF community to vote for features they wanted
  At CDL we try to align our development to the needs of the community
  Result: Everybody benefits
New and improved in 2.1
  Faceted browse
  Search flexibility
  Bookbag
  Spelling correction
  Similar items
  OAI-PMH
Faceted browse
  Previously, implementing faceted browse required lots of XSLT programming.
  Hierarchical facets were even harder.
  Supporting them required us to deeply refactor the stylesheets, but now it’s simple to add new facets.
Hierarchical facets
Search flexibility
Keyword search: single box (now default). Internally, searches multiple fields.
Advanced search: explicitly fill in constraints for various fields
Freeform search (new): text-based field specifiers, AND, OR, parentheses, etc.
Keyword search
Advanced search
Freeform search
OAI-PMH
  This fit nicely into XTF’s architecture
  Simple but conforming implementation
Bookbag
  Refactored the AJAX to use YUI (Yahoo User Interface widgets)
  Still session-based
  Now supports emailing the bookbag
Spelling correction
  Unicode bug fixes
  On by default and fully integrated
Similar items
  Allows user to see “more like this”
  Improved AJAX integration
  On by default, no configuration needed
Other changes in XTF 2.1
  Built-in NLM “Blue”, TEI P5, MS Word support (still support TEI P4, EAD, PDF, HTML, text)
  Valid XHTML output
  RawQuery servlet to provide a query back-end to a (e.g. Ruby) front-end or mash-up
  Bug fixes and minor changes (many reported/requested by users)
Wiki documentation
Design philosophy
  Adaptation through programming
  XTF is still about building what you want using a set of powerful tools
  But now:
    Stylesheets are more modular
    Build interfaces faster using honed widgets
    Prettier UI to start with
XTF is open, standards-based
  Based on free, open-source tools:
    Java SDK 1.5+
    Lucene 2.1 full-text search toolkit
    Saxon 8.9 XSLT processor
  Unicode support throughout
  XTF itself is open-source (BSD license)
  No native code: pure Java and XSLT 2.0
  Runs on Windows, Solaris, Linux, MacOS
  Drops right into Tomcat or Resin
  Lots of user-fixable documentation
Modular
  Use crossQuery servlet to search, dynaXML to display and navigate. Deploy one or both.
  Stylesheets govern flow of data: no Java programming required
  Easy to add features incrementally
  100% configurable “look and feel”
  Skin & slice: one system can have several interfaces and multiple “brands”
  Collection subsetting driven by meta-data
Why XSLT?
  XSLT is a natural fit for XML
    Powerful, dynamic language
    Incredibly high-quality, free processor (Saxon)
  Why not Java/Struts?
    Poor for rapid prototyping, steep learning curve
  Why not Ruby?
    Not necessarily a good match for XML data
    Can be too clever by half
    But a smart mash-up might be cool...
Indexing Process
Indexing
  Input filters adapt to many doc types
    Any XML doc type
    PDF, MS Word, plain text, untidy HTML
  XTF is agnostic regarding:
    Document identifiers
    Filesystem organization
      • Uses document selector stylesheet to identify and classify documents in filesystem
    Meta-data storage
  Incremental indexing
    Simply update filesystem then run indexer.
crossQuery servlet
Flexible Search/Display
  One query, many collections
    XTF enables “Virtual collections”
  Output filters for various result views
    e.g. simple vs. advanced search form, results in brief vs. long format, etc.
  Query parsers for different search interfaces
  Interface to other query protocols
    SRU and OAI-PMH already implemented
    Should be easy to adapt other queries:
      • Very extensive set of query operators
      • Flexible query composition
  Faceted browse
Query Power
  Many operators
    AND, OR, NEAR, NOT, phrase, range, wildcard
    OR-NEAR, multi-field AND, “more like this”
  Arbitrarily complex queries
    Combine full-text search with meta-data
    Unusual queries like: "dynamic duo" near "red phone"
  Structure-aware searching
    e.g. search only headings, or only bibliographies
    But must pre-define which structures to search
More Power
  Fixed-length snippets
    Highlight the hit and just the hit
  Sort by relevance, or any meta-data fields
  Spelling correction
  No penalty for huge documents
    XTF “lazily” pulls in only those parts used by a particular request (e.g. show just Chapter 1)
  Scalable
    Proven with 10 million records / 14 GB of data
    but beyond that, Solr looks better
  Authentication: IP lists, LDAP, or external
dynaXML servlet
Adapting Lucene and Saxon
  Adapting Lucene
    Chunking, flattening, hit marking, stop-words, setting limits, insensitivity, special queries, faceted browsing, spelling correction
  Adapting Saxon
    Lazy trees, misc. extensions
Adapting Lucene: Chunking
  Why
    Lucene's proximity searches perform best on small documents
    Small chunks enable efficient generation of 80-character “snippet” surrounding each hit
  How
    XTF breaks text blocks into 200-word chunks
    Chunks overlap to detect a hit starting in one and ending in the next.
    Each chunk carries structural info, plus pointer to location in XML doc.
    Only first chunk carries meta-data for doc
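The overlapping-chunk idea can be sketched in a few lines of Python. This is only an illustration, not XTF's Java code: the function name `chunk_words` and the 20-word overlap are assumptions (the slide gives the 200-word chunk size but not the overlap).

```python
def chunk_words(words, chunk_size=200, overlap=20):
    """Split a word list into fixed-size chunks that overlap.

    Returns (start_offset, chunk) pairs; the offset plays the role of the
    "pointer to location" each chunk carries. The overlap size is an
    assumed value for illustration.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append((start, words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


words = [f"w{i}" for i in range(500)]
chunks = chunk_words(words)
# A phrase straddling the boundary of chunk 0 is fully contained in chunk 1,
# because the last 20 words of one chunk are the first 20 of the next.
```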
Adapting Lucene: Flattening XML
  XSLT prefilter flattens XML structure
    Series of text blocks
    Block tagged with structural info for search
    Prefilter can boost or suppress sections
    Fine control over proximity matching
  Prefilter gathers/marks meta-data
    Can come from within the document, from an XML doc in filesystem, or fetched from a URL.
    Synthesize meta-data (e.g. sort fields, facets)
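XTF does its flattening in an XSLT prefilter. As a rough analogy only, the Python sketch below turns a small XML document into a series of text blocks, each tagged with the element path it came from. The function name `flatten` and the path-style tag are assumptions for illustration, not XTF's actual markup.

```python
import xml.etree.ElementTree as ET


def flatten(xml_text):
    """Flatten an XML document into (structure_path, text) blocks."""
    root = ET.fromstring(xml_text)
    blocks = []

    def walk(elem, path):
        path = path + [elem.tag]
        label = "/".join(path)
        if elem.text and elem.text.strip():
            blocks.append((label, elem.text.strip()))
        for child in elem:
            walk(child, path)
            # Text following a child element still belongs to the parent.
            if child.tail and child.tail.strip():
                blocks.append((label, child.tail.strip()))

    walk(root, [])
    return blocks


blocks = flatten("<doc><head>Title</head><p>Some <b>bold</b> text</p></doc>")
```

Searching "only headings" then amounts to filtering blocks by their structural tag, which is why the structures have to be defined before indexing.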
Adapting Lucene: Hit Marking
  Marking search hits in context
    Lucene doesn't pinpoint location of hits, only gives a score per document
    Custom enhancements to Lucene's “span” logic score and locate each hit.
    dynaXML dynamically adds ranked hits to original XML doc, then sends to XSLT formatter.
    crossQuery forms a snippet around each hit and highlights it.
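Once a hit's position is known, building a fixed-length snippet is straightforward. The sketch below is a hypothetical helper, not XTF's implementation: `make_snippet`, the character offsets, and the bracket markers are all assumptions. It takes the 80-character width from the chunking slide as its default.

```python
def make_snippet(text, hit_start, hit_end, width=80, open_mark="[", close_mark="]"):
    """Return roughly `width` characters of context with the hit marked.

    hit_start/hit_end are character offsets of the hit in `text`.
    """
    hit_len = hit_end - hit_start
    span = max(width, hit_len)
    context = max(width - hit_len, 0)
    # Center the window on the hit, then clamp it to the text boundaries.
    start = max(hit_start - context // 2, 0)
    end = min(start + span, len(text))
    start = max(end - span, 0)  # re-anchor when the hit is near the end
    return (
        text[start:hit_start]
        + open_mark + text[hit_start:hit_end] + close_mark
        + text[hit_end:end]
    )


text = "x" * 100 + "moon" + "y" * 96
snippet = make_snippet(text, 100, 104)
```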
Adapting Lucene: Stop-words
  Robust, efficient stop-word handling
    “the, a, an, it, on...”
    People do use them, and expect corresponding results.
    Lucene normally ignores stop-words, for speed.
    XTF quietly joins stop-words to adjacent words, forming “n-grams”
    Example: “man on the moon” -> man-on on-the the-moon
    Queries are internally rewritten to search for n-grams automatically.
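The n-gram example above can be reproduced with a short Python sketch. The stop-word list and the function name are assumptions, and the sketch emits only the joined pairs shown on the slide; how XTF stores the remaining ordinary words is not covered here.

```python
STOP_WORDS = {"the", "a", "an", "it", "on"}  # illustrative subset


def stop_word_bigrams(text, stop_words=STOP_WORDS):
    """Join each stop-word to its neighbours, as in the slide's example."""
    words = text.lower().split()
    bigrams = []
    for left, right in zip(words, words[1:]):
        # Emit a pair whenever at least one of the two words is a stop-word.
        if left in stop_words or right in stop_words:
            bigrams.append(f"{left}-{right}")
    return bigrams


print(stop_word_bigrams("man on the moon"))  # ['man-on', 'on-the', 'the-moon']
```

Applying the same function to the query text is one way to picture the automatic query rewriting: index and query end up using the same joined terms.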
Adapting Lucene: Setting Limits
  Limits on aberrant queries
    Adjustable limits on number of terms matched by range or wildcard queries
    N-grams naturally make most queries efficient
    Configurable limits on amount of “work” performed by a single query.
  Numeric range query
    Avoids term expansion
    Efficiently filters very granular data, e.g. timestamps: 2006-11-14:12:46:03.77
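A term limit on wildcard queries can be pictured as follows. This is a minimal sketch under assumed names (`expand_wildcard`, `TooManyTermsError`, `max_terms`); XTF's real limits are set in its configuration and enforced inside its Lucene extensions.

```python
import fnmatch


class TooManyTermsError(Exception):
    """Raised when a wildcard matches more index terms than allowed."""


def expand_wildcard(pattern, index_terms, max_terms=50):
    """Expand a wildcard against the index, aborting past `max_terms`."""
    matched = []
    for term in index_terms:
        if fnmatch.fnmatchcase(term, pattern):
            matched.append(term)
            if len(matched) > max_terms:
                # Stop early instead of building an enormous OR query.
                raise TooManyTermsError(
                    f"'{pattern}' matches more than {max_terms} terms"
                )
    return matched


terms = ["cat", "car", "dog", "cab"]
print(expand_wildcard("ca*", terms, max_terms=5))  # ['cat', 'car', 'cab']
```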
Adapting Lucene: Insensitivity
  Accent/diacritic marks
    Many users can't or don't know how to type them
    XTF indexer uses configurable map to remove accents
    crossQuery maps query terms the same way
  Plural
    Convenient for “cat” to match “cats” also
    Configurable map of plural to singular used at index and query time
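The key point is that one normalization runs at both index time and query time. The sketch below uses two tiny hand-written maps; their entries, the lower-casing step, and the name `normalize_term` are assumptions for illustration rather than XTF's shipped configuration.

```python
# Hypothetical map entries; a real deployment would load these from config.
ACCENT_MAP = {"é": "e", "è": "e", "ü": "u", "ñ": "n", "ç": "c"}
PLURAL_MAP = {"cats": "cat", "mice": "mouse"}


def normalize_term(term):
    """Apply the same folding to indexed words and to query words."""
    term = term.lower()  # case folding is an added assumption
    term = "".join(ACCENT_MAP.get(ch, ch) for ch in term)
    return PLURAL_MAP.get(term, term)


indexed = {normalize_term(w) for w in ["Café", "cats"]}
print(normalize_term("cafe") in indexed)  # True
print(normalize_term("cat") in indexed)   # True
```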
Adapting Lucene: Special Queries
  OR-NEAR
    Standard OR query doesn't use proximity
    OR-NEAR: if words nearby, score is boosted
  Multi-field AND
    All terms must be present, in any field.
    Essential for certain keyword searches: against all enemies clarke (matches against title and author)
  More like this
    Auto-calculates “interesting” terms in meta-data
    Creates OR-NEAR query to find similar docs
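Multi-field AND is easy to state in code: every query term must appear somewhere, but different terms may match in different fields. The sketch below is a simplified in-memory check with an assumed function name and record layout, not the Lucene query XTF actually builds.

```python
def multi_field_and(query, record):
    """True if every query term occurs in at least one field of the record."""
    terms = query.lower().split()
    field_words = [set(value.lower().split()) for value in record.values()]
    return all(any(term in words for words in field_words) for term in terms)


record = {"title": "Against All Enemies", "author": "Clarke"}
print(multi_field_and("against all enemies clarke", record))  # True
print(multi_field_and("against all enemies smith", record))   # False
```

A per-field AND would fail on the first query, since neither the title nor the author field contains all four words on its own.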
Adapting Lucene: Faceted Browsing
  Draws facet term list from Lucene index
  Each facet cached in-memory
  Counts per group created dynamically
  Special mini-language to sort/select (esp. useful for hierarchical facets)
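Dynamic per-group counts, including hierarchical groups, can be sketched like this. The `::` level separator, the record layout, and the name `facet_counts` are assumptions made for the example.

```python
from collections import Counter


def facet_counts(hits, field, separator="::"):
    """Count documents per facet value and per ancestor in the hierarchy."""
    counts = Counter()
    for doc in hits:
        groups = set()  # a document counts once per group, even with several values
        for value in doc.get(field, []):
            parts = value.split(separator)
            for depth in range(1, len(parts) + 1):
                groups.add(separator.join(parts[:depth]))
        counts.update(groups)
    return counts


hits = [
    {"subject": ["History::Europe"]},
    {"subject": ["History::Asia", "History::Europe"]},
    {"subject": ["Science"]},
]
counts = facet_counts(hits, "subject")
print(counts["History"], counts["History::Europe"], counts["Science"])  # 2 2 1
```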
Adapting Lucene: Spelling Correction
  Any standard dictionary won't match place and proper names
  Idea: use the index as source of suggestions
  XTF searches words within edit distance 2
  Candidates ranked by weighted score:
    Edit distance (transpositions discounted)
    Frequency of use in the index
    Double-metaphone match
  Multi-word correction uses pair frequencies
  On test data, 80% right suggestion
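A simplified version of index-based suggestion is sketched below. It covers only two of the three ranking signals (edit distance with cheaper transpositions, and index frequency); the double-metaphone signal is left out, and the weights, the 0.5 transposition cost, and the function names are assumptions, not XTF's actual values.

```python
import math


def edit_distance(a, b, transposition_cost=0.5):
    """Edit distance where swapping two adjacent letters is discounted."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = float(i)
    for j in range(cols):
        d[0][j] = float(j)
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0.0 if a[i - 1] == b[j - 1] else 1.0
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + transposition_cost)
    return d[-1][-1]


def suggest(word, term_freqs, max_distance=2.0, distance_weight=2.0, freq_weight=1.0):
    """Pick the best-scoring index term within max_distance of `word`."""
    best, best_score = None, float("-inf")
    for term, freq in term_freqs.items():
        if term == word:
            continue
        dist = edit_distance(word, term)
        if dist > max_distance:
            continue
        # Frequent terms score higher; distant terms score lower.
        score = freq_weight * math.log(1 + freq) - distance_weight * dist
        if score > best_score:
            best, best_score = term, score
    return best


term_freqs = {"sydney": 120, "sidney": 3, "kidney": 40}
print(suggest("sydeny", term_freqs))  # sydney
```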
Adapting Saxon: Lazy Trees
  The need: display small parts of large (> 10MB) XML documents
  Solution: create a binary, random-access version of each document
    XSL keys calculated once and stored
    Only elements accessed by a given request are loaded from disk
    Care must be taken in stylesheets
    Profile mode is useful for optimization
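The general idea of a random-access document store can be shown with a toy format: an offset index up front, followed by the section payloads, so a reader can seek straight to the part it needs. Everything here (the file layout, `write_lazy_store`, `LazyStore`) is invented for illustration and is far simpler than XTF's lazy tree format.

```python
import io
import json
import struct


def write_lazy_store(sections, out):
    """Write: 4-byte index length, JSON index of offsets, then payloads."""
    payload = io.BytesIO()
    index = {}
    for name, text in sections.items():
        data = text.encode("utf-8")
        index[name] = [payload.tell(), len(data)]
        payload.write(data)
    index_bytes = json.dumps(index).encode("utf-8")
    out.write(struct.pack(">I", len(index_bytes)))
    out.write(index_bytes)
    out.write(payload.getvalue())


class LazyStore:
    """Reads only the index at open time; sections are fetched on demand."""

    def __init__(self, stream):
        self._stream = stream
        stream.seek(0)
        (index_len,) = struct.unpack(">I", stream.read(4))
        self._index = json.loads(stream.read(index_len).decode("utf-8"))
        self._base = 4 + index_len
        self.loaded = []  # names of sections actually read, for inspection

    def section(self, name):
        offset, length = self._index[name]
        self._stream.seek(self._base + offset)
        self.loaded.append(name)
        return self._stream.read(length).decode("utf-8")


buf = io.BytesIO()
write_lazy_store({"chapter1": "Call me Ishmael.", "chapter2": "Much later..."}, buf)
store = LazyStore(buf)
print(store.section("chapter1"))  # Call me Ishmael.
```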
Adapting Saxon: Extensions
  More complete SQL database connection
  Ability to call external tools
    Automatic XML conversion in/out
    Timeout enforcement
  File utilities
    Check file existence
    Get file length and timestamp
  Session data
    Key/value pairs
    Value can be XML or plain string
The future
  XTF 2.2:
    Better out-of-box for large EADs
    Fixes for incremental indexing; other bug fixes
    Specify any number of sub-dirs to index
    Possible TEI P5 refactoring
    Background auto-warming of new index
    Support for indexing PowerPoint and Excel files
  Further out:
    A page-turner for scanned texts and converted PDFs
    Pop-up image/PDF page snippets
    And of course, features suggested by users
Demos
I’ll demonstrate the features we talked about on several different XTF sites “out in the wild.”
Fin
Project: xtf.sourceforge.net
Docs: xtf.wiki.sourceforge.net
Discuss: groups.google.com/group/xtf-user
This talk: xtf.sourceforge.net/talks/2009-01-23.ppt