eXist Indexing
Using the right index for your data

Date: 9/29/2008

Dan McCreary, President
Dan McCreary & Associates
[email protected]
(952) 931-9198

Metadata Solutions

Copyright 2008 Dan McCreary & Associates

Overview

• Using eXist Indexes

• Types of indexes

• Configuring indexes

• Testing indexes

Index Types

Structural Indexes: These index the nodal structure, elements (tags) and attributes, of the documents in a collection.

Range Indexes: Ideal for indexing measurements (integers, doubles, floats, currency or discrete value measurements).

Full Text Indexes: These map specific text nodes and attributes of the documents in a collection to text tokens.

NGram Indexes: These map specific text nodes and attributes of the documents in a collection to split tokens of n characters (where n = 3 by default). They are very efficient for exact substring searches and for queries on program code, which cannot easily be split into whitespace-separated tokens and is therefore a poor match for the full text index.

Spatial Indexes (Experimental): These map elements of the documents in a collection containing geo-referenced geometries to dedicated data structures that allow efficient spatial queries.
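To make the n-gram idea concrete, here is a minimal Python sketch. It is a simplified model, not eXist's implementation: it splits text into overlapping 3-character tokens and uses them to narrow down candidates for an exact substring search. The function names and the sample documents are hypothetical.

```python
from collections import defaultdict

def ngrams(text, n=3):
    """Return the overlapping n-character tokens of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_ngram_index(docs, n=3):
    """Map each n-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in ngrams(text, n):
            index[gram].add(doc_id)
    return index

def substring_search(index, docs, query, n=3):
    """Find the ids of documents that contain query as an exact substring."""
    grams = ngrams(query, n)
    if not grams:
        # Query shorter than n: there is no n-gram to look up, so scan everything.
        candidates = set(docs)
    else:
        # Only documents containing every n-gram of the query can match.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Verify the candidates, since shared n-grams do not guarantee adjacency.
    return sorted(d for d in candidates if query in docs[d])

# Hypothetical code-like strings that do not split well on whitespace.
docs = {1: "for(int i=0;i<n;i++)", 2: "while(i<n)", 3: "printf(msg)"}
index = build_ngram_index(docs)
```

Calling substring_search(index, docs, "i<n") looks up a single 3-gram and then confirms the match in the candidate documents only.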

Structural Indexes

• Keeps track of the elements (tags), attributes, and nodal structure for all XML documents in a collection

• It is created and maintained automatically in eXist

• Cannot be reconfigured or disabled by the user

• Used by all non-wildcard XPath and XQuery expressions in eXist (not “//*”)

• Stored in the database file elements.dbx

How Do Structural Indexes Work?

• Maps every element and attribute qname (or qualified name) in a document collection to a list of <documentId, nodeId> pairs.

• This mapping is used by the query engine to resolve queries for a given XPath expression.

• Example: //book/section
  – eXist uses two index lookups: the first for the <book> node, and the second for the <section> node
  – eXist computes the structural join between these node sets to determine which <section> elements are in fact children of <book> elements
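The two lookups and the join for //book/section can be modelled in a few lines of Python. This is only an illustrative sketch: the node tuples, the explicit parent map, and the function names are assumptions made for the example, and eXist's real node identifiers and join algorithm are more sophisticated.

```python
from collections import defaultdict

# Hypothetical nodes: (document_id, node_id, parent_node_id, qname).
# parent_node_id is None for a document's root element.
NODES = [
    (1, 1, None, "book"),
    (1, 2, 1, "section"),
    (1, 3, 1, "title"),
    (2, 1, None, "article"),
    (2, 2, 1, "section"),
]

def build_structural_index(nodes):
    """Map each qname to a list of (document_id, node_id) pairs."""
    index = defaultdict(list)
    for doc_id, node_id, _parent, qname in nodes:
        index[qname].append((doc_id, node_id))
    return index

def child_join(parents, children, parent_of):
    """Keep only the children whose parent is in the parent node set."""
    parent_set = set(parents)
    return [c for c in children if (c[0], parent_of[c]) in parent_set]

index = build_structural_index(NODES)
# Simplification: the parent of each node is stored in an explicit map.
parent_of = {(d, n): p for d, n, p, _q in NODES}

books = index["book"]        # first index lookup
sections = index["section"]  # second index lookup
result = child_join(books, sections, parent_of)  # structural join
```

Only the <section> in document 1 survives the join, because the <section> in document 2 is a child of <article>.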

Range Index

• Range indexes provide a shortcut for the database to directly select nodes based on their typed values.

• Used when matching or comparing nodes by way of standard XPath operators and functions.

• Without a range index, comparison operators like =, > or < fall back to a "brute-force" inspection of the DOM, which can be extremely slow if eXist has to search through possibly millions of nodes, because each node has to be loaded and cast to the target type.

Example

• You have a catalog that contains 50,000 items

• You want to find all items that have a price under $100

• XPath: //item[price < 100.0]

• Without a range index you would have to do up to 50,000 comparisons for each search

• With a range index eXist can quickly find the subset of items that have a price under $100 with a single lookup
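The difference between the two strategies can be sketched in Python. This is a conceptual model only, assuming a sorted list as the "index" and a binary search as the "lookup"; it does not describe how eXist stores range indexes. The catalog data and function names are made up for the example.

```python
import bisect

def brute_force(items, max_price):
    """Inspect every item: one comparison per item."""
    return sorted(item_id for item_id, price in items if price < max_price)

def build_range_index(items):
    """Sort (price, item_id) pairs once so later lookups can binary-search."""
    return sorted((price, item_id) for item_id, price in items)

def range_lookup(index, max_price):
    """Return the ids of items with price < max_price using one binary search."""
    # (max_price,) sorts before every (max_price, item_id) pair,
    # so the cut point excludes items priced exactly at max_price.
    cut = bisect.bisect_left(index, (max_price,))
    return sorted(item_id for _price, item_id in index[:cut])

# Hypothetical catalog of 50,000 items with prices from 0 to 999.
items = [(i, float((i * 37) % 1000)) for i in range(50_000)]
index = build_range_index(items)
cheap = range_lookup(index, 100.0)
```

Both functions return the same items, but range_lookup finds the cut point in roughly log2(50,000), or about 16, comparisons instead of 50,000.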

Restriction on Ranges

• All collections that are included in the search must be indexed

• The data types must match

• There must be no context dependencies

All Collections Must Be Indexed

• The range index must be defined on all items in the input sequence
  – If you search collections A and B but only A is range indexed, the query will not use the indexes

[Figure: an XQuery searching Collection A (with range index) and Collection B (no range index)]

Full Search Fallback

• If the collections do not all have the exact same type of range index, the search automatically reverts to a full search of the documents (slow), as it does whenever no range index applies

Data Types Must Match

• The index data type (first argument type) must match the test data type (second argument type)

• Wrong
  – //item[price = '1000.0']

• Right
  – //item[price < xs:double($max-price)]

Context Dependencies

• The right-hand argument must not have dependencies on the current context item.

• Wrong:
  – //item[price = self]

• Right:
  – //item[xs:double($max-price) < price]
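The three restrictions on range indexes can be summarised as one check. The Python sketch below is a hypothetical model of the rules as stated on these slides, not eXist's query optimizer; the Comparison class, its fields, and the collection paths are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Comparison:
    collections: list           # collections in the input sequence
    operand_type: str           # type of the value compared against, e.g. "xs:double"
    operand_uses_context: bool  # True if the operand depends on the context item

def range_index_applies(comparison, indexed_types):
    """Return True only if all three restrictions are satisfied.

    indexed_types maps a collection path to the type of the range index
    defined on the queried element (absent if there is no such index).
    """
    types = [indexed_types.get(c) for c in comparison.collections]
    if any(t is None for t in types):
        return False  # some collection has no range index
    if any(t != comparison.operand_type for t in types):
        return False  # index type and operand type differ
    if comparison.operand_uses_context:
        return False  # operand depends on the current context item
    return True

# Hypothetical setup: only collection A has a range index on price.
indexed_types = {"/db/A": "xs:double"}
ok = Comparison(["/db/A"], "xs:double", False)
unindexed = Comparison(["/db/A", "/db/B"], "xs:double", False)
wrong_type = Comparison(["/db/A"], "xs:string", False)
context_dependent = Comparison(["/db/A"], "xs:double", True)
```

Only the first comparison passes all three checks; each of the others fails exactly one restriction.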

Fulltext Index

• Used to query for a sequence of separate "words" or tokens in a longer stream of text.

• While building the index, the text is parsed into single tokens which are then stored in the index.

• Historically, eXist has created a default full text index on all text nodes and attribute values. This will likely change in the future as the index is undergoing a major redesign. As the index becomes more configurable, we may drop the current default indexing behaviour.

• As with the other index types, you can configure the full text index in the collection configuration, and we will try to keep the configuration of the new index backwards compatible. We therefore recommend creating a collection configuration file, disabling the default index-all behaviour, and defining some explicit full text indexes on your documents. The details of this process will be described below.

• The full text index is only used in combination with eXist's fulltext search extensions. In particular, you can use the following eXist-specific operators and functions that apply a fulltext index:

Fulltext Operators and Functions

• Operators:
  – &=
  – |=

• Main Functions
  – text:match-all()
  – text:match-any()
  – near()
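The idea behind a token-based index and the "match all" versus "match any" style of query can be sketched in Python. This is a toy inverted index, assuming that &= behaves like text:match-all() (every term must occur) and |= like text:match-any() (at least one term must occur); the tokenizer, function names, and sample texts are hypothetical and much simpler than eXist's.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())

def build_fulltext_index(docs):
    """Map each token to the set of document ids in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def match_all(index, query):
    """Ids of documents containing every query term."""
    sets = [index.get(t, set()) for t in tokenize(query)]
    return sorted(set.intersection(*sets)) if sets else []

def match_any(index, query):
    """Ids of documents containing at least one query term."""
    sets = [index.get(t, set()) for t in tokenize(query)]
    return sorted(set.union(*sets)) if sets else []

# Hypothetical text nodes keyed by document id.
docs = {
    1: "Native XML databases",
    2: "XML metadata registries",
    3: "Semantic web",
}
index = build_fulltext_index(docs)
```

A token that was never indexed simply has no entry, so queries for it return nothing rather than falling back to a scan.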

Disabling Indexes

• If you have disabled full text indexing for certain elements, these operators and functions will also be effectively disabled, and will not return matches.

• eXist will not return results for queries that would return results if fulltext indexing were enabled.

• This is in direct contrast to the operation of range indexing, which does fall back to full searching of the document if no range index applies.

Geospatial Indexing (Beta)

• A working proof-of-concept index, which listens for spatial geometries described through the Geography Markup Language (GML)

Sample Geospatial Data

<gml:Polygon xmlns:gml="http://www.opengis.net/gml" srsName="osgb:BNG">
  <gml:outerBoundaryIs>
    <gml:LinearRing>
      <gml:coordinates>
        278515.400,187060.450 278515.150,187057.950
        278516.350,187057.150 278546.700,187054.000
        278580.550,187050.900 278609.500,187048.100
        278609.750,187051.250 278574.750,187054.650
        278544.950,187057.450 278515.400,187060.450
      </gml:coordinates>
    </gml:LinearRing>
  </gml:outerBoundaryIs>
</gml:Polygon>

Sample of Geospatial Queries

• What is the distance from point X to point Y?

• What items are within X miles of this point?

• What items are inside county Y?
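The geometry behind these three questions can be sketched in plain Python. The example assumes planar coordinates (such as easting/northing pairs) and uses ray casting for the containment test; it shows the calculations only and says nothing about how eXist's spatial index evaluates them. All names and the sample square are hypothetical.

```python
import math

def distance(p, q):
    """Straight-line distance between two planar points."""
    return math.hypot(q[0] - p[0], q[1] - p[1])

def within_radius(points, center, radius):
    """Points that lie no farther than radius from center."""
    return [p for p in points if distance(p, center) <= radius]

def point_in_polygon(point, ring):
    """Ray-casting test; ring is a closed list of (x, y) vertices."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        # Does this edge cross the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical "county": a closed square ring, first vertex repeated at the end.
county = [(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)]
items = [(5, 5), (15, 5), (1, 9)]
inside_county = [p for p in items if point_in_polygon(p, county)]
```

A spatial index earns its keep by avoiding running tests like these against every geometry in the collection.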

Custom Indexing

• eXist versions 1.2 and later feature a modularized indexing architecture

• Allows arbitrary indexes to be plugged into an indexing pipeline

• Requires Java development skills

• See
  – http://exist-db.org/devguide_indexes.html

For the eXist Database Administrator

• For each collection you want to administer
  – For /db/foo, create a file collection.xconf and store it as /db/system/config/db/foo/collection.xconf

• Inheritance
  – Subcollections which do not have a collection.xconf file of their own will be governed by the configuration policy specified for the closest ancestor collection which does have such a file

Inheritance Example

/db

/db/foo

/db/foo/bar

/db/system/config/db/foo/collection.xconf

If no collection.xconf exists for a collection such as /db/foo/bar, it defaults to the parent’s collection configuration.
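The inheritance rule amounts to walking up the collection path until a configuration file is found. The Python sketch below models that rule under the path layout shown above; the function names and the stored_paths set are hypothetical, and eXist performs this resolution internally.

```python
import posixpath

CONFIG_ROOT = "/db/system/config"

def config_path(collection):
    """Path where the collection.xconf for a collection would be stored."""
    return CONFIG_ROOT + collection + "/collection.xconf"

def resolve_config(collection, stored_paths):
    """Return the collection.xconf of the closest ancestor that has one.

    stored_paths is the set of configuration files that exist in the database.
    Returns None if no collection up to /db has a configuration.
    """
    current = collection
    while True:
        candidate = config_path(current)
        if candidate in stored_paths:
            return candidate
        if current in ("/db", "/"):
            return None
        current = posixpath.dirname(current)  # step up to the parent collection

# Only /db/foo has a configuration file in this example.
stored_paths = {"/db/system/config/db/foo/collection.xconf"}
bar_config = resolve_config("/db/foo/bar", stored_paths)
```

Here /db/foo/bar has no file of its own, so it resolves to the configuration stored for /db/foo.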

Thank You!

Please contact me for more information:

• Native XML Databases

• Metadata Management

• Metadata Registries

• Service Oriented Architectures

• Business Intelligence and Data Warehouse

• Semantic Web

Dan McCreary, President
Dan McCreary & Associates
Metadata Strategy
[email protected]
(952) 931-9198

Index Creation and Updates

• The eXist index system automatically maintains and updates indexes defined by the user

• You therefore do not need to update an index when you update a database document or collection.

• eXist will even update indexes following partial document updates via XUpdate or XQuery Update expressions.

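• The only exception to eXist's automatic update occurs when you add a new index definition to an existing database collection: the new definition applies only to data stored afterwards, so existing documents must be reindexed manually before the new index covers them.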

Sample Collection Index

<collection xmlns="http://exist-db.org/collection-config/1.0">
  <index>
    <fulltext default="none" attributes="false">
      <!-- Full text indexes -->
      <create qname="author"/>
      <create qname="title" content="mixed"/>
    </fulltext>

    <!-- Range indexes -->
    <create qname="title" type="xs:string"/>
    <create qname="author" type="xs:string"/>
    <create qname="year" type="xs:int"/>

    <!-- N-gram indexes -->
    <ngram qname="author"/>
    <ngram qname="title"/>
  </index>
</collection>