
Al Akhawayn University in Ifrane

School of Science and Engineering

CSC 5370: XML and Data Management

Indexing structure of XML

by

Soumia Elghani &

Hanaa Talei

Supervised by

Dr. Hachim Haddouti

17 October 2003

Contents

1. Motivation
2. Introduction
3. Full Text Index
    3.1. B+ tree
    3.2. Inverted list
4. Graphs
5. Natix
    5.1. Definition
    5.2. Natix architecture
    5.3. Natix physical model
        5.3.1. Object content
        5.3.2. Large trees
6. Sphinx
    6.1. Document Graph
    6.2. Schema Graph
    6.3. B-trees
7. Lore System
    7.1. Value index or Vindex
    7.2. Text index or Tindex
    7.3. Link index or Lindex
    7.4. Path index or Pindex
8. Index Fabric
9. Conclusion
10. References


1. Motivation

As the Web grows to billions of documents, searching for a specific one becomes a bottleneck, and without support it is practically impossible. To overcome this problem, we need techniques that make querying efficient, namely indexing techniques.

2. Introduction

Path expressions are the common way to query semi-structured data, and indexing can make their evaluation considerably more efficient. Semi-structured data is managed in databases, where it is most often represented as a graph. Native, relational, and object-oriented databases take different approaches to semi-structured data; to improve performance, they use indexing, as described in the following sections. Natix, Index Fabric, Lore, and Sphinx are systems that use indexing to query XML documents.

3. Full Text Index

Full-text search has become an integrated part of modern database management systems. XML technology is growing rapidly, and so is the number of XML documents, which makes storing and retrieving these documents a crucial matter. A newer feature of full-text search is to use the structure of XML documents to narrow the scope of a search, for example to a specific part of the document. This implies that, when retrieving a document, we need to separate its structure from its content.

3.1. B+ tree

The B+ tree is the most widely used of several index structures that maintain their efficiency despite insertion and deletion of data. It takes the form of a balanced tree in which every path from the root to a leaf has the same length. The following table describes some of the differences between the B tree and the B+ tree.

B Tree                   B+ Tree
Multi-way tree           Contains features from the B tree
Dynamic growth           Contains index and data pages
Contains data pages      Dynamic growth

Examples: see slides 7 and 8.
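The balance property and the role of index pages versus data pages can be illustrated with a minimal sketch. It assumes a B+ tree that is bulk-loaded once from sorted keys; insertion and deletion with node splitting and merging, which a real B+ tree needs, are left out. The names `Leaf`, `Inner`, `bulk_load`, `lookup` and `range_scan`, and the fan-out of 3, are illustrative choices and do not come from any of the systems discussed here.

```python
from bisect import bisect_left, bisect_right


class Leaf:
    """Data page: holds keys with their records and a link to the next leaf."""
    def __init__(self, keys, values):
        self.keys, self.values, self.next = keys, values, None


class Inner:
    """Index page: separator keys that route a search to one of the children."""
    def __init__(self, keys, children):
        self.keys, self.children = keys, children


def min_key(node):
    while isinstance(node, Inner):
        node = node.children[0]
    return node.keys[0]


def bulk_load(items, fanout=3):
    """Build the tree level by level, so every leaf ends up at the same depth."""
    items = sorted(items)
    if not items:
        return Leaf([], [])
    level = []
    for i in range(0, len(items), fanout):
        chunk = items[i:i + fanout]
        level.append(Leaf([k for k, _ in chunk], [v for _, v in chunk]))
    for left, right in zip(level, level[1:]):
        left.next = right                      # chain the leaves for range scans
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), fanout):
            group = level[i:i + fanout]
            separators = [min_key(child) for child in group[1:]]
            parents.append(Inner(separators, group))
        level = parents
    return level[0]


def find_leaf(root, key):
    node = root
    while isinstance(node, Inner):
        node = node.children[bisect_right(node.keys, key)]
    return node


def lookup(root, key):
    leaf = find_leaf(root, key)
    i = bisect_left(leaf.keys, key)
    if i < len(leaf.keys) and leaf.keys[i] == key:
        return leaf.values[i]
    return None


def range_scan(root, low, high):
    """Return all (key, value) pairs with low <= key <= high, in key order."""
    leaf, result = find_leaf(root, low), []
    while leaf is not None:
        for k, v in zip(leaf.keys, leaf.values):
            if k > high:
                return result
            if k >= low:
                result.append((k, v))
        leaf = leaf.next
    return result


def leaf_depths(node, depth=0):
    """Set of depths at which leaves occur; a balanced tree yields one value."""
    if isinstance(node, Leaf):
        return {depth}
    return set().union(*(leaf_depths(c, depth + 1) for c in node.children))
```

Only the leaves hold records; the inner nodes hold separator keys, which corresponds to the "index and data pages" entry in the table. The chained leaves are what make range scans cheap once the first leaf has been found.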

3.2. Inverted list

One of the advanced indexing technologies is the inverted list index, which provides much greater functionality and flexibility than a B-tree index. An inverted list index reverses the relationship between the data and its pointers: it stores the data from the database as keys, so that the content can be searched quickly, and it stores pointers back to the database as the indexed data, so that the matching records can be retrieved quickly. There are two tables, as follows:

Word       Document ID
Word1      1, 2, …
Word2      1

Document ID    Word list
1              Person, id, …
2              Student, course, …

An alternative solution proposed during the presentation is to include the word ID in the table to make the search faster.
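The two tables can be sketched in a few lines. The following is a minimal illustration, assuming documents are plain strings keyed by a numeric ID and that words are split on non-word characters; the function names `build_indexes` and `search` and the sample documents are hypothetical.

```python
import re
from collections import defaultdict


def build_indexes(docs):
    """Return (word -> sorted document IDs, document ID -> sorted word list)."""
    word_to_docs = defaultdict(set)
    doc_to_words = {}
    for doc_id, text in docs.items():
        words = re.findall(r"\w+", text.lower())
        doc_to_words[doc_id] = sorted(set(words))
        for word in words:
            word_to_docs[word].add(doc_id)
    return {w: sorted(ids) for w, ids in word_to_docs.items()}, doc_to_words


def search(word_to_docs, *words):
    """Return the IDs of the documents that contain every given word."""
    postings = [set(word_to_docs.get(w.lower(), ())) for w in words]
    return sorted(set.intersection(*postings)) if postings else []


# Invented sample data, loosely following the word lists in the tables above.
docs = {1: "person id name", 2: "student course person"}
word_to_docs, doc_to_words = build_indexes(docs)
```

A query for several words intersects their posting lists, so only the index is touched until the matching document IDs are known.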

4. Graphs

Graphs are another way to represent data.

Figure 1: Directed acyclic graph (DAG)

Figure 2: An index

The graph in figure 1 contains many links, which need to be reduced. To convert it into the index in figure 2, we use the concept of language equivalence. Two nodes are language equivalent if the sets of label paths from the root to each of them are the same (e.g. p1 and p2). Nodes that are language equivalent in the semi-structured data are merged into a single node of the index graph. In this way we obtain a simplified structure that can be stored efficiently in memory. (See the example on page 187 of the book.)
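The merging step can be sketched as follows. This is a minimal illustration that assumes the data graph is acyclic and that every node is reachable from the root; the graph is given as a dictionary from a node to its outgoing (label, child) edges. The sample graph and the function names are invented for the illustration and do not reproduce figures 1 and 2.

```python
from collections import defaultdict


def path_sets(edges, root):
    """Map every node to the set of label paths that lead to it from the root."""
    incoming = defaultdict(list)
    for parent, out in edges.items():
        for label, child in out:
            incoming[child].append((parent, label))
    memo = {}

    def paths(node):
        if node not in memo:
            result = {()} if node == root else set()
            for parent, label in incoming[node]:
                for prefix in paths(parent):     # terminates because the graph is a DAG
                    result.add(prefix + (label,))
            memo[node] = frozenset(result)
        return memo[node]

    nodes = {root} | set(edges) | {c for out in edges.values() for _, c in out}
    return {node: paths(node) for node in nodes}


def build_index(edges, root):
    """Merge language-equivalent nodes and return (class of each node, index edges)."""
    sets = path_sets(edges, root)
    classes = defaultdict(list)
    for node, path_set in sets.items():
        classes[path_set].append(node)
    class_of = {node: tuple(sorted(classes[sets[node]])) for node in sets}
    index_edges = {
        (class_of[parent], label, class_of[child])
        for parent, out in edges.items()
        for label, child in out
    }
    return class_of, index_edges


# Invented sample: p1 and p2 are both reached only by the path "paper".
edges = {
    "root": [("paper", "p1"), ("paper", "p2")],
    "p1": [("author", "a1"), ("title", "t1")],
    "p2": [("author", "a2")],
}
class_of, index_edges = build_index(edges, "root")
```

In this sample the five edges of the data graph collapse into three index edges, because p1 and p2 (and likewise a1 and a2) end up in the same index node.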

5. Natix

5.1. Definition

Natix is an XML repository developed at Mannheim University: an efficient, native repository for storing, retrieving, and managing large tree-structured objects, preferably XML documents. It uses a split algorithm that dynamically maintains physical records smaller than a page, each of which contains a set of connected tree nodes.

5.2. Natix architecture

The Natix architecture consists of six modules:

• Record manager: provides storage space, divided into segments. A segment is a collection of equal-sized pages, each of which holds one or more records.
• Tree storage manager: maps the tree used to model the document onto records.
• Index management
• Query engine
• Schema manager: takes care of the DTD.
• Document manager: handles schema validation.

5.3. Natix physical model

Physical nodes are classified in two ways: by object content and by large trees.

5.3.1. Object content

There are three kinds of node: aggregate (inner nodes), literal (leaf nodes), and proxy (nodes pointing to different records).

5.3.2. Large trees

Large trees are split into subtrees, each of which is stored in a record.

Figure 3: Large tree

One possible distribution of the nodes over records is:

Figure 4: Split tree

In this case, the physical tree is distributed over three records. We need two proxies and two helpers, h1 and h2, which group the children under p1 and p2 into records.
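The idea of replacing a detached subtree by a proxy can be sketched as below. This is a deliberately simplified greedy split and not the Natix split algorithm: it assumes the capacity of a record is a number of nodes rather than bytes, it does not count proxies against that capacity, and it does not create helper nodes. The classes `Node` and `Proxy`, the function `split_tree`, and the sample tree are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)


@dataclass
class Proxy:
    record_id: int  # index of the record that holds the detached subtree


def split_tree(root, capacity):
    """Distribute a tree over records that hold at most `capacity` real nodes each."""
    records = []

    def pack(node):
        kept, size = [], 1                  # size counts this node; proxies are free
        for child in node.children:
            subtree, subtree_size = pack(child)
            if size + subtree_size <= capacity:
                kept.append(subtree)        # child subtree stays in the same record
                size += subtree_size
            else:
                records.append(subtree)     # child subtree gets its own record
                kept.append(Proxy(len(records) - 1))
        return Node(node.name, kept), size

    top, _ = pack(root)
    records.append(top)                     # the record holding the root comes last
    return records


def count_nodes(item):
    """Number of real (non-proxy) nodes stored in a record."""
    if isinstance(item, Proxy):
        return 0
    return 1 + sum(count_nodes(child) for child in item.children)


# Invented sample tree with seven nodes, split with a capacity of three nodes.
tree = Node("a", [Node("b", [Node("c"), Node("d")]),
                  Node("e", [Node("f"), Node("g")])])
records = split_tree(tree, capacity=3)
```

With this sample the tree ends up in three records, and the record holding the root contains only the root and two proxies. Following a proxy means fetching another record, which is the cost a real split algorithm tries to keep low.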

6. Sphinx

Sphinx stands for Schema-conscious Path-Hierarchy Indexing of XML. It is a new XML indexing scheme that uses the DTD to accelerate the search process. Its approach consists of two conversions:

The XML document is converted into its equivalent graph representation, called the “Document Graph”.



The DTD is converted into a graph-based representation, called the “Schema Graph”.

An example DTD and its corresponding XML document are shown in figure 5.

Figure 5: A DTD and its corresponding XML document

Figure 6 depicts the Document Graph, the Schema Graph, and the leaf-level B-trees corresponding to the DTD and XML document provided in figure 5.


Figure 6: Document Graph and the Schema Graph

With reference to figure 6, we describe each of the structures in more detail below.

6.1. Document Graph

The Document Graph is a graph in which:

• the elements of the XML document form the nodes,
• the edges represent the parent-child relationships, and
• the atomic values form the leaves.

Attributes are represented in the same way as elements: they form additional nodes in the graph. To make both top-down and bottom-up traversal possible, the Document Graph is created with bi-directional links.

6.2. Schema Graph

In the Schema Graph, each node represents either an element or an attribute that is present in the DTD. The Schema Graph is also bi-directional. When the DTD is transformed into this graph, all element cardinality constraints are ignored: for example, ‘a?’, ‘a+’ and ‘a*’ are replaced by ‘a’. The alternation operator | is replaced by the more general conjunction: for example, (a|b) is replaced by (a,b). Each leaf of the Schema Graph contains either a pointer to a B-tree or a NULL value.

6.3. B-trees

The atomic values that have the same path are gathered in one B-tree, which is attached to the corresponding leaf of the Schema Graph. For example, the leftmost B-tree in figure 6 is built on all atomic year values that appear as the values of the “/bib/book/RefID” path in the Document Graph.
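The two rules of Section 6.2 and the grouping of Section 6.3 can be sketched as follows. This is an illustration under stated assumptions and not the SphinX implementation: the content model is handled as a plain string, a sorted Python list stands in for each B-tree, and the function names, the sample content model and the sample document are invented.

```python
import re
import xml.etree.ElementTree as ET
from bisect import insort


def simplify_content_model(model):
    """Drop the cardinality operators ?, + and *, and turn alternation into sequence."""
    model = re.sub(r"\s+", "", model)
    model = re.sub(r"[?+*]", "", model)
    return model.replace("|", ",")


def index_values_by_path(xml_text):
    """Gather atomic values per root-to-leaf path; each sorted list stands in for a B-tree."""
    index = {}

    def visit(element, parent_path):
        path = f"{parent_path}/{element.tag}"
        for name, value in element.attrib.items():
            # Attributes are treated like child elements, as in the Document Graph.
            insort(index.setdefault(f"{path}/{name}", []), value)
        text = (element.text or "").strip()
        if text and len(element) == 0:
            insort(index.setdefault(path, []), text)
        for child in element:
            visit(child, path)

    visit(ET.fromstring(xml_text), "")
    return index


# Invented sample document; it is not the document of figure 5.
sample = """
<bib>
  <book year="1999"><title>Data on the Web</title></book>
  <book year="1995"><title>Foundations of Databases</title></book>
</bib>
"""
index = index_values_by_path(sample)
```

A query on a path such as /bib/book/year then goes straight to the list for that path instead of traversing the document, which is the effect the Schema Graph leaves and their B-trees are meant to achieve.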

7. Lore System

Lore is a database management system designed specifically for semi-structured data. It uses the Object Exchange Model (OEM), which is a labeled directed graph, as shown in figure 7.


Figure 7: Object Exchange Model Graph

The vertices in the graph are objects. Each object has a unique object identifier (OID). There are four different types of indexes that can be built over a Lore database. These indexes can be classified into two groups:

Indexes that identify objects with specific values:
• Value index
• Text index

Indexes used to traverse the database graph:
• Link index
• Path index

7.1. Value index or Vindex

The Vindex allows fast retrieval of all objects that are reachable by a given edge and match a comparison predicate. It is implemented using B+ trees. It takes a label ‘l’, a comparator ‘c’ and a value ‘v’, and returns all atomic objects that have an incoming edge labeled ‘l’ and a value that satisfies the comparator ‘c’ with respect to ‘v’.

Example: A Vindex is created for the incoming label Price over the database in figure 7. If a lookup is performed for values > 15.00 with the edge Price, the result is {&11, &15}.

7.2. Text index or Tindex

The Tindex is implemented using inverted lists. It maps a given word ‘w’ and a label ‘l’ to the list of atomic values that have an incoming edge ‘l’ and contain the word ‘w’. The label can be omitted for a full search. The Tindex returns a list of postings (o, n), each indicating that ‘w’ appears in object ‘o’ as the nth word of its value.

Example: A Tindex is used to find all objects with an atomic string value containing the word “Ford” and an incoming edge Name (figure 7). The result is {(&17, 2), (&21, 2)}.

7.3. Link index or Lindex

Since inverse pointers are not supported in OEM graphs, the Lindex provides a mechanism for retrieving the parents of an object via a given label. It takes a child object ‘c’ and a label ‘l’ and returns all parents ‘p’ such that there is an l-labeled edge from p to c. If the label is omitted, the Lindex returns all parents and their labels.
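The lookup interfaces of the Vindex, Tindex and Lindex (Sections 7.1 to 7.3) can be sketched over a small OEM-like graph. The sketch makes several assumptions: the graph is a list of (parent, label, child) edges plus a dictionary of atomic values, plain Python dictionaries and lists stand in for the B+ trees and inverted lists, and the Vindex only covers numeric values. The OIDs and values are invented and do not reproduce figure 7.

```python
import operator
from collections import defaultdict

# Invented OEM-like sample: (parent OID, label, child OID) plus atomic values.
EDGES = [
    ("&1", "Movie", "&2"), ("&1", "Movie", "&3"),
    ("&2", "Title", "&4"), ("&2", "Price", "&5"), ("&2", "Actor", "&6"),
    ("&3", "Title", "&7"), ("&3", "Price", "&8"), ("&3", "Actor", "&6"),
    ("&6", "Name", "&9"),
]
VALUES = {"&4": "Star Wars", "&5": 12.5, "&7": "Witness", "&8": 18.0,
          "&9": "Harrison Ford"}

COMPARATORS = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
               ">=": operator.ge, ">": operator.gt}


def build_vindex(edges, values):
    """label -> sorted (value, OID) pairs for numeric atomic objects."""
    vindex = defaultdict(list)
    for _, label, child in edges:
        if isinstance(values.get(child), (int, float)):
            vindex[label].append((values[child], child))
    for entries in vindex.values():
        entries.sort()
    return vindex


def vindex_lookup(vindex, label, comparator, value):
    compare = COMPARATORS[comparator]
    # A B+ tree would seek to the boundary; this sketch scans the label's entries.
    return sorted({oid for v, oid in vindex.get(label, []) if compare(v, value)})


def build_tindex(edges, values):
    """word -> set of (label, OID, position of the word in the value)."""
    tindex = defaultdict(set)
    for _, label, child in edges:
        value = values.get(child)
        if isinstance(value, str):
            for position, word in enumerate(value.split(), start=1):
                tindex[word.lower()].add((label, child, position))
    return tindex


def tindex_lookup(tindex, word, label=None):
    """Postings (OID, n); the label may be omitted for a full search."""
    return sorted({(oid, n) for lbl, oid, n in tindex.get(word.lower(), ())
                   if label is None or lbl == label})


def build_lindex(edges):
    """child OID -> set of (parent OID, label)."""
    lindex = defaultdict(set)
    for parent, label, child in edges:
        lindex[child].add((parent, label))
    return lindex


def lindex_lookup(lindex, child, label=None):
    """Parents via `label`, or all (parent, label) pairs if the label is omitted."""
    pairs = lindex.get(child, ())
    if label is None:
        return sorted(pairs)
    return sorted(parent for parent, lbl in pairs if lbl == label)
```

For example, `vindex_lookup(build_vindex(EDGES, VALUES), "Price", ">", 15.0)` selects the price objects above 15.00 in the sample, and `lindex_lookup(build_lindex(EDGES), "&6", "Actor")` walks one edge backwards from the shared actor object to both movies.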

7.4. Path index or Pindex

The Pindex is used to find all objects reachable by a given labeled path. It takes an object ‘o’ (e.g. the root) and a path ‘p’, and returns the set of objects reachable from ‘o’ by following path ‘p’.

Example: If the query “select DB.Movie.Title” is applied over the database in figure 7, the Pindex is used to directly locate all objects reachable via DB.Movie.Title. The result is {&5, &9, &14}.
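A path index of this kind can be sketched by precomputing, for every label path starting at the root, the set of objects it reaches. The sketch assumes an acyclic graph (otherwise the number of paths is unbounded) and indexes paths from a single root only; the path passed to the lookup omits the name of the root, so “Movie.Title” plays the role of DB.Movie.Title. The data and function names are invented and do not reproduce figure 7.

```python
from collections import defaultdict

# Invented OEM-like sample: (parent OID, label, child OID); "&1" plays the role of DB.
EDGES = [
    ("&1", "Movie", "&2"), ("&1", "Movie", "&3"),
    ("&2", "Title", "&4"), ("&2", "Actor", "&6"),
    ("&3", "Title", "&7"), ("&3", "Actor", "&6"),
    ("&6", "Name", "&9"),
]


def build_pindex(edges, root):
    """Map every label path from the root to the set of OIDs it reaches."""
    children = defaultdict(list)
    for parent, label, child in edges:
        children[parent].append((label, child))
    pindex = defaultdict(set)
    seen = set()
    stack = [((), root)]
    while stack:
        path, oid = stack.pop()
        if (path, oid) in seen:
            continue                      # the same object reached by the same path
        seen.add((path, oid))
        pindex[path].add(oid)
        for label, child in children.get(oid, ()):
            stack.append((path + (label,), child))
    return pindex


def pindex_lookup(pindex, path):
    """Return the objects reachable from the root via a dotted path such as 'Movie.Title'."""
    return sorted(pindex.get(tuple(path.split(".")), ()))


pindex = build_pindex(EDGES, "&1")
```

A lookup is then a single dictionary access instead of a traversal of the graph, at the price of building and maintaining the table of paths.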

8. Index Fabric

Index Fabric is an indexing scheme that optimizes searches over semi-structured databases. It is based on Patricia tries. An example of a Patricia trie is shown in figure 8.


Figure 8: Patricia trie.

Patricia stands for “Practical Algorithm to Retrieve Information Coded in Alphanumeric”, and the word “trie” is taken from “retrieval”. In a Patricia trie the nodes are labelled with their depth. In Index Fabric, data paths are encoded using designators, which are characters or strings. For example, for the XML documents in figure 9, we can choose I for <invoice>, B for <buyer>, N for <name>, and so on. Then the string “IBNABC Corp” has the same meaning as the XML fragment <invoice><buyer><name>ABC Corp</name></buyer></invoice>.

Figure 9: Two XML documents

A designator dictionary is used to interpret the designators. The XML documents of figure 9 can be encoded as a set of raw paths. First, as shown in figure 10(a), designators are assigned to tags. Next, the root-to-leaf paths are encoded to produce the keys shown in figure 10(b). Finally, these keys are inserted in the Index Fabric to generate the Patricia trie shown in figure 11.
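The encoding step can be sketched as follows. The designators I, B and N are the ones chosen in the text; the designator S for a <seller> element, the sample document, and the function names are assumptions made for the illustration and do not reproduce figures 9 and 10. A sorted list with a prefix search stands in for the Patricia trie, and the sketch assumes that designators cannot be confused with data characters.

```python
import xml.etree.ElementTree as ET
from bisect import bisect_left

# I, B and N follow the text; S is a hypothetical designator for <seller>.
DESIGNATORS = {"invoice": "I", "buyer": "B", "name": "N", "seller": "S"}


def encode_paths(xml_text, designators):
    """Encode every root-to-leaf path as the designator string followed by the leaf value."""
    keys = []

    def visit(element, prefix):
        prefix += designators[element.tag]
        if len(element) == 0:             # leaf element: append its text value
            keys.append(prefix + (element.text or "").strip())
        for child in element:
            visit(child, prefix)

    visit(ET.fromstring(xml_text), "")
    return keys


def prefix_search(sorted_keys, prefix):
    """Return all keys that start with `prefix`; stands in for a descent in the trie."""
    i = bisect_left(sorted_keys, prefix)
    matches = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        matches.append(sorted_keys[i])
        i += 1
    return matches


# Invented sample document.
document = ("<invoice><buyer><name>ABC Corp</name></buyer>"
            "<seller><name>Acme Inc</name></seller></invoice>")
keys = sorted(encode_paths(document, DESIGNATORS))
```

Because the path and the value form one string, a query such as "invoices whose buyer name is ABC Corp" becomes a single search for the key IBNABC Corp, and a query on the path alone becomes a prefix search for IBN.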


Figure 10: (a) Designators (b) Encoded paths

Figure 11: Patricia trie used by the Index Fabric

9. Conclusion

In this paper we have seen a number of indexing techniques, each of which takes a different approach; many other techniques are used in other systems. Some of these techniques are still under construction, such as the indexing structure used in the Natix system. This shows that the indexing of XML is a domain that is still developing and improving.


10. References

Graphs: S. Abiteboul, P. Buneman, and D. Suciu. “Data on the Web: From Relations to Semistructured Data and XML”. Morgan Kaufmann, 2000.

Natix: C.-C. Kanne and G. Moerkotte. “Efficient storage of XML data”. Proc. of ICDE, California, USA, page 198, 2000. http://citeseer.nj.nec.com/kanne99efficient.html

Sphinx: L. K. Poola and J. R. Haritsa. “SphinX: Schema-conscious XML Indexing”. Indian Institute of Science, 2001. http://citeseer.nj.nec.com/poola01sphinx.html

Lore System: J. McHugh, J. Widom, S. Abiteboul, Q. Luo, and A. Rajaraman. “Indexing semistructured data”. Technical report, Stanford University, Computer Science Department, 1998. http://citeseer.nj.nec.com/mchugh98indexing.html

Index Fabric: B. Cooper, N. Sample, M. J. Franklin, G. R. Hjaltason, and M. Shadmon. “A fast index for semistructured data”. In Proceedings of VLDB, 2001. http://citeseer.nj.nec.com/cooper01fast.html
