sanjay agrawal microsoft research surajit chaudhuri microsoft research gautam das microsoft research...

21
Sanjay Agrawal Microsoft Research Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research DBXplorer: A System for Keyword Based Search over Relational Databases Presented by: DEEP PANCHOLI 1000556121

Upload: christopher-tom

Post on 31-Mar-2015

216 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Sanjay Agrawal Microsoft Research Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research DBXplorer: A System for Keyword Based Search over

Sanjay Agrawal Microsoft Research

Surajit Chaudhuri Microsoft Research

Gautam Das Microsoft Research

DBXplorer: A System for Keyword Based Search over Relational

Databases

Presented by:

DEEP PANCHOLI 1000556121

Page 2: Sanjay Agrawal Microsoft Research Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research DBXplorer: A System for Keyword Based Search over

IntroductionThe two most common types of search are

Structured Search and Keyword Based SearchExample of Structured Search

http://autos.yahoo.com/A similar example is to search for books in

bookseller’s database e.g. Books->Travel->Maps->Asia->Russia

We all already know what is keyword based search and one example can be searching for ‘Jim Gray’ on Microsoft Intranet to obtain matched rows

Page 3: Sanjay Agrawal Microsoft Research Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research DBXplorer: A System for Keyword Based Search over

Introduction (cont)Problems faced with Keyword based search

implementationNeed to know schemaNormalized databasesAvailability of indexes

Built on the concept laid by BANKS paper explained in the last lecture.

Symbol tablesCompacting the symbol tablesSearch requirements

Page 4: Sanjay Agrawal Microsoft Research Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research DBXplorer: A System for Keyword Based Search over

Overview of DBXplorerDBXplorer returns all rows either from single

table or from multiple tables, using FK-joins, such that each row has all the keywords

Publish1. Identify a database and tables and columns

within it that are to be enabled for search2. Create auxiliary tables (Symbol Tables)

Search1. Look up the Symbol table2. Searching in possible subsets of tables3. Construct and execute SQL statement and rank

the results before displaying to user

Page 5: Sanjay Agrawal Microsoft Research Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research DBXplorer: A System for Keyword Based Search over

Different Symbol Table DesignsWe will only consider exact match problemTwo important levels of granularities

Column level granularity (Pub-Col)Cell level granularity (Pub-Cell)

Table=Authors

Fname Lname

John Marshall

John Shawn

Shawn Archer

Jacquelyn Marshall

Page 6: Sanjay Agrawal Microsoft Research Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research DBXplorer: A System for Keyword Based Search over

Factors that affect granularitySpace and time requirements

Pub-Col is faster and occupies less spaceKeyword search performance

Pub-Col if there is an index on the columnEase of Symbol table maintenance

Pub-Col is easier to maintain as it contains updates only if there is addition of a new distinct values

Hence, the Pub-Col alternative is almost always better than Pub-Cell unless if certain columns contain no indexes

If an index is available for column, we should use Pub-Col granularity

Page 7: Sanjay Agrawal Microsoft Research Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research DBXplorer: A System for Keyword Based Search over

Pub-Col representationStore simply as Keyword-ColIdAlternative is to use Hashvalue-ColId since

storing keywords is wasteful as strings can be long and of varying lengths

Compression AlgorithmsFK-Comp: If column c1 is a subset of values

in another column c2, we retain only values in c1

CP-Comp: It is used when pairs of columns share common keywords but are not tied by FK

Page 8: Sanjay Agrawal Microsoft Research Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research DBXplorer: A System for Keyword Based Search over

Pub-Col Algorithm

Page 9: Sanjay Agrawal Microsoft Research Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research DBXplorer: A System for Keyword Based Search over

Search ComponentCommon step for all kinds of granularitiesIt makes use of join treesHence, if we join tables that occur in the join

tree the resulting relation will contain all potential rows containing all keywords specified in the query

Example of graph treeFinally SQL query is generated and runThe result is then ranked before outputting. The

basic approach is to rank them based on the number of joins involved which is similar to Banks approach

Page 10: Sanjay Agrawal Microsoft Research Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research DBXplorer: A System for Keyword Based Search over

Search Algorithm

Page 11: Sanjay Agrawal Microsoft Research Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research DBXplorer: A System for Keyword Based Search over

Case of Token matchesToken matches are matches in which

keyword match with a token or a substring of attribute value

Pub-Prefix method efficiently enables token match capabilities by exploiting available B+ tree indexes

Symbol table has entry (hash(k),T.C, P)

Page 12: Sanjay Agrawal Microsoft Research Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research DBXplorer: A System for Keyword Based Search over

Case of Token matches (cont)Pub-Prefix method result is comparable to Pub-

Cell method when the column width is small (i.e. less than 100 characters)

For columns where strings are greater than hundreds of characters, Pub-Cell outperforms Pub-Prefix significantly

Important issue is to determine the appropriate prefix length stored in symbol table. However, Pub-Prefix method is still being researched upon

Other research is going on in field of stemming of query keywords

Page 13: Sanjay Agrawal Microsoft Research Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research DBXplorer: A System for Keyword Based Search over

Experimental ResultsThe experiments were carried out on a

450MHz 256 MB Intel P-3 machine. There were 4 databases used for evaluation:TPC-H data of sizes from 100 to 500 MBUSR is Microsoft employee address DB of 130

MB with 19 tablesML is a 375 MB mailing list DB with 38 tablesKB is a 365 MB DB with 84 tables containing

information on articles and help manuals on various shipped products

Page 14: Sanjay Agrawal Microsoft Research Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research DBXplorer: A System for Keyword Based Search over

System Architecture for DBXplorer

Page 15: Sanjay Agrawal Microsoft Research Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research DBXplorer: A System for Keyword Based Search over

Experimental Results (cont)In particular the authors show the

following:Pub-Col is compact compared to Pub-CellPub-Col scales linearly with data size and is

independent of data distributionPub-Prefix is compact compared to Pub-Cell

and has a significantly better performance when full text indexes are not present

Page 16: Sanjay Agrawal Microsoft Research Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research DBXplorer: A System for Keyword Based Search over

Pub-Col and Pub-Cell symbol table size comparison

Page 17: Sanjay Agrawal Microsoft Research Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research DBXplorer: A System for Keyword Based Search over

Symbol table publishing time comparison

Page 18: Sanjay Agrawal Microsoft Research Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research DBXplorer: A System for Keyword Based Search over

Query performance

Page 19: Sanjay Agrawal Microsoft Research Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research DBXplorer: A System for Keyword Based Search over

Other ObservationsIt was also noticed that search scales with

number of query keywords. The query was varied with 2 to 10 keywords and still the average query time was between 1 to 1.3 seconds

Also, it was noticed that FK-Comp and CP-Comp reduce the size of Pub-Col by a factor of 0.45 to 0.90 depending on size of original table

However, it was noticed that compression added a negligible overhead on search performance

Page 20: Sanjay Agrawal Microsoft Research Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research DBXplorer: A System for Keyword Based Search over

Effectiveness of Pub-Prefix methodThe Pub-prefix method was tested on workload

consisting of 100 random keywords from character column of width 64 bytes in the KB database.

It was noticed that the performance of Pub-Prefix increased with increase in Pub-Prefix length and gave the optimum performance at prefix-length of 8

This is because as the length increases, beyond a certain limit the optimizer decides to scan the original table compared to index search

Page 21: Sanjay Agrawal Microsoft Research Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research DBXplorer: A System for Keyword Based Search over

ConclusionAlthough, we discussed only about a single

database query, this technique can be applied to search multiple databases also

DBXplorer is easy to use with any Database Management system

As mentioned before, the Pub-Col alternative is the best when columns have indexes on them. A hybrid table can be created so that if there is an index for a column, we use Pub-Col granularity and if there is no index, we use Pub-Cell granularity