Final Report

Advanced Keyword Search Engine

By

Nikhil Pratap

Sathishkumar Poornachandran

CSE 511

5/5/2011

I. ABSTRACT


We introduce an efficient advanced keyword search engine over an XML dataset, with results
ranked by the total number of reviews, the review posting dates, and the rating of each product.
The advanced search is implemented with the Boolean operators AND, OR, and NOT, which let the
user narrow the results precisely. We have also implemented wildcard and phrase search to
enhance the user's searching experience.

II. INTRODUCTION & MOTIVATION

The existing problem with Amazon is that users can query the search engine only through certain
pre-defined categories such as Electronics, Computers, Books, and Cosmetics. There is no option
for users to issue real-time queries based on the attributes of the products. In this paper, we
propose an advanced search query engine that lets users enter more inputs, which are then used
to filter search results by relevance. Our advanced search also supports the operators AND, OR,
and NOT, which Amazon treats as ordinary text strings. For example, the search query
Electronics = (computers not laptop) still fetches laptop information on Amazon, because the
word "not" is treated as just another string. This paper also introduces phrase search, wildcard
search, and an intuitive ranking algorithm based on the number of reviews, the review dates, and
the rating.

III. ARCHITECTURE

The main architecture of the system can be divided into two categories.

a) Offline Computation.

b) Online Computation.

a) Offline Computation

In the offline computation, we extract the data from the XML dataset with SAX and DOM parsers
and store it in the index file as an inverted index. In our scenario, we store the values as
fields such as title, brand, reviews, and model. The process is illustrated below.
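The extraction step can be sketched with the SAX parser that ships with the JDK. This is only a minimal sketch, not the project's actual parser: the element names (product, title, brand) and the class name ProductSaxSketch are assumptions made for illustration, since the schema of the dataset is not given here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: the <product> element and its child fields are assumed names.
class ProductSaxSketch {

    /** Parses the XML text and returns one field map per <product> element. */
    static List<Map<String, String>> parse(String xml) {
        List<Map<String, String>> products = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private Map<String, String> current;               // product being read, or null
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (qName.equals("product")) {
                    current = new LinkedHashMap<>();
                }
                text.setLength(0);                              // start collecting fresh text
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals("product")) {
                    products.add(current);                      // product complete
                    current = null;
                } else if (current != null) {
                    current.put(qName, text.toString().trim()); // field name -> field value
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return products;
    }

    public static void main(String[] args) {
        String xml = "<products><product><title>Canon EOS</title>"
                + "<brand>Canon</brand></product></products>";
        System.out.println(parse(xml));
    }
}
```

Each returned map corresponds to one product whose fields would then be written to the index.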

b) Online Computation

In the online computation, the user sends a query through the web interface and the request is
handled by a servlet. The servlet processes the request and passes the user query to the query
parser for evaluation. The query parser checks the validity of the query, for example by
removing unwanted words, parentheses, and quotes, and sends it back to the servlet. The servlet
then sends the processed query to the indexer, where the actual search takes place. The result
sets are returned to the user through the same servlet. A pictorial representation of the whole
process is given below.

Introducing Inverted Index:

The indexer retrieves search results efficiently because it searches an index instead of
searching the documents directly. This is analogous to finding the pages of a book related to a
keyword by consulting the index at the back of the book, as opposed to scanning the words on
every page.

This type of index is called an inverted index because it inverts a page-centric data structure
(page -> keywords) into a keyword-centric data structure (word -> pages).
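The idea can be sketched in a few lines of plain Java. This is a minimal illustration, not the index format used by the project: the class name InvertedIndexSketch and the simple lowercase word tokenizer are assumptions.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch of a keyword-centric index: word -> ids of documents containing it.
class InvertedIndexSketch {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Splits the text into lowercase words and records docId under each word. */
    void addDocument(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the ids of all documents containing the word (empty if none). */
    SortedSet<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument(1, "Canon digital camera");
        index.addDocument(2, "Sony digital camera");
        System.out.println(index.lookup("camera"));
    }
}
```

A lookup touches only the entry for the queried word, which is why it avoids scanning every document.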

IV. FUNCTIONALITIES

The functionalities included in this project are as follows.

a) Boolean Search.

b) Ranking Result sets.

c) Phrase Search.

d) Wildcard Search.

a) BOOLEAN SEARCH

Boolean search allows users to combine search results using the operators AND, OR, and NOT.
A clause is added to a BooleanQuery (a class of the Apache Lucene API) with the method below.

public void add(Query query, BooleanClause.Occur occur)

Here occur can be BooleanClause.Occur.MUST, BooleanClause.Occur.SHOULD, or
BooleanClause.Occur.MUST_NOT.

BooleanClause.Occur.MUST:

This occurrence type is used to AND two or more queries: a document must match every MUST clause.

For example, (query1 AND query2 AND query3):

booleanQuery.add(query1, BooleanClause.Occur.MUST);   // likewise for query2 and query3



BooleanClause.Occur.SHOULD:

This occurrence type is used to OR two or more queries: when a query has only SHOULD clauses, a
document matching any one of them is returned.

For example, (query1 OR query2 OR query3):

booleanQuery.add(query1, BooleanClause.Occur.SHOULD);   // likewise for query2 and query3



BooleanClause.Occur.MUST_NOT:

This occurrence type is used for NOT: a document that matches a MUST_NOT clause is excluded from
the results.

For example, (query1 NOT query2):

booleanQuery.add(query1, BooleanClause.Occur.MUST);

booleanQuery.add(query2, BooleanClause.Occur.MUST_NOT);
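The effect of the three occurrence types on result sets can be sketched with plain set operations. This is only an illustration of the semantics under simplifying assumptions (no scoring, each sub-query already resolved to a set of document ids), not how the search library evaluates a BooleanQuery; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: Boolean operators applied to sets of matching document ids.
class BooleanSetSketch {

    /** AND (both clauses MUST): documents present in both result sets. */
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    /** OR (both clauses SHOULD): documents present in either result set. */
    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    /** NOT (a is MUST, b is MUST_NOT): documents in a that are absent from b. */
    static Set<Integer> not(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> computers = new TreeSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> laptops = new TreeSet<>(Arrays.asList(2, 3, 4));
        // (computers NOT laptop) keeps only the computers that are not laptops.
        System.out.println(not(computers, laptops));
    }
}
```

The inputs are copied before each operation, so the original result sets can be reused in a larger expression.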

b) Ranking Result sets:

We have implemented an algorithm that ranks results by the number of reviews, the review posting
dates, and the rating of each product. The motivation is that some existing algorithms depend
entirely on the average rating of the products. They fail to take the number of reviews and the
review dates into account, which brings old products with very few reviews and high ratings to
the top. We now look at some ambiguous scenarios that affect the quality of ranking.

Scenario 1:

Two products are compared in the picture above. The first product has only one review, with a
rating of 5, so its average rating is 5. The second product has 4 reviews with an average rating
of 4.5. Ranking by average rating alone would be misleading here: product 2 is the obvious
choice for the user because its rating is backed by more reviews.

Scenario 2:

In scenario 2, the review date of product one changes. The product is relatively new to the
market and has one review with a rating of 5. The second product remains the same. In this case
the user may prefer product one, because it is new to the market and there is every chance that
it will get good ratings in the future. If it does not, the algorithm brings it down, because it
also considers the average number of days per review.

Algorithm Steps:

1) Calculate the maximum number of reviews and the minimum time difference among all the
products.

2) Iterate over the products one at a time. For each product:

   Calculate the average rating and the average days per review.

   tempWeightedAverage = (Number of reviews / Maximum number of reviews) * Average rating

   WeightedAverage = (Minimum time difference / Average days per review) * tempWeightedAverage

   Add the product with its corresponding WeightedAverage to the HashMap.

3) Terminate after finding the weighted average of all the products.

How is the maximum number of reviews calculated? Suppose the dataset has 1000 products. If the
100th product has 50 reviews, the most among all the products, then the maximum number of
reviews is 50.

How is the minimum time difference calculated? It is the minimum difference, in days, between
the current date and the product launch date, taken over all the products.

The generated HashMap is then dumped into a TreeMap and sorted by WeightedAverage.
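The steps above can be sketched as follows. Several details are assumptions, since the text does not spell them out: the average days per review is taken as the days since launch divided by the number of reviews, products without reviews get a score of 0, and the final ordering is done by sorting a list rather than through a TreeMap. The class and field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the weighted-average ranking described above.
class RankingSketch {

    /** Assumed product record; the field names are illustrative only. */
    static final class Product {
        final String name;
        final int reviewCount;
        final double averageRating;   // on a scale of 0 to 5
        final long daysSinceLaunch;   // days between the launch date and today

        Product(String name, int reviewCount, double averageRating, long daysSinceLaunch) {
            this.name = name;
            this.reviewCount = reviewCount;
            this.averageRating = averageRating;
            this.daysSinceLaunch = daysSinceLaunch;
        }
    }

    /** Steps 1 and 2: computes the WeightedAverage of every product. */
    static Map<String, Double> weightedAverages(List<Product> products) {
        int maxReviews = 0;
        long minDays = Long.MAX_VALUE;
        for (Product p : products) {                       // step 1: global maximum and minimum
            maxReviews = Math.max(maxReviews, p.reviewCount);
            minDays = Math.min(minDays, p.daysSinceLaunch);
        }
        Map<String, Double> scores = new HashMap<>();
        for (Product p : products) {                       // step 2: one product at a time
            if (p.reviewCount == 0 || p.daysSinceLaunch <= 0) {
                scores.put(p.name, 0.0);                   // assumption: nothing to rank on
                continue;
            }
            double avgDaysPerReview = (double) p.daysSinceLaunch / p.reviewCount;
            double temp = ((double) p.reviewCount / maxReviews) * p.averageRating;
            scores.put(p.name, (minDays / avgDaysPerReview) * temp);
        }
        return scores;
    }

    /** Returns the product names ordered from highest to lowest WeightedAverage. */
    static List<String> rank(List<Product> products) {
        Map<String, Double> scores = weightedAverages(products);
        List<String> names = new ArrayList<>(scores.keySet());
        names.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        return names;
    }
}
```

With the two products of scenario 1, and assuming both were launched 100 days ago, the product with 4 reviews scores (100 / 25) * (4 / 4) * 4.5 = 18, while the product with a single review scores (100 / 100) * (1 / 4) * 5 = 1.25, so the better-reviewed product is ranked first.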

c) Phrase Search

A phrase query matches documents that contain a particular sequence of terms. A PhraseQuery is
built by QueryParser for input such as title: "Sony 12 Megapixel".

For queries enclosed in double quotation marks, the indexer looks for an exact match in the
inverted index.
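One common way to support exact phrase matching is to store word positions in the inverted index. The sketch below illustrates that technique in plain Java; it is an assumption about how such matching can work, not a description of the library's internals, and the class name PhraseSearchSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: positional inverted index, word -> (docId -> word positions).
class PhraseSearchSketch {
    private final Map<String, Map<Integer, List<Integer>>> index = new HashMap<>();

    void addDocument(int docId, String text) {
        List<String> terms = tokenize(text);
        for (int pos = 0; pos < terms.size(); pos++) {
            index.computeIfAbsent(terms.get(pos), t -> new HashMap<>())
                 .computeIfAbsent(docId, d -> new ArrayList<>())
                 .add(pos);
        }
    }

    /** Returns the ids of documents in which the words occur consecutively, in order. */
    Set<Integer> searchPhrase(String phrase) {
        List<String> terms = tokenize(phrase);
        Set<Integer> hits = new TreeSet<>();
        if (terms.isEmpty()) {
            return hits;
        }
        Map<Integer, List<Integer>> first =
                index.getOrDefault(terms.get(0), Collections.emptyMap());
        for (Map.Entry<Integer, List<Integer>> entry : first.entrySet()) {
            for (int start : entry.getValue()) {
                if (matchesAt(entry.getKey(), start, terms)) {
                    hits.add(entry.getKey());
                    break;                                  // one occurrence is enough
                }
            }
        }
        return hits;
    }

    /** Checks that word i of the phrase sits at position start + i in the document. */
    private boolean matchesAt(int docId, int start, List<String> terms) {
        for (int i = 1; i < terms.size(); i++) {
            Map<Integer, List<Integer>> docs = index.get(terms.get(i));
            List<Integer> positions = (docs == null) ? null : docs.get(docId);
            if (positions == null || !positions.contains(start + i)) {
                return false;
            }
        }
        return true;
    }

    private static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }
}
```

A document that contains all the words but in a different order or with other words in between is not returned, which is what distinguishes a phrase query from an AND query.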

V. RESULTS

User interface of the advanced search engine

Notes:

i) The user can search for multiple products in the same query.

For example, Query: (Brand: Sony Camera) or (Brand: Sony Computer)

ii) The user can build a complex Boolean query by combining AND, OR, and NOT.

iii) The user can select a maximum of 1000 results per page.

iv) The user can enable ranking if needed.

Result Set page

Query: (Brand: Sony Camera) or (Brand: Sony Computer)

Ranked Results

Query: (title: canon)

Notes:

i) Here, the ranking is based on the number of reviews, the review dates, and the rating of each
product.

ii) If a product has no reviews or ratings, it is given a rating of 0.

iii) The rating is given on a scale of 0 to 5.

Wild Card Search

Query: (title: can*n) AND (Feature: S?R)

Notes:

i) * and ? can be used as wildcard characters.

ii) Multiple wildcards can be used in a term to match query strings.

iii) The wildcards can be used anywhere in the string except at the beginning of the string.
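The wildcard rules above can be sketched by translating a wildcard term into a regular expression. The sketch assumes the usual meaning of the two characters (* matches any run of characters, including none, and ? matches exactly one character) and case-insensitive matching; the class name WildcardSketch is hypothetical, and this is not the matching code used by the search library.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: wildcard term -> regular expression.
class WildcardSketch {

    /** Builds a case-insensitive pattern; rejects empty terms and leading wildcards. */
    static Pattern toPattern(String wildcard) {
        if (wildcard.isEmpty()) {
            throw new IllegalArgumentException("empty term");
        }
        char first = wildcard.charAt(0);
        if (first == '*' || first == '?') {
            throw new IllegalArgumentException("wildcard not allowed at the beginning: " + wildcard);
        }
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");                              // any run of characters
            } else if (c == '?') {
                regex.append('.');                               // exactly one character
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));  // literal character
            }
        }
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    /** Returns true if the whole term matches the wildcard expression. */
    static boolean matches(String wildcard, String term) {
        return toPattern(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("can*n", "canon"));
        System.out.println(matches("S?R", "SLR"));
    }
}
```

Under these assumptions, can*n matches both "canon" and "cannon", while S?R matches "SLR" but not "SR".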

Phrase Search:

Query: Feature: (Compatible With Select Canon Digital SLR Cameras)

VI. EFFICIENCY EVALUATION

The system was able to retrieve 1000 results in approximately less than 2 seconds. We used a
dataset of more than 30,000 entries to evaluate the performance of the system. The system was
also evaluated with a load testing tool called Web Performance Load Tester to measure how it
behaves under load. Since the application was deployed on a single Tomcat server, there were
some failures when the system was load tested with 7 users at the same time. This problem could
be addressed with Tomcat clustering, where several Tomcat servers are clustered and the traffic
is load balanced across them.

The output of the tool is published below.

Performance Goal Analysis

[Figures: Page Duration, Page Completion Rate, Transaction (URL) Completion Rate, Failures, Bandwidth Consumption, Waiting Users]

Summarized by the selected user levels, this table shows some of the key metrics that reflect the
performance of the test as a whole.

Time Based Analysis:

[Figures: Page Duration, Page Completion Rate, Transaction (URL) Completion Rate, Failures, Bandwidth Consumption, Waiting Users]

Test summary metrics

Sorted by the elapsed test time, this table shows some of the key metrics that reflect the
performance of the test as a whole.

VII. CONCLUSION

This paper has introduced an efficient keyword search engine over an XML dataset, with results
ranked by the total number of reviews, the review posting dates, and the rating of each product.
The data was extracted from the XML file with DOM and SAX parsers. The advanced search was
implemented with the operators AND, OR, and NOT. The paper then explained phrase search and
wildcard search, which enhance the user's searching experience. A detailed performance analysis
of the system was presented with graphs and tables.

VIII. FUTURE WORK

i) Based on metrics such as click-through rate and conversion rate, the system can be trained to
provide more relevant results.

ii) Better personalization based on user profiles. By considering user activity over time, the
system can provide better personalized results for each user.
