final report - advanced search engine

Upload: sathish1125

Post on 06-Apr-2018

  • 8/3/2019 Final Report - Advanced Search Engine

    1/23

    Final Report

    Advanced Keyword Search Engine

    By

    Nikhil Pratap

    Sathishkumar Poornachandran

    CSE 511

    5/5/2011

    I. ABSTRACT

    We present an efficient advanced keyword search engine over an XML dataset that ranks results
    by the total number of reviews, the dates on which the reviews were posted and the rating of
    each product. The advanced search is implemented with the Boolean operators AND, OR and NOT,
    which let the user narrow the results to exactly the products wanted. We have also implemented
    wildcard search and phrase search to enhance the user's search experience.

    II. INTRODUCTION & MOTIVATION

    Amazon lets a user query its search engine only within certain pre-defined categories such as
    Electronics, Computers, Books and Cosmetics. There is no option for users to run real-time
    queries based on the attributes of the products. In this paper, we propose an advanced search
    query engine that lets users enter more inputs, which are then used to filter the search
    results by relevance. Our advanced search also supports the operators AND, OR and NOT, which
    Amazon treats as ordinary text strings. For example, the search query Electronics = (computers
    not laptop) still fetches laptop information on Amazon, because the word "not" is treated as
    just another string. This paper also introduces phrase search, wildcard search and an
    intuitive ranking algorithm based on the number of reviews, the review dates and the rating.

    III. ARCHITECTURE

    The main architecture of the system can be divided into two categories.

    a) Offline Computation.

    b) Online Computation.

    a) Offline Computation

    In the offline computation, we extract the data from the XML dataset with SAX and DOM parsers
    and store it in the index file as an inverted index. In our case, the values are stored as
    fields such as title, brand, reviews and model. An illustration of the process is given below.
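
    The extraction step can be sketched with the SAX API that ships with the JDK. This is only
    an illustration under stated assumptions: the element names product, title and brand are
    hypothetical stand-ins for the real dataset schema, and the sketch uses SAX alone rather than
    the SAX and DOM combination used in the project.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Sketch: collect the child elements of each <product> as field/value pairs. */
final class ProductSaxSketch {

    /** Returns one field map per <product> element (element names are assumed). */
    static List<Map<String, String>> parse(String xml) {
        List<Map<String, String>> products = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private Map<String, String> current;                 // fields of the open <product>
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                text.setLength(0);                               // start collecting fresh text
                if (qName.equals("product")) {
                    current = new LinkedHashMap<>();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals("product")) {
                    products.add(current);
                    current = null;
                } else if (current != null) {
                    current.put(qName, text.toString().trim()); // e.g. title -> "Canon EOS"
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse product XML", e);
        }
        return products;
    }

    public static void main(String[] args) {
        String xml = "<products><product><title>Canon EOS</title>"
                + "<brand>Canon</brand></product></products>";
        System.out.println(parse(xml));
    }
}
```

    Each returned map could then be handed to the indexer as the fields of one document.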

    b) Online Computation

    In the online computation, the user sends a query through the web interface and the request is
    handled by a servlet. The servlet passes the user query to the query parser for evaluation. The
    query parser validates and cleans the query, for example by removing unwanted words,
    parentheses and quotes, and sends it back to the servlet. The servlet then sends the processed
    query to the indexer, where the actual search takes place. The result set is returned to the
    user through the same servlet. A pictorial representation of the whole process is given below.

    Introducing Inverted Index:

    The indexer can retrieve search results efficiently because it searches an index rather than
    the documents themselves. This is the equivalent of finding the pages of a book related to a
    keyword by looking it up in the index at the back of the book, as opposed to scanning the
    words on every page.

    This type of index is called an inverted index, because it inverts a page-centric data
    structure (page -> words) into a keyword-centric data structure (word -> pages).
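
    The inversion can be illustrated with a few lines of plain Java. This is a minimal sketch,
    not the index format used by the project: the class name, the integer document ids and the
    simple tokenization on non-word characters are all assumptions made for the example.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Sketch of a word -> document ids mapping (hypothetical names throughout). */
final class InvertedIndexSketch {

    // Keyword-centric structure: each word points at the documents that contain it.
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    /** Splits the text into lower-cased words and records docId under each word. */
    void add(int docId, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, w -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the ids of the documents containing the word (empty if none). */
    Set<Integer> lookup(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Sony digital camera");
        index.add(2, "Canon digital SLR");
        System.out.println(index.lookup("digital"));
    }
}
```

    A lookup touches only the entry for the queried word, which is why the search does not need
    to scan every document.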

    IV. FUNCTIONALITIES

    The functionalities included in this project are as follows.

    a) Boolean Search.

    b) Ranking Result sets.

    c) Phrase Search.

    d) Wildcard Search.

    a) Boolean Search

    Boolean search allows users to combine search results using the operators AND, OR and NOT.
    A clause is added to a BooleanQuery (a class of the Apache Lucene search library) with the
    following method:

    public void add(Query query, BooleanClause.Occur occur)

    where occur can be BooleanClause.Occur.MUST, BooleanClause.Occur.SHOULD or
    BooleanClause.Occur.MUST_NOT.

    BooleanClause.Occur.MUST:

    Clauses added with MUST are combined with AND: a document matches only if it satisfies every
    one of them.

    For example, (query1 AND query2 AND query3):

    booleanQuery.add(query1, BooleanClause.Occur.MUST);
    booleanQuery.add(query2, BooleanClause.Occur.MUST);
    booleanQuery.add(query3, BooleanClause.Occur.MUST);

    BooleanClause.Occur.SHOULD:

    Clauses added with SHOULD are combined with OR: a document matches if it satisfies at least
    one of them.

    For example, (query1 OR query2 OR query3):

    booleanQuery.add(query1, BooleanClause.Occur.SHOULD);
    booleanQuery.add(query2, BooleanClause.Occur.SHOULD);
    booleanQuery.add(query3, BooleanClause.Occur.SHOULD);

    BooleanClause.Occur.MUST_NOT:

    A clause added with MUST_NOT implements NOT: documents that match it are excluded from the
    results.

    For example, (query1 NOT query2):

    booleanQuery.add(query1, BooleanClause.Occur.MUST);
    booleanQuery.add(query2, BooleanClause.Occur.MUST_NOT);
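
    The effect of the three operators can be pictured as set operations on the ids of the
    documents matched by each sub-query. The sketch below is plain Java rather than Lucene code:
    the class and method names are hypothetical, and it only illustrates the semantics, not how
    the library evaluates a BooleanQuery internally.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

/** Sketch: AND, OR and NOT as operations on sets of matching document ids. */
final class BooleanSetsSketch {

    /** AND: documents matched by both sub-queries (intersection). */
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    /** OR: documents matched by at least one sub-query (union). */
    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    /** NOT: documents matched by a but not by b (difference). */
    static Set<Integer> not(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> computers = new TreeSet<>(Arrays.asList(1, 2, 3)); // ids matching "computers"
        Set<Integer> laptop = new TreeSet<>(Arrays.asList(2, 3, 4));    // ids matching "laptop"
        System.out.println(and(computers, laptop));
        System.out.println(or(computers, laptop));
        System.out.println(not(computers, laptop));
    }
}
```

    With these example sets, the query (computers NOT laptop) keeps only the documents that match
    "computers" and do not match "laptop".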

    b) Ranking Result sets:

    We have implemented an algorithm that ranks results by the number of reviews, the dates on
    which the reviews were posted and the rating of each product. The motivation is that some
    existing algorithms depend entirely on the average rating of the products. They do not take
    the number of reviews and the review dates into account, which brings old products with very
    few reviews but high ratings to the top. We now look at some ambiguous scenarios that affect
    the quality of ranking.

    Scenario 1:

    Two products are compared in the above picture. The first product has only one review, with a
    rating of 5, so its average rating is 5. The second product has 4 reviews with an average
    rating of 4.5. Ranking by average rating alone would be misleading here: product 2 is the
    obvious choice for the user, because its rating is backed by more reviews and is therefore
    more reliable.

    Scenario 2:

    In scenario 2, the review date of product one changes. The product is relatively new to the
    market and has one review with a rating of 5. The second product remains the same. In this
    case, the user may prefer product one, because it is new to the market and there is every
    chance that it will get good ratings in the future. If it does not, the algorithm will bring
    it down, as it also considers the average number of days per review.

    Algorithm Steps:

    1) Calculate the maximum number of reviews and the minimum time difference among all the
    products.

    2) Iterate over the products one at a time. For each product:

       Calculate the average rating and the average days per single review.

       tempWeightedAverage = (Number of reviews / Maximum number of reviews) * Average rating

       WeightedAverage = (Minimum time difference / Average days per review) * tempWeightedAverage

       Add the product with its corresponding WeightedAverage to the HashMap.

    3) Terminate after finding the weighted average of all the products.

    How to calculate the maximum number of reviews? - For example, suppose the dataset has 1000
    products. If the 100th product has 50 reviews, which is the most among all the products, then
    the maximum number of reviews is 50.

    How to calculate the minimum time difference? - It is the minimum difference in days between
    the current date and the product launch date, taken over all the products.

    The generated HashMap is then dumped into a TreeMap and sorted with respect to the
    WeightedAverage.
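
    The steps above can be sketched in plain Java as follows. Several details are assumptions
    made for the example, not statements about the project's code: the class and field names are
    hypothetical, "average days per review" is taken to be the days since launch divided by the
    number of reviews, a launch age of zero days is rounded up to one, and a sorted list replaces
    the HashMap and TreeMap pair so that products with equal scores are not merged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Sketch of the review-aware ranking (all names are hypothetical). */
final class RankingSketch {

    static final class Product {
        final String name;
        final int reviews;           // number of reviews
        final double averageRating;  // on a scale of 0 to 5
        final long daysSinceLaunch;  // current date minus launch date, in days

        Product(String name, int reviews, double averageRating, long daysSinceLaunch) {
            this.name = name;
            this.reviews = reviews;
            this.averageRating = averageRating;
            this.daysSinceLaunch = Math.max(1, daysSinceLaunch); // assumption: avoid dividing by 0
        }
    }

    /** (minDays / avgDaysPerReview) * (reviews / maxReviews) * averageRating. */
    static double weightedAverage(Product p, int maxReviews, long minDays) {
        if (p.reviews == 0) {
            return 0.0; // a product without reviews is given a rating of 0
        }
        // Assumption: average days per review = days since launch / number of reviews.
        double avgDaysPerReview = (double) p.daysSinceLaunch / p.reviews;
        double tempWeightedAverage = ((double) p.reviews / maxReviews) * p.averageRating;
        return (minDays / avgDaysPerReview) * tempWeightedAverage;
    }

    /** Returns the product names ordered from highest to lowest weighted average. */
    static List<String> rank(List<Product> products) {
        int maxReviews = 1;
        long minDays = Long.MAX_VALUE;
        for (Product p : products) {                 // step 1: dataset-wide maximum and minimum
            maxReviews = Math.max(maxReviews, p.reviews);
            minDays = Math.min(minDays, p.daysSinceLaunch);
        }
        final int max = maxReviews;
        final long min = minDays;
        List<Product> sorted = new ArrayList<>(products);
        sorted.sort(Comparator.comparingDouble(
                (Product p) -> weightedAverage(p, max, min)).reversed());
        List<String> names = new ArrayList<>();
        for (Product p : sorted) {
            names.add(p.name);
        }
        return names;
    }

    public static void main(String[] args) {
        // Scenario 1 with an assumed launch age of 100 days for both products.
        System.out.println(rank(Arrays.asList(
                new Product("Product 1", 1, 5.0, 100), new Product("Product 2", 4, 4.5, 100))));
        // Scenario 2: product 1 is assumed to have been launched 5 days ago.
        System.out.println(rank(Arrays.asList(
                new Product("Product 1", 1, 5.0, 5), new Product("Product 2", 4, 4.5, 100))));
    }
}
```

    Under these assumed launch ages, product 2 is ranked first in scenario 1 and product 1 is
    ranked first in scenario 2, which agrees with the discussion of the two scenarios.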

    c) Phrase Search

    A phrase query matches documents containing a particular sequence of terms. A PhraseQuery is
    built by the QueryParser for input like title:"Sony 12 Megapixel".

    For queries enclosed in double quotation marks, the indexer looks for an exact match of the
    term sequence in the inverted index.
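
    The idea of matching a sequence of terms can be shown with a small plain-Java sketch. It is
    not how PhraseQuery is implemented: the class and method names are hypothetical, and the
    sketch checks the tokens of a field value directly instead of consulting term positions
    stored in an index.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Sketch: a phrase matches when its terms occur consecutively and in order. */
final class PhraseMatchSketch {

    /** Lower-cases the text and splits it on whitespace. */
    static List<String> tokens(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    /** True if the phrase terms appear as one contiguous run inside the field value. */
    static boolean containsPhrase(String fieldValue, String phrase) {
        return Collections.indexOfSubList(tokens(fieldValue), tokens(phrase)) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(containsPhrase("Sony 12 Megapixel Digital Camera", "Sony 12 Megapixel"));
        System.out.println(containsPhrase("Sony Digital 12 Megapixel Camera", "Sony 12 Megapixel"));
    }
}
```

    The second title contains all three terms but not next to each other, so it matches an AND
    query on the individual terms while failing the phrase query.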

    V. RESULTS

    User interface of the advanced search engine

    Notes:

    i) The user can search for multiple products in the same query.

    For example: Query: (Brand: Sony Camera) or (Brand: Sony Computer)

    ii) The user can build complex Boolean queries by combining AND, OR and NOT.

    iii) The user can select a maximum of 1000 results per page.

    iv) The user can enable ranking if needed.

    Result Set page

    Query: (Brand: Sony Camera) or (Brand: Sony Computer)

    Ranked Results

    Query: (title: canon)

    Notes:

    i) Here, the ranking is based on the number of reviews, the review dates and the rating of
    each product.

    ii) If a product has no reviews or ratings, it is given a rating of 0.

    iii) The rating is given on a scale of 0 to 5.

    Wild Card Search

    Query: (title: can*n) AND (Feature: S?R)

    Notes:

    i) * and ? can be used as wildcard characters.

    ii) Multiple wildcards can be used in a term to match query strings.

    iii) Wildcards can be used anywhere in the string except at the beginning.
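
    The rules in these notes can be illustrated by translating a wildcard term into a regular
    expression. This is a sketch under assumptions rather than the matching code of the search
    library: the class and method names are hypothetical, * is taken to match any run of
    characters (including none), ? exactly one character, and matching is case-insensitive.

```java
import java.util.regex.Pattern;

/** Sketch: wildcard terms with * and ?, rejected when they start with a wildcard. */
final class WildcardSketch {

    /** Translates a wildcard term such as can*n into an equivalent regular expression. */
    static Pattern toPattern(String wildcard) {
        if (wildcard.startsWith("*") || wildcard.startsWith("?")) {
            throw new IllegalArgumentException("A wildcard cannot be the first character");
        }
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");                               // any run of characters
            } else if (c == '?') {
                regex.append('.');                                // exactly one character
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));   // literal character
            }
        }
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    /** True if the whole term matches the wildcard expression. */
    static boolean matches(String wildcard, String term) {
        return toPattern(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("can*n", "Canon"));
        System.out.println(matches("S?R", "SLR"));
    }
}
```

    Rejecting a leading wildcard mirrors note iii): without a fixed prefix, every term in the
    index would have to be tested against the pattern.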

    Phrase Search:

    Query: Feature: (Compatible With Select Canon Digital SLR Cameras)

    VI. EFFICIENCY EVALUATION

    The system was able to retrieve 1000 results in under 2 seconds (approximately). We used a
    dataset of more than 30,000 records to evaluate its performance. The system was also
    evaluated with a load testing tool called Web Performance Load Tester. Since the application
    was deployed on a single Tomcat server, there were some failures when the system was load
    tested with 7 concurrent users. This problem can be addressed with Tomcat clustering, in
    which several Tomcat servers are clustered and the traffic is load balanced across them.

    The output of the tool is published below.

    Performance Goal Analysis

    Page Duration

    Page Completion Rate

    Transaction (URL) Completion Rate

    Failures

    Bandwidth Consumption

    Waiting Users

    Summarized by the selected user levels, this table shows some of the key metrics that reflect
    the performance of the test as a whole.

    Time Based Analysis:

    Page Duration:

    Page Completion Rate

    Transaction (URL) Completion Rate

    Failures

    Bandwidth Consumption

    Waiting Users

    Test summary metrics

    Sorted by the elapsed test time, this table shows some of the key metrics that reflect the
    performance of the test as a whole.

    VII. CONCLUSION

    This paper has introduced an efficient keyword search engine over an XML dataset that ranks
    results by the total number of reviews, the dates on which the reviews were posted and the
    rating of each product. The data was extracted from the XML file with DOM and SAX parsers.
    The advanced search was implemented with the operators AND, OR and NOT. The paper then
    explained phrase search and wildcard search, which enhance the user's search experience. A
    detailed performance analysis of the system was presented with graphs and tables.

    VIII. FUTURE WORKS

    i) Based on metrics such as click-through rate and conversion rate, the system can be trained
    to provide more relevant results.

    ii) Better personalization based on user profiles. By considering user activity over time,
    the system can provide better personalized results for each user.
