phrase based indexing
DESCRIPTION
Slide on Phrase Based Indexing concepts.
TRANSCRIPT
![Page 1: Phrase Based Indexing](https://reader035.vdocuments.us/reader035/viewer/2022062418/5552e043b4c905014c8b4d4f/html5/thumbnails/1.jpg)
Phrase Based Indexing
By Bala Abirami
![Page 2: Phrase Based Indexing](https://reader035.vdocuments.us/reader035/viewer/2022062418/5552e043b4c905014c8b4d4f/html5/thumbnails/2.jpg)
• Introduction to Phrase Based Indexing
• What is Phrase Based Indexing?
• Background of Invention
• Summary of Invention
• Spam Detection
![Page 3: Phrase Based Indexing](https://reader035.vdocuments.us/reader035/viewer/2022062418/5552e043b4c905014c8b4d4f/html5/thumbnails/3.jpg)
Introduction
• An information retrieval system uses phrases to index, retrieve, organize and describe documents.
• It is described in a US patent application filed by Google engineer Anna Lynn Patterson.
• Application filed: July, 2004
• Published: January, 2006
![Page 4: Phrase Based Indexing](https://reader035.vdocuments.us/reader035/viewer/2022062418/5552e043b4c905014c8b4d4f/html5/thumbnails/4.jpg)
Background of Invention
• Information retrieval systems, generally called search engines, are now an essential tool for finding information in large scale, diverse, and growing corpuses such as the Internet.
• A document is retrieved in response to a query containing a number of query terms, typically based on having some number of query terms present in the document.
• The retrieved documents are then ranked according to other statistical measures, such as frequency of occurrence of the query terms, host domain, and link analysis.
![Page 5: Phrase Based Indexing](https://reader035.vdocuments.us/reader035/viewer/2022062418/5552e043b4c905014c8b4d4f/html5/thumbnails/5.jpg)
Cont…
• Concepts are often expressed in phrases, such as "Australian Shepherd," "President of the United States," or "Sundance Film Festival".
• Accordingly, there is a need for an information retrieval system and methodology that can identify phrases, index documents according to phrases, and search and rank documents in accordance with their phrases.
![Page 6: Phrase Based Indexing](https://reader035.vdocuments.us/reader035/viewer/2022062418/5552e043b4c905014c8b4d4f/html5/thumbnails/6.jpg)
Summary
An information retrieval system and methodology uses phrases to index, search, rank, and describe documents in the document collection.
1. Identifying Phrases and Related Phrases
2. Indexing Documents w.r.t. Phrases
3. Ranking Documents w.r.t. Phrases
4. Creating description for the document
5. Elimination of Duplicate Documents
![Page 7: Phrase Based Indexing](https://reader035.vdocuments.us/reader035/viewer/2022062418/5552e043b4c905014c8b4d4f/html5/thumbnails/7.jpg)
Identifying Phrases and Related Phrases
• Phrases are identified based on their ability to predict the presence of other phrases in a document.
• The system looks for phrases that have frequent and/or distinctive usage.
• A prediction measure is used to identify related phrases.
• The prediction measure relates the actual co-occurrence rate of two phrases to the expected co-occurrence rate of the two phrases.
• Information gain = actual co-occurrence rate / expected co-occurrence rate
![Page 8: Phrase Based Indexing](https://reader035.vdocuments.us/reader035/viewer/2022062418/5552e043b4c905014c8b4d4f/html5/thumbnails/8.jpg)
Cont…
• Two phrases are related to each other when the prediction measure exceeds the prediction threshold.
• Example: the phrase “President of the United States” predicts related phrases such as “White House” and “George Bush”.
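The information gain ratio can be sketched in a few lines of Python. This is only an illustration under stated assumptions: rates are estimated as fractions of documents, the expected co-occurrence rate assumes the two phrases occur independently, and the threshold of 1.5 and the toy `docs` collection are hypothetical values chosen for the example.

```python
from itertools import combinations

def information_gain(docs, a, b):
    """Ratio of actual to expected co-occurrence rate of phrases a and b.

    docs is a list of sets; each set holds the phrases found in one document.
    Assumption: the expected rate is rate(a) * rate(b), i.e. independence.
    """
    n = len(docs)
    rate_a = sum(a in d for d in docs) / n
    rate_b = sum(b in d for d in docs) / n
    actual = sum(a in d and b in d for d in docs) / n
    expected = rate_a * rate_b
    return actual / expected if expected else 0.0

def related_pairs(docs, threshold=1.5):
    """Pairs of phrases whose information gain exceeds the (hypothetical) threshold."""
    phrases = sorted(set().union(*docs))
    return [(a, b) for a, b in combinations(phrases, 2)
            if information_gain(docs, a, b) > threshold]

# Toy collection, made up for illustration.
docs = [
    {"president of the united states", "white house"},
    {"president of the united states", "white house"},
    {"australian shepherd"},
    {"sundance film festival"},
]
print(related_pairs(docs))
```

In this toy collection the two phrases each appear in half of the documents and always together, so the actual rate (0.5) is twice the expected rate (0.25).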
![Page 9: Phrase Based Indexing](https://reader035.vdocuments.us/reader035/viewer/2022062418/5552e043b4c905014c8b4d4f/html5/thumbnails/9.jpg)
Indexing documents based on related Phrases
• The information retrieval system indexes the documents in the collection by their valid, or "good", phrases.
• Posting list = the documents that contain the phrase
• Second list = data indicating which of the related phrases of the given phrase are also present in each document containing the given phrase
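One way to picture the two lists is a mapping from each phrase to the documents containing it, with a row of flags per document marking which related phrases are also present. The sketch below is an assumption about layout, not the patented data structure; `build_index`, `docs`, and `related` are hypothetical names.

```python
from collections import defaultdict

def build_index(docs, related):
    """Build a phrase index.

    docs:    doc_id -> set of phrases found in that document
    related: phrase -> ordered list of its related phrases (hypothetical input)

    Returns phrase -> {doc_id: [flag, ...]}. The keys of the inner dict act as
    the posting list; the flags act as the second list, one flag per related
    phrase, True when that related phrase also occurs in the document.
    """
    index = defaultdict(dict)
    for doc_id, phrases in docs.items():
        for phrase in phrases:
            flags = [r in phrases for r in related.get(phrase, [])]
            index[phrase][doc_id] = flags
    return dict(index)

docs = {
    1: {"president of the united states", "white house"},
    2: {"president of the united states"},
}
related = {"president of the united states": ["white house", "george bush"]}

index = build_index(docs, related)
print(index["president of the united states"])
```

Document 1 contains "white house" but not "george bush", so its flags are `[True, False]`; document 2 contains neither.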
![Page 10: Phrase Based Indexing](https://reader035.vdocuments.us/reader035/viewer/2022062418/5552e043b4c905014c8b4d4f/html5/thumbnails/10.jpg)
Ranking
• Ranking documents is based on two factors:
1. Ranking documents based on contained phrases
2. Ranking documents based on anchor phrases
• Document Score = Body Hit Score + Anchor Hit Score
• For example: Body Hit Score = 0.30, Anchor Hit Score = 0.70
• Document Score = 0.30 + 0.70 = 1.00
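The scoring rule is a plain sum, so ranking reduces to sorting by that sum. A minimal sketch, assuming the body and anchor hit scores have already been computed elsewhere; the document ids and the second pair of scores are made up for illustration.

```python
def document_score(body_hit_score, anchor_hit_score):
    """Document Score = Body Hit Score + Anchor Hit Score."""
    return body_hit_score + anchor_hit_score

def rank(scores):
    """scores: doc_id -> (body_hit_score, anchor_hit_score).

    Returns the document ids ordered from highest to lowest document score.
    """
    return sorted(scores, key=lambda d: document_score(*scores[d]), reverse=True)

# "doc_a" uses the values from the slide; "doc_b" is hypothetical.
scores = {"doc_b": (0.20, 0.10), "doc_a": (0.30, 0.70)}
print(rank(scores))
```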
![Page 11: Phrase Based Indexing](https://reader035.vdocuments.us/reader035/viewer/2022062418/5552e043b4c905014c8b4d4f/html5/thumbnails/11.jpg)
Phrase Extension
• The information retrieval system is also adapted to use the phrases when searching for documents in response to a query.
• A user may enter an incomplete phrase in a search query, such as "President of the".
• Incomplete phrases such as these may be identified and replaced by a phrase extension, such as "President of the United States."
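Phrase extension can be approximated as a prefix lookup against the set of known phrases. The sketch below only finds candidate extensions; how the system chooses among several candidates is not covered in these slides, and `extend_phrase` and `known_phrases` are hypothetical names.

```python
def extend_phrase(query, known_phrases):
    """Return the known phrases that start with the incomplete query phrase.

    Matching is case-insensitive and requires at least one further word
    after the query text.
    """
    prefix = query.lower().strip() + " "
    return sorted(p for p in known_phrases if p.lower().startswith(prefix))

known_phrases = [
    "President of the United States",
    "Australian Shepherd",
    "Sundance Film Festival",
]
print(extend_phrase("President of the", known_phrases))
```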
![Page 12: Phrase Based Indexing](https://reader035.vdocuments.us/reader035/viewer/2022062418/5552e043b4c905014c8b4d4f/html5/thumbnails/12.jpg)
Descriptions for Documents
• Phrase information is used to create a description of a document.
• The system identifies the query phrases, related phrases, and phrase extensions present in each sentence and keeps a count for each sentence.
• It ranks the sentences by this count.
• It selects some number of top-ranking sentences as the description and includes it in the search results.
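The sentence-ranking steps can be sketched as follows. This is a simplified illustration: it treats query phrases, related phrases, and phrase extensions as one flat list, counts case-insensitive substring matches, and uses a hypothetical `top_n` parameter for the number of sentences to keep.

```python
def describe(sentences, phrases, top_n=2):
    """Pick the top_n sentences that contain the most phrase occurrences.

    sentences: the document's sentences, in document order
    phrases:   query phrases, related phrases, and phrase extensions
    Ties keep document order because sorted() is stable.
    """
    def count(sentence):
        text = sentence.lower()
        return sum(text.count(p.lower()) for p in phrases)

    return sorted(sentences, key=count, reverse=True)[:top_n]

sentences = [
    "The President of the United States lives in the White House.",
    "The weather was mild.",
    "The White House is in Washington.",
]
phrases = ["president of the united states", "white house"]
print(describe(sentences, phrases))
```

Here the first sentence matches two phrases, the third matches one, and the second matches none, so the first and third form the description.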
![Page 13: Phrase Based Indexing](https://reader035.vdocuments.us/reader035/viewer/2022062418/5552e043b4c905014c8b4d4f/html5/thumbnails/13.jpg)
Eliminating Duplicate documents
• Duplicate documents are identified and eliminated either while crawling or when processing a search query.
• The description of every document is stored in association with that document in a hash table.
• For a newly crawled page, the system hashes its description and looks it up among the values stored in the hash table. A match indicates that the current document is a duplicate.
• The system keeps the document with the higher page rank or greater document significance and removes the duplicate, which will not appear in future search results for any query.
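A rough sketch of description-based duplicate detection is shown below. The use of SHA-256, the `significance` number standing in for page rank or document significance, and all function names are assumptions made for illustration only.

```python
import hashlib

def description_key(description_sentences):
    """Hash a document description (a list of sentences) to a fixed-size key."""
    joined = " ".join(description_sentences)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def add_document(table, doc_id, description_sentences, significance):
    """Record a document unless an equal description is already stored.

    table maps description hash -> (doc_id, significance). On a match, only
    the document with the higher significance is kept.
    Returns the id of the document kept for this description.
    """
    key = description_key(description_sentences)
    existing = table.get(key)
    if existing is None or significance > existing[1]:
        table[key] = (doc_id, significance)
    return table[key][0]

table = {}
first = add_document(table, "page_a", ["Same description."], 0.4)
second = add_document(table, "page_b", ["Same description."], 0.9)
third = add_document(table, "page_c", ["Another description."], 0.1)
print(first, second, third, len(table))
```

After the three calls the table holds two entries: `page_b` replaced `page_a` because it shares the description and has the higher significance.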
![Page 14: Phrase Based Indexing](https://reader035.vdocuments.us/reader035/viewer/2022062418/5552e043b4c905014c8b4d4f/html5/thumbnails/14.jpg)
![Page 15: Phrase Based Indexing](https://reader035.vdocuments.us/reader035/viewer/2022062418/5552e043b4c905014c8b4d4f/html5/thumbnails/15.jpg)
Functions of Indexing System
• Identifies phrases in documents
• Indexes documents according to their phrases by accessing various websites
Functions of Front End Server
• Receives queries from a user
• Provides those queries to the search system
![Page 16: Phrase Based Indexing](https://reader035.vdocuments.us/reader035/viewer/2022062418/5552e043b4c905014c8b4d4f/html5/thumbnails/16.jpg)
Functions of Searching System
• Searches for documents relevant to the search query
• Identifies the phrases in the search query
• Ranks the documents
Functions of Presentation system
• Modifies the search results, including removing duplicate content.
• Generates topical descriptions of documents and provides the modified search results.
![Page 17: Phrase Based Indexing](https://reader035.vdocuments.us/reader035/viewer/2022062418/5552e043b4c905014c8b4d4f/html5/thumbnails/17.jpg)
Spam Detection
• “Spam” pages have little meaningful content, but may instead be made up of large collections of popular words and phrases. These are sometimes referred to as “keyword stuffed pages”.
• Pages containing specific words and phrases that advertisers might be interested in are often called “honeypots,” and are created for search engines to display along with paid advertisements.
![Page 18: Phrase Based Indexing](https://reader035.vdocuments.us/reader035/viewer/2022062418/5552e043b4c905014c8b4d4f/html5/thumbnails/18.jpg)
Cont…
• A phrase based indexing system knows the number of related phrases in a document.
• A normal, non-spam document will generally have a relatively limited number of related phrases, typically on the order of between 8 and 20, depending on the document collection.
• A spam document will have an excessive number of related phrases, for example on the order of between 100 and 1000 related phrases.
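The related-phrase count lends itself to a simple threshold test. The sketch below assumes a map from each phrase to its related phrases is already available; the default threshold of 100 is taken from the lower end of the spam range quoted above and is an illustrative choice, not a value the system is known to use.

```python
def count_related_phrases(doc_phrases, related):
    """Count distinct phrases in the document that are related to another
    phrase also present in the same document.

    related: phrase -> iterable of related phrases (hypothetical input)
    """
    present = set(doc_phrases)
    found = set()
    for phrase in present:
        found.update(r for r in related.get(phrase, ()) if r in present)
    return len(found)

def is_probable_spam(doc_phrases, related, threshold=100):
    """Flag a document whose related-phrase count reaches the threshold."""
    return count_related_phrases(doc_phrases, related) >= threshold

related = {"a": ["b", "c"], "b": ["a"]}
doc_phrases = ["a", "b", "d"]
print(count_related_phrases(doc_phrases, related))
```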
![Page 19: Phrase Based Indexing](https://reader035.vdocuments.us/reader035/viewer/2022062418/5552e043b4c905014c8b4d4f/html5/thumbnails/19.jpg)
Advantages of Phrase Based Indexing
• Detecting Duplicate Pages
• Spam Detection
• Save time
![Page 20: Phrase Based Indexing](https://reader035.vdocuments.us/reader035/viewer/2022062418/5552e043b4c905014c8b4d4f/html5/thumbnails/20.jpg)
Other Patent Applications
• Phrase identification in an information retrieval system
• Phrase-based searching in an information retrieval system
• Phrase-based generation of document descriptions
• Detecting spam documents in a phrase based information retrieval system
• Efficient Phrase Based Document Indexing for Document Clustering
![Page 21: Phrase Based Indexing](https://reader035.vdocuments.us/reader035/viewer/2022062418/5552e043b4c905014c8b4d4f/html5/thumbnails/21.jpg)
According to data collected from users of European Web analytics provider OneStat, most people use 2- or 3-word queries in search engines:
• Two-word phrases: 28.38 percent
• Three-word phrases: 27.15 percent
• Four-word phrases: 16.42 percent
• One-word phrases: 13.48 percent
• Five-word phrases: 8.03 percent
• Six-word phrases: 3.67 percent
• Seven-word phrases: 1.63 percent
• Eight-word phrases: 0.73 percent
• Nine-word phrases: 0.34 percent
• Ten-word phrases: 0.16 percent
![Page 22: Phrase Based Indexing](https://reader035.vdocuments.us/reader035/viewer/2022062418/5552e043b4c905014c8b4d4f/html5/thumbnails/22.jpg)
Thank you