Beauty of IR
DESCRIPTION
Information Retrieval is about how we can search and retrieve things. In this talk, we look at the various components that make up a typical search engine and discuss the associated challenges.
TRANSCRIPT
Beauty of IR
Venkatesh Vinayakarao
An IR enthusiast!
Venkatesh Vinayakarao 2
Disclaimer
Most examples and discussions in this talk revolve around well-known search engines. This is just to make for a good learning experience. Please keep in mind that IR goes beyond search engines.
25+ slides of interesting discussion ahead…
2/2014
Quiz
1. Explain any two challenges in Query Intent Understanding using examples, and discuss why it is a hard problem.
2. How are “Tiles”, as discussed in the class, used in search engines? What purpose do they serve?
3. Search engines have no UI-related design concerns. True/False?
About Me
BE Computer Science (Y2K)
MS (IT)
IT Service Industry
Start Up
Nokia
Yahoo
Microsoft (Bing)
PhD
Let me learn everything all over again!
Our Agenda: The Beauty of IR!
Crawling, Content Processing, Indexing
Me!
Query (Intent) Understanding
Ranking, User Interface
Offline Horror!
Online Terror!
How to process Korean queries for local listings?
Crawling
How frequently should we crawl? Fresh & Super-Fresh!
How to crawl cricket scores? Are we even crawling here?
How to avoid 404 (Page not found)?
How much time did it take Google to show your first personal page?
Content Processing
Good Read: https://getlisted.org/static/resources/local-search-data-providers.html
Content Processing
Query: “Schools in Delhi” Answer: “Delhi Public School” Good or Bad?
Query: “Schools in Hyderabad” Answer: “Delhi Public School” Good or Bad?
Query: “Hotels in Bombay” Answer: “Grand Hyatt, Mumbai” Good or Bad?
How do we get the same results for both Mumbai and Bombay?
Query: “Maruti Car service in Delhi” Answer: “Rana Motors Private Limited”. What happened?
Content Processing & Indexing
A real example: http://www.yelp.com/dataset_challenge/
Enriched Business
• Category synonyms (e.g., “auto service” and “car service” are interchangeable at times)
• Users’ query forms (e.g., McDonalds is commonly queried as McD)
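One way to picture this enrichment is a small normalization step applied to both queries and business records, so that “Bombay” and “Mumbai” (or “McD” and “McDonalds”) meet at the same canonical form. The sketch below is only an illustration: the alias tables and the `normalize_query` function are hypothetical, and a real system would presumably mine such mappings from data rather than hard-code them.

```python
# Hypothetical alias tables, hard-coded only for illustration.
CITY_ALIASES = {"bombay": "mumbai"}
BRAND_ALIASES = {"mcd": "mcdonalds"}
CATEGORY_SYNONYMS = {"auto service": "car service"}


def normalize_query(query: str) -> str:
    """Lower-case the query and map known aliases to a canonical form."""
    # Collapse whitespace and lower-case so lookups are case-insensitive.
    text = " ".join(query.lower().split())

    # Multi-word category phrases are replaced first (naive substring match).
    for phrase, canonical in CATEGORY_SYNONYMS.items():
        text = text.replace(phrase, canonical)

    # Single-token aliases: city names and common brand short forms.
    tokens = []
    for token in text.split():
        token = CITY_ALIASES.get(token, token)
        token = BRAND_ALIASES.get(token, token)
        tokens.append(token)
    return " ".join(tokens)


bombay = normalize_query("Hotels in Bombay")
mumbai = normalize_query("hotels in Mumbai")
```

With these tables, both hotel queries normalize to the same string, so they can hit the same index entries.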
Derived Values & Indexing
Given a location, how will you find all businesses within 1km radius?
Query: schools near govindpuri delhi
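A naive answer to the radius question is to compute the great-circle distance from the query location to every business and keep those within 1 km. The sketch below uses the haversine formula for that; the function names, the sample businesses, and their coordinates are all made up for illustration. At real scale, a linear scan like this would likely be replaced by some spatial index computed offline.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def businesses_within(center, businesses, radius_km=1.0):
    """Return names of businesses within radius_km of center (linear scan)."""
    lat, lon = center
    return [name for name, blat, blon in businesses
            if haversine_km(lat, lon, blat, blon) <= radius_km]


# Hypothetical data: (name, latitude, longitude). Not real listings.
sample = [
    ("School A", 28.5330, 77.2600),  # roughly 0.33 km north of the center
    ("School B", 28.5600, 77.2600),  # roughly 3.3 km north of the center
]
nearby = businesses_within((28.5300, 77.2600), sample)
```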
Query Understanding Challenge
Need a team of 3 people and one laptop.
Volunteers?
Rules
I will give an entity name. You will have to frame at least three different (dissimilar) queries, and as many more as you can, that return the same document as the correct result in first place.
At the end, you should submit: your queries, and the maximum number n of top results that stayed the same across them.
You will have 5 minutes.
Questions
Tom Cruise
Aishwarya Rai
Tom Hanks
Srikanta Bedathur
Venkatesh Vinayakarao
Pankaj Jalote
Amir Khan
Andre Agassi
Manmohan Singh
Query Understanding
Query: Michael Jordan
Which MJ to return? The basketball player or the actor?
Factors: user profile, query context (session details, browser data, links, etc.), …
Query: Delhi School
What does the user want? “Delhi Public School” or “Schools in Delhi” or “some Indian school in US”?
Query: “IR”
Predict the top three results.
Ok! I give up!!
A frustrated search user: “please show me some t-shirt brands”
More fun with auto completion
System Overview (Simplified)
[Diagram components]
User Query
Front-end (four instances)
Query Understanding, Query Classifiers
Expanded Query
Web Answer, Local Answer, Finance Answer, Tech Answer & many more
KB
Index Serve
Crawled Content
Crawler
Web
Ranking & Relevance
How do we know if the document is relevant (in web search context)?
Popularity of URL
Domain score (is it .ac.in or .edu?)
TF, IDF
Entity, chain entity?
Trust factor (Wikipedia?)
Inlinks/outlinks
Position of query terms
Sequence of query terms
… and 1000s of such things
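Of the signals listed, TF and IDF are the easiest to show in a few lines. The sketch below scores documents against a query using one common TF-IDF variant (length-normalized term frequency times the log of inverse document frequency). The function name and the tiny corpus are illustrative assumptions, and a production ranker would combine this with many other features.

```python
import math
from collections import Counter


def tf_idf_scores(query, docs):
    """Score each document against the query with a simple TF-IDF sum."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        if not tokens:
            scores.append(0.0)
            continue
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if df[term] == 0:
                continue  # term never seen in the corpus
            idf = math.log(n_docs / df[term])
            score += (tf[term] / len(tokens)) * idf
        scores.append(score)
    return scores


# Toy corpus, loosely based on the examples in this talk.
docs = ["delhi public school", "schools in hyderabad", "hotels in delhi"]
scores = tf_idf_scores("delhi school", docs)
```

Note that “schools” does not match “school” here; without stemming, the second document scores zero, which hints at why content processing matters before ranking.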
Are current search engines good at relevance & ranking?
Bing, Google
Query1: Vegetarian hotels in south delhi
Query2: South Indian hotels in south delhi
…More examples
Query3: South Indian restaurants in south delhi
What’s the difference between query2 and query3? Should search engines give different results?
How far for a coffee?
Google: Just one word (iiitd) missing. So what?
Let’s change the query to “coffee shops near iiitd delhi”.
“Coffee shops near me” gives results from Janakpuri, Gurgaon, CP & Kamla Nagar.
Why is it hard?
What makes Ranking & Relevance hard?
User Interface
Is UI important for a search engine?
Maps in local results
Live sport score cards
Finance tickers
Filters
Search operators
Entity infoboxes
What impact do these have?
Evaluation
Various evaluation methods
Precision/Recall
Mean Average Precision (MAP)
Mean Reciprocal Rank (MRR)
If the first relevant doc is at the kth position, RR = 1/k.
NDCG
Non-Boolean/graded relevance scores
$\mathrm{DCG} = r_1 + \frac{r_2}{\log_2 2} + \frac{r_3}{\log_2 3} + \dots + \frac{r_n}{\log_2 n}$
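The reciprocal rank rule above (RR = 1/k for the first relevant result at position k) can be sketched in a few lines. The function names and the sample judgments below are illustrative assumptions; each inner list marks, per result position, whether that result was relevant (1) or not (0) for one query.

```python
def reciprocal_rank(relevance):
    """Return 1/k for the first relevant result at position k, else 0."""
    for k, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / k
    return 0.0


def mean_reciprocal_rank(runs):
    """Average the reciprocal rank over several queries."""
    return sum(reciprocal_rank(r) for r in runs) / len(runs)


# Hypothetical judgments for three queries.
runs = [
    [1, 0],        # first relevant at position 1 -> RR = 1
    [0, 1],        # first relevant at position 2 -> RR = 1/2
    [0, 0, 0, 1],  # first relevant at position 4 -> RR = 1/4
]
mrr = mean_reciprocal_rank(runs)
```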
NDCG - Example
4 documents: d1, d2, d3, d4
i | Ground Truth        | Ranking Function1   | Ranking Function2
  | Document Order | ri | Document Order | ri | Document Order | ri
1 | d4             | 2  | d3             | 2  | d3             | 2
2 | d3             | 2  | d4             | 2  | d2             | 1
3 | d2             | 1  | d2             | 1  | d4             | 2
4 | d1             | 0  | d1             | 0  | d1             | 0
NDCG(GT) = 1.00, NDCG(RF1) = 1.00, NDCG(RF2) = 0.9203
Taken from http://www.stanford.edu/class/cs276/handouts/EvaluationNew.ppt
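The numbers in this example can be reproduced with a short script. The sketch below assumes the DCG form used in this talk (the first grade undiscounted, later grades divided by log2 of their position) and normalizes by the DCG of the ideal ordering. The function and variable names are my own.

```python
import math


def dcg(relevances):
    """DCG = r1 + r2/log2(2) + r3/log2(3) + ... + rn/log2(n)."""
    total = 0.0
    for i, r in enumerate(relevances, start=1):
        total += r if i == 1 else r / math.log2(i)
    return total


def ndcg(relevances):
    """Normalize DCG by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Graded relevance from the ground truth in the example.
grades = {"d1": 0, "d2": 1, "d3": 2, "d4": 2}

rf1 = ["d3", "d4", "d2", "d1"]  # Ranking Function1
rf2 = ["d3", "d2", "d4", "d1"]  # Ranking Function2

ndcg_rf1 = ndcg([grades[d] for d in rf1])
ndcg_rf2 = ndcg([grades[d] for d in rf2])
print(round(ndcg_rf1, 4), round(ndcg_rf2, 4))  # 1.0 0.9203
```

Ranking Function1 swaps d3 and d4, which have the same grade, so its score stays at 1.00. Ranking Function2 pushes a grade-2 document down to position 3 and is penalized.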
Are we done?
Q & A