Open source search tools for conference: Applications of Open Search Tools tutorial
DESCRIPTION
Presentation by Ted Drake and Rosie Jones for the WWW2010 conference in North Carolina. It covers open source search software, APIs, and trends.
TRANSCRIPT
Applications of Open Search Tools: WWW2010 Tutorial
Rosie Jones and Ted Drake
Yahoo! Inc
April 26th, 2010
- 2 -WWW 2010 Tutorial Open Search ToolsDrake & Jones
Introductions
Schedule
2:00 – 2:15 Introductions and Overview Rosie & Ted
2:15 – 2:30 Motivation – state of the industry Ted Drake
2:30 – 3:00 Indexing and Search Rosie & Ted
3:00 – 3:30 Hello World! Using Search Service APIs & Examples Ted Drake
3:30 – 4:00 Coffee Break
4:00 – 4:30 Mashup Patterns Ted Drake
4:30 – 5:00 Ranking and Evaluation Rosie Jones
5:00 – 5:30 Discussion, Questions Ted & Rosie
Caveat
• There is a lot of open search software out there!
• This tutorial is breadth-oriented, and example driven
– And therefore necessarily kind of shallow
For the slides: http://www.slideshare.net/7mary4
Motivation
State of the Industry - Mashups
Programmable Web: Resource for API and Mashup development
• 10 new search mashups every month (average)
• 62 search APIs (as of April 25, 2010)
State of the Industry - Healthy Market
1,500 search related companies on TechCrunch
Open Source Technology Reduces Barriers
• Yahoo! Query Language – Select * from (insert your desire)
– Built in cache, threading, authentication
– Easily extended with Open Tables
• Hadoop
– Yahoo Distribution of Hadoop includes patches and updates
– Your Hadoop installation can perform at your current scale
• All the way up to Yahoo scale
• Open Source Search Engines
– Lemur
– Lucene
Motivation II: Tools for Academic Papers
In Academia: Paper in WWW 2010
• Highlighting Disputed Claims on the Web
Rob Ennals, Beth Trushkowsky, John Mark Agosta, Tye Rattenbury, Tad Hirsch
The server uses Yahoo BOSS2 to search the web for snippets that resemble a paraphrase entered by the user.
In Academia: Papers from SIGIR 2008
• Towards breaking the quality curse: a web-querying approach to web people search [Kalashnikov et al SIGIR 2008]
– Web as external corpus
– Use Yahoo! API to retrieve
• Emulating query-biased summaries using document titles [Joho et al SIGIR 2008]
– Yahoo!, Google, Terrier (TREC)
More Publications using Open Source Search Engines
• Affective feedback: an investigation into the role of emotions in the information seeking process [ Arapakis et al SIGIR 2008]
– Use Indri to parse and retrieve TREC newswire and web collections
• [Jung et al IP&M 2007]
– Last clicked document is predictor of relevance (used Nutch search engine on university website)
• Minimal test collections for retrieval evaluation [Carterette et al SIGIR 2006]
– Indri, Lemur, Lucene, Mg, SMART, Zettair
Web Search Architecture
Offline
• Crawlers: find documents, follow links, fetch freshest content, build graph of hyperlinks
• Indexers: process text and meta-data
• Index: text and meta-data, compressed, for quick lookup
Runtime
• Retrieval: find documents containing query words
• Ranking
• Interface
What is Open Search
Open Source Search and Open Search
Open source code lets you build your own search engine
Open search lets you leverage existing commercial search engines
Why Open Search?
#!/usr/local/bin/perl -w
use LWP::Simple;                 # exports get()
$searchResultPage = get($url);   # fetch the raw result page
process($searchResultPage);      # parse the HTML yourself
…
Curl (php)
Javascript…
Scraping Modules
http://search.cpan.org/~jfriedl/Yahoo-Search-1.10.13/lib/Yahoo/Search.pm
Do I Look Like A Piece of Bad Software?
Information Superhighway for Known Robots
Search engine may stop accepting requests from your IP, or just slow down service
Scrape with Search Engine’s Blessing
• http://code.google.com/apis/ajaxsearch/
• http://msdn.microsoft.com/en-us/library/dd251056.aspx
• http://developer.yahoo.com/search/boss/
MUCH MORE DETAIL IN THE NEXT SECTION!
Other Parts to the Search Process
• Indexing
– Indexing algorithms
– Access to the index – what is overall document frequency? What if I rank differently using the index?
• Presentation
– User interface effects
• Existing Open Search Platforms Can Get You Started
Indexing Your Own Content
Task of Indexing
• Store document contents in format that allows quick lookup
• Invest time offline
– For fast runtime access
• Runtime task
– Given the current query
– Which subset of documents should we spend time ranking
Brute Force Document Scoring
• Check every document in collection to see if it contains any query terms
– Most documents don’t contain any of the query terms
– Look at query terms to see which documents to consider
[Figure: Inverted Index. For the query "open search ted drake", each query term (open, search, ted, drake) points to its posting list of document IDs, drawn from D1, D3, D6, D8, D9, D15, D32, D46, …]
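The inverted-index idea can be sketched in a few lines of Python. This is an illustration only, not code from the tutorial: the toy documents and the helper names `build_index` and `candidates` are hypothetical. Each term maps to a posting list of document IDs, and a query only touches documents that appear on the posting lists of its terms.

```python
from collections import defaultdict

def build_index(docs):
    """Offline step: map each term to a sorted posting list of doc IDs."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def candidates(index, query):
    """Runtime step: union of the posting lists of the query terms."""
    found = set()
    for term in query.lower().split():
        found.update(index.get(term, []))
    return sorted(found)

# Toy collection (hypothetical contents, for illustration only).
docs = {
    1: "open search tools",
    3: "ted drake talks about search",
    6: "coffee break",
    9: "open source search by ted drake",
}
index = build_index(docs)
print(index["search"])
print(candidates(index, "open search ted drake"))
```

Document 6 contains none of the query terms, so it is never looked at when the query runs; that is the saving over brute-force scoring.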
High Level Comparison
Platform  License   Lang.  Docs             Ranking        Users       Parallel  Scale
Lucene    Apache    Java   Many             Flexible       Amazon      Yes       TB
zettair   BSD like  C      HTML, TREC, TXT  Flexible       Research    No        TB
Indri     BSD like  C++    Many             Very Flexible  Research    Yes       TB
Sphinx    GPL       C++    Many             Flexible       craigslist  Yes       TB
RDBMS     BSD, GPL  C      SQL Text         Limited        -           Maybe     GB
Xapian    GPL       C++    Many             Flexible       gmane       Yes       TB
Previous Benchmarks (Middleton+Baeza-Yates 07)
Open Search Benchmarking
• http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter/
– An over the weekend experiment to make code examples
Benchmarks
• Not enough comparative benchmarks out there
• Hard to do; we really need standards
– Optimize each platform, per hardware and data set
– Lot of platforms, with different APIs, options and numerical settings
• Need good diverse data sets, small & large
• Hard to please
– Winners & losers in benchmarks; lot of biases
– Always room for improvement
• Really evolutionary to nail benchmarks
– It’s an Open Source project
• http://github.com/zooie/opensearch/tree/master
In action
Lucene
Sphinx
Indri
All the code examples are here: http://github.com/zooie/opensearch/tree/master
Lucene
• Lot of industrial support w/ proven scalability
– Amazon, Netflix, Wikipedia
• An IR Library in Java
– There’s also pyLucene & CLucene
• Use Nutch, Solr or Hounder for the rest
– Crawlers, result abstracts…
Lucene Indexing
Lucene Search
javac -cp /lucenedir/lucene-2.4.1/lucene-core-2.4.1.jar:. Index.java
java -Xmx512m -cp /lucenedir/lucene-core-2.4.1.jar:. Index
Sphinx
• Runs Craigslist Search
• MySQL integration focus
– But also supports an XML input pipe
• Pretty fast indexer
• searchd, indexer commands
• Mostly declarative option setting (sphinx.conf)
• Client API (python, Java, ruby, php) sockets to searchd
Sphinx Indexing
• SQL text columns or XML input
• sphinx.conf
• indexer --quiet --config sphinx.conf medindex
Sphinx Search
Socket connection to searchd Sphinx service
Indri
• Lemur Project
– http://www.lemurproject.org/
• Powerful Structured Query Language
• Advanced Language Models
• Native C++; SWIG-generated bindings for Java and PHP
• Command line binaries
• Developer resources
– http://lemur.wiki.sourceforge.net/
Indri: Hello World
• Index & Search directory of txt files
• IndriBuildIndex -index=/Users/viksi/sigir/med_data/indri_index -corpus.path=/Users/viksi/sigir/med_data/indri_data -corpus.class=txt -memory=300m
– http://www.lemurproject.org/lemur/indexing.php#IndriBuildIndex
• IndriRunQuery -index=/Users/viksi/sigir/med_data/indri_index -count=100 -rule="method:dirichlet,mu:2500" -query="#weight(1.0 #uw2(chest pain) 2.0 #1(heart attack))"
– http://www.lemurproject.org/lemur/IndriQueryLanguage.php
Indexed Info in Search API
Index - Structured Meta Data
SearchMonkey:
Yahoo! SearchMonkey captures the structured data from web sites for the index.
• RDF
• Microformats
Index - Social
Social:
• Delicious saves/tags
• FOAF (Friend of a Friend), XFN
• Recent social activity: Twitter, Facebook, Buzz, Blogs…
Index – Machine Tags
• Keyterms
• Mis-Spelling
• Content Enrichment
• Inbound Links
Hello, World!
Open Search Service APIs
Photo by Oskay
Roadmap of APIs
• Google
• Bing
• BOSS
• Twitter
• YQL
• Live examples
Photo by Scorpions and Centaurs
Google AJAX Search
• Javascript Widget or API
• REST API:
– http://ajax.googleapis.com/ajax/services/search/{vertical}?v=1.0&q={query}
• Web, Local, Video, Blogs, News, Books, Images, Patents
• Can’t modify results though
Google Custom Search
• Turn-key product
• Bulk load 1000s site restricts; On-demand 24 hour Web Indexing
• Iframe or Custom Search Element results for developers; XML for enterprise
Bing 2.0 API
• Multiple Sources; Batch support
– Web, Images, InstantAnswer, Phonebook, RelatedSearch, Spell
• Usage: http://api.search.live.net/json.aspx?AppId={appid}&Market=en-US&Query={query}&Sources=web+spell&Web.Count=1
• Can modify (w/ some restrictions, i.e. re-ranking, blending with non-Bing sources)
Yahoo! BOSS
• BOSS = Build your Own Search Service
• Open Yahoo’s core search features via web services to let 3rd parties revolutionize Search
• Unrestricted
Unrestricted?
• Unlimited queries
• Blend, re-order, discard
• Full Presentation control
• Limited only by your imagination
BOSS API
• Usage
– http://boss.yahooapis.com/ysearch/{vert}/v1/{q}?appid={appid}&start=0&count=10&lang=en&format=xml&view=keyterms
• Verticals
– Web, News, Images, Spelling
• In query syntax
– inurl, url, intitle, site, AND/OR, “-”, “+”
• Notable web view fields
– Delicious bookmarks
– SearchMonkey (microformats)
– Larger abstracts
– Extracted Entities (keyterms)
• Can modify
Web = Cross Platform
• Google AJAX, Bing, BOSS
• HTTP GET, URI => XML, JSON
• Any programming lang. that supports HTTP
• Many language specific libraries available
– Web Search “[platform] [language]”
– e.g. “yahoo boss python”
• Mobile: HTML web apps work on all smart phones
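Because all three services are plain HTTP GET endpoints, a request is just a filled-in URL. The sketch below is illustrative only: it reuses the URL templates shown on the Google AJAX, Bing and BOSS slides, while the function name `build_url` and the placeholder app ID are hypothetical. These endpoints date from 2010 and may no longer respond, so the code only builds the URL and sends nothing.

```python
from urllib.parse import quote

# URL templates copied from the slides; {appid} stands for your own API key.
TEMPLATES = {
    "google_ajax": "http://ajax.googleapis.com/ajax/services/search/{vertical}?v=1.0&q={query}",
    "bing": "http://api.search.live.net/json.aspx?AppId={appid}&Market=en-US&Query={query}&Sources=web+spell&Web.Count=1",
    "boss": "http://boss.yahooapis.com/ysearch/{vert}/v1/{q}?appid={appid}&start=0&count=10&lang=en&format=xml&view=keyterms",
}

def build_url(service, query, appid="YOUR_APP_ID", vertical="web"):
    """Fill a slide template with a URL-encoded query (no request is sent)."""
    encoded = quote(query, safe="")
    return TEMPLATES[service].format(
        vertical=vertical, vert=vertical, query=encoded, q=encoded, appid=appid
    )

print(build_url("boss", "open search"))
```

Fetching the result would then be a single HTTP GET in any language, followed by XML or JSON parsing.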
Platforms
Yahoo! YQL
• select * from internet API (e.g. flickr, ebay, amazon)
– http://developer.yahoo.com/yql/
– many standard & “open tables” services
Amazon Web Services (AWS)
• Amazon Cloud Support
• Amazon SimpleDB, Relational Database Services
• E-Commerce Fulfillment Services
• Messaging
• Monitoring
• Networking
• Payments & Billing
• Storage
• Workforce: Amazon Mechanical Turk
Large scale functionality at startup prices
Google App Engine
• Free application hosting (up to 5 million pv/month)
• Java, Ruby, or Python
• Extensive SDK support
• Distributed Data Storage (up to 500 mb for free)
Examples
BOSS Out in the Open
• http://www.xurch.com
• http://search.techcrunch.com
• http://www.spysee.jp
• http://www.123people.com
• http://www.pipl.com
• http://tweetnews.appspot.com
• http://bossy.appspot.com
• http://www.hakia.com
• http://oneriot.com
• http://www.daylife.com
• http://www.inquisitorx.com/
• http://insiderfood.com/
• http://ask-boss.appspot.com/
• http://www.4hoursearch.com
• http://www.devunity.com (Techcrunch 50)
• http://copyrightspot.com/ (Mashable)
• http://imusicmash.com (Mashable)
• http://truevert.com (Mashable)
• http://professeurs.esiea.fr/wassner/?2008/10/20/171-semantic-calculator
• http://www.ysearchblog.com/archives/000613.html
• http://www.ysearchblog.com/archives/000621.html
– DNS Mashup
– BuildASearch
– PlayerSearch
– V3GGIE
– Dipidity Newsline
– Tianamo
Google Custom Search Examples
• CopyScape – Looks for sites copying your text
• Topicalizer – Extracts topics, finds related information from text
Bing Examples
• Site Search Engine by a Microsoft engineer
– http://nathanbuggia.com/blog/post/Custom-Site-Search-Engine-Using-the-Live-Search-API.aspx
Coolest Features Across the Board
• BOSS se_link (graphs), delicious (bookmarks), keyterms (extracted entities), searchmonkey (rdfa, microformats, structured abstracts)
• Yahoo! YQL
• Bing Video, Translation, Instant Answer, Batch
• Google CSE large site restricts, refinements
• Google AJAX Transliteration, Blogs, Books
Mashups
Let’s Build Something
• TweetNews
– http://tweetnews.appspot.com/search?q=twitter
– “the best mashup we’ve ever seen” (Wired)
• Tools
– BOSS, BOSS Mashup Framework, Google App Engine, Python 2.5
• Source
– http://vik.singh.googlepages.com/fresh.zip
Digression: TF-IDF for Ranking
• TF = Term Frequency
– Documents in which the query terms occur often tend to be relevant
• IDF = Inverse Document Frequency
– Words that are in every document aren’t as important
• The, of, “click here”, “home page”
– Document frequency: number of documents containing this term
– Divide by document frequency: Inverse Document Frequency
• Sort by TF * IDF to get a ranking over documents
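A minimal Python sketch of TF * IDF ranking follows. It is not taken from the tutorial's code. It assumes the common log-scaled variant, IDF = log(N / document frequency), rather than a plain division, and the function name `tf_idf_rank` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf_rank(docs, query):
    """Rank document IDs by the sum of TF * IDF over the query terms."""
    n_docs = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    # Document frequency: number of documents containing each term.
    df = Counter()
    for terms in tokenized.values():
        df.update(set(terms))
    scores = {}
    for doc_id, terms in tokenized.items():
        tf = Counter(terms)
        scores[doc_id] = sum(
            tf[t] * math.log(n_docs / df[t])   # assumed IDF variant: log(N / df)
            for t in query.lower().split()
            if df[t] > 0
        )
    return sorted(scores, key=scores.get, reverse=True)

docs = {
    "d1": "the home page of the search tutorial",
    "d2": "the search engine ranks search results",
    "d3": "the coffee break",
}
ranking = tf_idf_rank(docs, "the search")
print(ranking)
```

Here "the" occurs in every document, so its IDF is log(1) = 0 and it contributes nothing; the ranking is driven entirely by how often "search" occurs.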
TweetNews Model
• Goal: Inject relevance in latest news search results
• Approach:
– Fetch latest (order by date) news results for query
– Also fetch latest tweets for query (search.twitter.com)
– Vectorize each Twitter and News search result
– Euclidean Normalized TFIDF document vector of term:freq pairs
– Compute cosine sim between each twitter & news result vector
– Assign tweet to news result if sim >= threshold
– Sort news results by # of related tweets
• WWW2010 similar paper
– Time is of the Essence: Improving Recency Ranking Using Twitter Data
Anlei Dong, Ruiqiang Zhang, Pranam Kolari, Bai Jing, Yi Chang, Fernando Diaz, Zhaohui Zheng, Hongyuan Zha
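The approach above can be sketched roughly as follows. This is not the TweetNews source: to stay short it uses raw term frequencies instead of TF-IDF weights, works on hard-coded strings rather than live API results, and the threshold of 0.3 and all function names are assumptions for illustration.

```python
import math
from collections import Counter

def unit_vector(text):
    """Term-frequency vector scaled to Euclidean length 1."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}

def cosine(u, v):
    """Dot product of two unit vectors equals their cosine similarity."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def rank_news_by_tweets(news, tweets, threshold=0.3):
    """Sort news items by the number of tweets similar enough to each."""
    tweet_vecs = [unit_vector(t) for t in tweets]
    counted = []
    for item in news:
        nv = unit_vector(item)
        related = sum(1 for tv in tweet_vecs if cosine(nv, tv) >= threshold)
        counted.append((related, item))
    counted.sort(key=lambda pair: pair[0], reverse=True)
    return counted

news = ["new phone released today", "volcano ash grounds flights"]
tweets = [
    "stuck at airport volcano ash",
    "ash cloud grounds all flights",
    "lunch time",
]
result = rank_news_by_tweets(news, tweets)
print(result)
```

The volcano story attracts two related tweets and moves to the top, even though it came second in the original list.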
TweetNews Main Source
Non-Search: delicious Classifier
Usage
• &view=delicious_toptags
• &view=delicious_saves
Idea: Liberal v. Conservative Classifier
1. Generate politics queries list
• Mine Reuters or editors
2. BOSS search each; take top 1k results
3. Filter on tag ‘liberal’ or ‘conservative’; assemble binary training set
4. Features
• “&abstract=long”, “&view=keyterms, delicious_saves, searchmonkey_rss”, title, url, date, se_link # inbound links
Mashup: Related terms
• Delicious users can tag web sites they bookmark.
• Get a ranked list of tags for a general topic
• select delicious_toptags,title from search.web where query="hadoop" and view="delicious_toptags"
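A YQL statement like the one above travels to the service as an ordinary URL parameter. The sketch below only builds such a URL and sends nothing. The endpoint is an assumption based on YQL's public REST interface as it existed around 2010 (it does not appear on the slides and may no longer respond), and `yql_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumption: YQL's public REST endpoint circa 2010; not taken from the slides.
YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql"

def yql_url(statement, fmt="json"):
    """Build a GET URL that carries a YQL statement in the q parameter."""
    return YQL_ENDPOINT + "?" + urlencode({"q": statement, "format": fmt})

statement = ('select delicious_toptags,title from search.web '
             'where query="hadoop" and view="delicious_toptags"')
print(yql_url(statement))
```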
Mashup – Social Impact
• What are your friends buzzing, digging, tagging…
• YQL: select * from social.connections.updates where guid=me
• Use data to find more recent and relevant information
• Lijit creates a vertical search engine based on a user’s delicious, facebook, and other saved bookmarks
• WWW2010 Related Paper: Liquid Query: Multi-domain Exploratory Search on the Web Marco Brambilla, Alessandro Bozzon, Stefano Ceri, Piero Fraternali
• Now it’s time to turn on the FIRE HOSE
Mashup – The Fire Hose
Mashup – Government Data
• Guardian’s World Government Data Collection http://www.guardian.co.uk/world-government-data
– U.S. Unemployment Statistics
– U.S. Aviation Accidents
– Raw Data for U.S. Department of Energy (DOE) Categorical Exclusion (CX) Determinations Under the National Environmental Policy Act (NEPA)
– Treasury Recovery Act Data
– Migratory Bird Flyways - Continental United States
Coming Soon: Twitter Annotations
Metadata for tweets
Step 1. Create link for users to tweet your page.
Step 2. Insert metadata into each tweet
Step 3. Pull that information back and mash with other data.
Example
• Yahoo! Finance has a tweet this stock link.
• Insert information (ticker:yhoo) into the tweet’s metadata.
• Follow the distribution of this metadata and look for correlations in stock price activity. Perhaps a new line on Finance Charts.
Mashup – Open Tables on YQL
– Define new API definitions
– Open Source in GitHub
– Server-side JavaScript allows Insert and more
– Allows for private keys
Mashup – Open Tables on YQL
<?xml version="1.0" encoding="UTF-8"?>
<table xmlns="http://query.yahooapis.com/v1/schema/table.xsd">
  <meta>
    <author>Nagesh Susarla</author>
    <documentationURL>See search.web and search.images for more details</documentationURL>
  </meta>
  <bindings>
    <select itemPath="results.result" produces="XML">
      <inputs>
        <key id="query" type="xs:string" paramType="query" required="true"/>
      </inputs>
      <execute><![CDATA[
        var qs = query;
        var search = y.query('select * from search.web(50) where query=@query', {query: qs}).results;
        var images = [];
        default xml namespace='http://www.inktomi.com/';
        for each (var result in search.result) {
          images.push(y.query('select * from search.images(1) where query=@query and url=@url', {url:result.url, query:qs}));
        }
        var i = 0;
        for each (var result in search.result) {
          var image = images[i].results.result;
          if (image) {
            result.image = <image>{image}</image>;
          }
          i++;
        }
        response.object = search;
      ]]></execute>
    </select>
  </bindings>
</table>
Mashup – Using an Open Table
Blending Vertical + Service
Comprehensiveness!
Every Search Engine should be a One-Stop Shop
Delicious Blending Idea
• Goal: Blend delicious + web results
• Approach:
– 1000s BOSS Web Queries, Filter w/ delicious_saves
– Training set: x: search features | y: delicious count
– Machine learn the transfer function
• Infer the delicious count for any web result
• Can now normalize the two search result sets
Hack Ideas
Discovery (BOSS Search App Store)
• Designing a fairer marketplace for app distribution
• Emerging problem for Facebook, iPhone App Store
Desktop, Data Visualization (Cooliris, Inquisitor)
Mobile (iPhone, Android, BlackBerry)
• Passive Location/Contextual Based Search
Social (Facebook, Twitter, OpenSocial, Friend Connect, OneConnect)
Semantic
• BOSS keyterms, SearchMonkey
• Bing Instant Answers
• Google CSE Refiners
Ranking
Retrieval and Ranking
• RETRIEVE the documents matching simple conditions
– Boolean AND on query terms
– TF-IDF
– …
• RANK using more sophisticated function
– Term proximity
– Page authority
– Author identity
– …
Ranking with Open Source Tools
• Indri/Lemur
– Language modeling
– BM25, Okapi, Cosine similarity, inQuery
• Lucene
– TF-IDF, weighted by term occurrences
– Fielded search
• Terrier
– Okapi BM25, language modeling and TF-IDF
– Divergence from Randomness
• Your own re-ranking code using open search
Evaluation with Click Logs
Evaluating with Clicks
People click on the good results, right?
Not All Results Are Equally Likely to be Looked At
(Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)
Clicks and Views Depend on Rank
[Joachims et al, 2005]
Evaluation from Click Logs
• Show a screenshot and me doing a “skip first”
Read from top to bottom
[Joachims et al SIGIR 2005]
Mining Clicks for Ranking
• Clicks can be used to predict
– Pairwise preference
• Query: Doc1, Doc2 [ Joachims 2002]
– Absolute relevance
• Taking clicks on other documents into account
• [Carterette and Jones, NIPS 2007]
• [Chapelle and Zhang, WWW 2009]
Interleaving for Learning from Clicks – Pairwise Judgments
• [Joachims, KDD 2002]
• [Radlinski and Joachims, KDD 2007]
• [Radlinski et al, CIKM 2008]
Results from Method 1 Results from Method 2
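One interleaving scheme from this line of work is team-draft interleaving, described in [Radlinski et al, CIKM 2008]. The Python below is a simplified sketch under stated assumptions, not the authors' implementation: function names are hypothetical, each input ranking is assumed to contain no duplicates, and a seeded random generator is passed in only to make the example repeatable.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng=None):
    """Merge two rankings and remember which method contributed each result."""
    rng = rng or random.Random()
    interleaved, team = [], {}
    picks = {"A": 0, "B": 0}
    rankings = {"A": ranking_a, "B": ranking_b}
    all_docs = set(ranking_a) | set(ranking_b)
    while len(interleaved) < len(all_docs):
        # The method with fewer picks goes next; ties are broken by coin flip.
        if picks["A"] != picks["B"]:
            side = "A" if picks["A"] < picks["B"] else "B"
        else:
            side = rng.choice(["A", "B"])
        # Take that method's highest-ranked result that is not yet shown.
        doc = next((d for d in rankings[side] if d not in team), None)
        if doc is None:  # this method has nothing left; let the other pick
            side = "B" if side == "A" else "A"
            doc = next(d for d in rankings[side] if d not in team)
        interleaved.append(doc)
        team[doc] = side
        picks[side] += 1
    return interleaved, team

def credit(team, clicked):
    """Count clicks per method; more clicks suggests the preferred ranking."""
    wins = {"A": 0, "B": 0}
    for doc in clicked:
        wins[team[doc]] += 1
    return wins

method_1 = ["d1", "d2", "d3"]
method_2 = ["d2", "d4", "d1"]
merged, team = team_draft_interleave(method_1, method_2, random.Random(0))
print(merged, credit(team, clicked=[merged[0]]))
```

Users see a single merged list, and each click is credited to the method that contributed the clicked result, which yields a pairwise preference between the two methods.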
Evaluation using Discounted Cumulative Gain
• Discounted Cumulative Gain (DCG)
• [Järvelin and Kekäläinen 2000]
Highly relevant: Value = 3
Somewhat relevant: Value = 2
Tangentially relevant: Value = 1
Irrelevant: Value = 0
Most important: Value = 1
Less important: Value = 1/log(i)
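The values above combine into a single score by summing gain times rank discount. A small Python sketch, assuming a base-2 logarithm (the slide does not state the base) and leaving ranks up to the base undiscounted, in the spirit of [Järvelin and Kekäläinen 2000]; the function name `dcg` is illustrative.

```python
import math

def dcg(gains, base=2):
    """Discounted cumulative gain of graded relevance values in rank order."""
    total = 0.0
    for rank, gain in enumerate(gains, start=1):
        # Ranks up to `base` are not discounted; later ranks get 1/log_base(rank).
        discount = 1.0 if rank <= base else 1.0 / math.log(rank, base)
        total += gain * discount
    return total

good_order = [3, 2, 1, 0]   # highly relevant result shown first
bad_order = [0, 1, 2, 3]    # same results, reversed
print(dcg(good_order), dcg(bad_order))
```

The same four results score higher when the highly relevant one is ranked first, which is the behaviour the discount is meant to reward.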
Directly Modeling Relevance From Clicks
[Figure: two candidate rankings, each listing Rank 1 to Rank 5, annotated with click counts]
Is DCG1 > DCG2?
P(DCG1 > DCG2)
Which ranking of web pages is better for the query “NIPS 2007”?
[Carterette and Jones, NIPS 2007]
Ingredients for Learning from Clicks
• Sufficient users
• Ability to record results shown
• Ability to vary presentation order
• Ability to vary results shown
• Ability to log clicks
• Ability to run experiments varying your secret sauce
How to Get Search Engine Results to Modify?
• Radlinski and Joachims
• citeseer/arXiv.org results and permuted rankings, recorded clicks, skip above, skip next
• See also their open source engine Osmot
– http://radlinski.org/osmot/
Query Logs
• Might be in /etc/httpd/logs/access_log*; check httpd.conf
• [IP] - - [Time] "[Method] [URI] [Version]" [Code] [Bytes] "[Referrer]" "[User-Agent]"
– 10.66.91.231 - - [08/Jun/2009:21:24:44 -0700] "GET /search?q=awesome+presentation HTTP/1.1" 200 2940 "http://i_was_referred_from_here.com" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.10) Gecko/2009042315 Firefox/3.0.10 Ubiquity/0.1.4"
• Tip: Instrument as much as possible in GET URI via CGI parameters
– search?q=yahoo&region=us&tab=local&device=mobile&advanced=1
– One log, avoid joins; URI must be < 2k bytes
• grep, cut, uniq, wc, sort, cat are your friends
– Ex. Count user query sessions (session key = IP+hour)
– sudo grep '/search?q=' /etc/httpd/logs/access_log.1 | cut -d' ' -f1,4 | cut -d':' -f1,2 | sort | uniq | wc -l
• For advanced SQL processing on single machine: sqlite3 import script
– http://selinap.com/2008/04/python-parse-apache-log-to-sqlite-database/
• Distributed: Hadoop & Pig
– http://www.cloudera.com/blog/2009/06/17/analyzing-apache-logs-with-pig/
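The session count from the shell example can also be sketched in Python. This is an illustration, not the tutorial's code: the regular expression assumes Apache's combined log format as shown in the sample line, the search path `/search?q=` is the one from the example, and the function name `count_sessions` and the shortened sample lines are made up.

```python
import re

# Combined Log Format: IP - - [time] "request" status bytes "referrer" "agent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<version>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def count_sessions(lines):
    """Count distinct (IP, date + hour) pairs among search requests."""
    sessions = set()
    for line in lines:
        m = LOG_LINE.match(line)
        if not m or not m.group("uri").startswith("/search?q="):
            continue
        # "08/Jun/2009:21:24:44 -0700" -> "08/Jun/2009:21"
        hour = ":".join(m.group("time").split(":")[:2])
        sessions.add((m.group("ip"), hour))
    return len(sessions)

sample = [
    '10.66.91.231 - - [08/Jun/2009:21:24:44 -0700] "GET /search?q=awesome+presentation HTTP/1.1" 200 2940 "-" "Mozilla/5.0"',
    '10.66.91.231 - - [08/Jun/2009:21:30:02 -0700] "GET /search?q=open+search HTTP/1.1" 200 3100 "-" "Mozilla/5.0"',
    '10.66.91.231 - - [08/Jun/2009:22:01:10 -0700] "GET /search?q=hadoop HTTP/1.1" 200 2800 "-" "Mozilla/5.0"',
    '10.0.0.7 - - [08/Jun/2009:21:25:00 -0700] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
print(count_sessions(sample))
```

Collecting the keys in a set removes duplicates regardless of line order, so no separate sort step is needed.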
Other Wishlist Items
• A good baseline
– Motivate your users to use your engine
– More fun than reading newspaper stories from 1997
• Evaluate something that is different from ranking
– Summarization
– Information extraction
• Or improve on existing ranking
• NLP tasks “take top results and do X…”
– Data mining
• Pseudo-relevance feedback
Reasons to Build a Demo
“Eat Your Own Dogfood”
Algorithm design and testing
- allows you to improve without labeled data
- look closely at the results
- convince your advisor/funders it works!
Observe user behavior
– Cheap flight to boston
– Cheap flights to boston
– Cheap flights
– Travelocity
– Expedia
– American arlines.com
– American airlines.com
– Americanairlines.com

– Puppy
– Cute puppy
– More cute puppy picutres
More About Logs and Evaluation in Other Tutorials
• Web Search Engine Metrics (Direct Metrics to Measure User Satisfaction) – Tuesday, 2:00 PM–5:30 PM
– Ali Dasdan, Yahoo! (USA); Kostas Tsioutsiouliklis, Yahoo! (USA); Emre Velipasaoglu, Yahoo! (USA)
• Web Search/Browse Log Mining: Challenges, Methods, and Applications – Today, 9:00 AM–5:30 PM
– Daxin Jiang, Microsoft (China); Jian Pei, Simon Fraser University (Canada); Hang Li, Microsoft (
What Doesn’t Exist?
• Query log mining tools
– An opportunity for you!
• …
Other Open Source Tools
Lemur Query Log Toolbar
• Research community project for collecting query logs
– Sign up at http://lemurstudy.cs.umass.edu/
• Built and maintained by LTI CMU and CIIR UMass Amherst
• http://www.lemurproject.org/
Book on Hadoop Scale Processing Coming Out
• Ivory: A Hadoop toolkit for Web-scale information retrieval
http://www.umiacs.umd.edu/~jimmylin/ivory/docs/index.html
• Jimmy Lin
Take Home Messages
• You can evaluate with clicks
• You can collect clicks by building a useful / fun search service
• You can create a useful/fun search service using open search APIs
• You can obtain implementations of standard retrieval algorithms from open source search engines
• Modify that code with your new techniques
Pointers - Tools
[1] Indri Homepage. http://www.lemurproject.org/indri/.
[2] Lemur Toolkit Homepage. http://www.lemurproject.org/.
[3] Lucene Homepage. http://jakarta.apache.org/lucene/.
[4] Xapian Code Library Homepage. http://www.xapian.org/.
[5] Zettair Homepage. http://www.seg.rmit.edu.au/zettair/.
[6] Terrier Homepage. http://ir.dcs.gla.ac.uk/terrier/.
[7] Nutch Homepage. http://lucene.apache.org/nutch/.
[8] Sphinx Search Homepage. http://sphinxsearch.com/.
Mashup Resources
• Yahoo Developer Network: developer.yahoo.com
• Y.Q.L. : developer.yahoo.com/yql
• BOSS : developer.yahoo.com/boss
• Bing : bing.com/developers
• Google Search: code.google.com/apis/ajaxsearch/
• App Engine : code.google.com/appengine/
• A.W.S. : aws.amazon.com
• Programmable Web : programmableWeb.com
• Mashable : mashable.com
• Tech Crunch : TechCrunch.com
Acknowledgements
• Vik Singh co-wrote earlier version of this tutorial
• A few slides from Ricardo Baeza-Yates and Ben Carterette
• Andrew Tomkins, Wei Vivian Zhang, Ahmed Hassan, Eran Palmon for helpful feedback
QA