online learning for web query generation: finding documents matching a minority concept on the web...
Post on 15-Jan-2016
216 views
TRANSCRIPT
Online Learning for Web Query Generation: Finding Documents
Matching a Minority Concept on the WebRayid Ghani
Accenture Technology Labs, USA
Rosie JonesCarnegie Mellon University, USA
Dunja MladenicJ. Stefan Institute, Slovenia
Motivation
Need a collection of documents matching a particular concept
Search on the web, modify query, analyze documents, modify query,…
Repetitive, time-consuming, requires reasonable familiarity with the concept
Task
Given: 1 Document in Target Concept 1 Other Document (negative example) Access to a Web Search Engine
Create a Corpus of the Target Concept quickly with no human effort
Algorithm
Query Generator WWWSeed Docs
Filter/Classifier
Web
Word Statistics
Initial Docs
Build Query
Filter
Relevant
Non-Relevant
Learning
Query Generation
Examine current relevant and non-relavent documents to generate a query likely to find documents that ARE similar to the relevant ones and NOT similar to non-relevant ones
A Query consists of m inclusion terms and n exclusion terms e.g +intelligence +web –military
Query Term Selection Methods
Uniform (UN) – select k words randomly from the current vocabulary
Term-Frequency (TF) – select top k words ranked according to their frequency
Probabilistic TF (PTF) – k words with probability proportional to their frequency
Query Term Selection Methods
RTFIDF – top k words according to their rtfidf scores
Odds-Ratio (OR) – top k words according to their odds-ratio scores
Probabilistic OR (POR) – select k words with probability proportional to their Odds-Ratio scores
Query Parameters
4 Parameters Inclusion Term-Selection Method Exclusion Term-Selection Method Inclusion Length Exclusion Length
Example: Odds-Ratio, rtfidf, 3,6
Experimental Setup
Language: Slovenian Initial documents: 1 web page in Slovenian, 1
in English Search engine: Altavista
Evaluation
Goal: Collect as many relevant documents as possible while minimizing the cost
Cost Number of total documents retrieved from the Web Number of distinct Queries issued to the Search Engine
Evaluation Measures Percentage of retrieved documents that are relevant Number of relevant documents retrieved per unique query
Fixed Query Parameters
Fix Query Lengths and Vary Term-Selection Methods
Fix Term-Selection Methods and Vary Query Lengths
Results (Ghani et al. , SIGIR 2001): Odds-Ratio works well overall Long Queries are precise but with low recall
Why Online Learning?
Different Term-Selection Methods Excel with different Query Lengths
Best Combination of methods and lengths may change as different parts of the Web/feature space are explored
Learning Methods
Memory-Less (ML) Learning Ignore all history and only use the current performance
Long-Term Memory (LT) Learning Use all of the previous history Additive Update Rule Multiplicative Update Rule
Fading Memory (FM) Learning Use all of the history but with a decay function over time
Results
LTM LTM
Memory-Less Memory-Less
Results
Further Experiments
Other Languages Similar results with Croatian, Czech and Tagalog
Keywords Similar results when initializing with keywords
instead of documents Comparison to Altavista’s “More Like This”
Better performance than Altavista’s feature
Conclusions
Successfully able to build corpora for minority languages (Slovenian, Croatian, Czech, Tagalog) using Web search engines
Online Learning is useful in adapting to different parts of the Web space
System and Corpora are/will be available at www.cs.cmu.edu/~TextLearning/CorpusBuilder