towards context-aware search by learning a very large variable length hidden markov model from...

29
Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs Huanhuan Cao 1 , Daxin Jiang 2 , Jian Pei 3 , Enhong Chen 1 and Hang Li 2 1 University of Science and Technology of China, 2 Microsoft Research Asia, 3 Simon Fraser University

Upload: luke-sparks

Post on 27-Mar-2015

212 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs Huanhuan Cao 1, Daxin Jiang 2, Jian Pei 3, Enhong

Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov

Model from Search Logs

Huanhuan Cao1, Daxin Jiang2, Jian Pei3, Enhong Chen1 and Hang Li2

1University of Science and Technology of China, 2Microsoft Research Asia,3Simon Fraser University

Page 2: Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs Huanhuan Cao 1, Daxin Jiang 2, Jian Pei 3, Enhong

Context of User Queries

• A user usually raises multiple queries and conducts multiple rounds of interactions for an information need.

User

query query Current Query

click

click

click

Context

One round of Interaction

Page 3: Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs Huanhuan Cao 1, Daxin Jiang 2, Jian Pei 3, Enhong

An Example

• Suppose Ada plans to buy a new car and need some cars reviews.

• But she doesn’t know to formulate an effective query. Consequently, she raises a series of queries about different cars.

• No surprisingly, for each query, the review web sites are ranked low and not easy to be noticed.

Page 4: Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs Huanhuan Cao 1, Daxin Jiang 2, Jian Pei 3, Enhong

Why Context is Useful?

• Suppose we have such a search log:SID Search session

S1 Ford => Toyota => GMC => Allstate

www.autohome.com

S2 Ford cars => Toyota cars => GMC cars => Allstate

www.autohome.com

S3 Ford cars => Toyota cars => Allstate

www.allstate.com

S4 GMC => GMC dealers

www.gmc.com

Page 5: Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs Huanhuan Cao 1, Daxin Jiang 2, Jian Pei 3, Enhong

Patterns in The Search Log

• Pattern1: – 50% users clicked a cars review web site

www.autohome.com after asking a series of cars.

• Ada will have better experience if the search engine knows pattern1.

Page 6: Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs Huanhuan Cao 1, Daxin Jiang 2, Jian Pei 3, Enhong

• Pattern2:– 75% users searched for car insurances after a series of

queries about different cars.

• The search engine will provide more appropriate query suggestions and URL recommendations if it knows pattern2.

• Idea: Learning from search log to provide context-aware ranking,

query suggestion and URL recommendation.

Page 7: Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs Huanhuan Cao 1, Daxin Jiang 2, Jian Pei 3, Enhong

Related Work

Mining “wisdom of the crowds” from search logs

Improve rankingUse click-through data as implicit feedback

Query suggestion

Mining click-through data

Mining session data

Mixture: CACB

URL recomendation Mining search trials

Only CACB considers context, but:1. CACB constraints a query to one search intent2. CACB doesn’t use click information as context3. CACB can only be used for query suggestion

Page 8: Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs Huanhuan Cao 1, Daxin Jiang 2, Jian Pei 3, Enhong

Modeling Context by vlHMM(variable length Hidden Markov Model)

Page 9: Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs Huanhuan Cao 1, Daxin Jiang 2, Jian Pei 3, Enhong

Overview of Technique Details

• Definition of vlHMM• Parameters Estimation• Challenges and Strategies• Applications

Page 10: Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs Huanhuan Cao 1, Daxin Jiang 2, Jian Pei 3, Enhong

Formal Definition• Given:

– A set of hidden states {s1 … sNs};– A set of queries {q1 … qNq};– A set of URLs {u1 … uNu}; – The maximal length Tmax of state sequences

• A vlHMM is a probability model defined as follows:– The transition probability distribution Δ = {P(si|Sj)};– The initial state distribution Ψ = {P(si)};– The emission probability distribution for each state sequence Λ

= {P(q, U|Sj)};

Page 11: Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs Huanhuan Cao 1, Daxin Jiang 2, Jian Pei 3, Enhong

Parameter Estimation

• Let X = {O1…ON} be the set of training sessions, where: – On is a sequence of pairs (qn,1 ,Un,1) … (qn,Tn ,Un,Tn)

– qn,t and Un,t are the t-th query and the set of clicked URLs, respectively

– Moreover, we use un,t,k to denote the k-th URL in Un,t .

• The task is to find Θ* such that

Page 12: Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs Huanhuan Cao 1, Daxin Jiang 2, Jian Pei 3, Enhong

EM

• The original problem is in a complex form which may not have a closed-form solution.

• Alternatively, we use an iterative method: EM (Expectation Maximum).

• Objective function:

Page 13: Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs Huanhuan Cao 1, Daxin Jiang 2, Jian Pei 3, Enhong

• E-step:

• M-step:

Page 14: Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs Huanhuan Cao 1, Daxin Jiang 2, Jian Pei 3, Enhong

Challenges for Training A Large vlHMM

• Challenge1:– The EM algorithm needs a user-specified number of

hidden states. – However, in our problem, the hidden states correspond to

users' search intents, whose number is unknown.

• Strategy:– We apply the mining techniques developed by our

previous work as a prior process to the parameter learning process.

Page 15: Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs Huanhuan Cao 1, Daxin Jiang 2, Jian Pei 3, Enhong

• Challenge2: – Search logs may contain hundreds of millions of training

sessions.– It is impractical to learn a vlHMM from such a huge

training data set using a single machine.

• Strategy:– We deploy the learning task on a distributed system under

the map-reduce programming model

Page 16: Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs Huanhuan Cao 1, Daxin Jiang 2, Jian Pei 3, Enhong

• Challenge3: – Each machine needs to hold the values of all parameters. – Since the log data usually contains millions of unique

queries and URLs, the space of parameters is extremely large.

• Strategy:– we develop a special initialization strategy based on the

clusters mined from the click-through bipartite

Page 17: Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs Huanhuan Cao 1, Daxin Jiang 2, Jian Pei 3, Enhong

Applications

• Given a observation O consists of q1 … qt and U1 … Ut

• Document re-ranking: – Rank by P(u|O) = ∑ P(u|st) P(st|O)

• Query suggestion & URL-recommendation:

– Suggest top k queries with P(q|O) = ∑ P(q|st+1) P(st+1|O)– Recommend top k URLs with P(u|O) = ∑ P(u|st+1) P(st+1|O)

• The advantages of our model: unification and power of prediction.

Page 18: Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs Huanhuan Cao 1, Daxin Jiang 2, Jian Pei 3, Enhong

Experiments• A large-scale search log from Live Search– Web searches in English from the US market

• Training Data– 1,812,563,301 search queries, – 2,554,683,191 clicks– 840,356,624 sessions– 151,869,102 unique queries– 114,882,486 unique URLs.

• Test Data– 100,000 sessions extracted from another search log

Page 19: Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs Huanhuan Cao 1, Daxin Jiang 2, Jian Pei 3, Enhong

Coverage

• For each test session <(q1 ,U1)…(qT UT )>, the vlHMM deals with each qi . When i > 1, <(q1 ,U1)…(qi-1 ,Ui-1)> is used as a context.

• The total coverage is 58.3%. • Denote the set of test cases without context as Test0

and the other as Test1. • For the covered cases in Test1, 25.5% contexts are

recognized.

Page 20: Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs Huanhuan Cao 1, Daxin Jiang 2, Jian Pei 3, Enhong

Re-ranking

• Baseline: – Boost the URLs with high click times given the query.

• Evaluation:– Sample 500 re-ranking URL pairs from Test0 and from the

cases whose context are recognized in Test1, respectively.

– Each re-ranking URL pair is judged as Improved or Degraded or Unsure by 3 experts.

Page 21: Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs Huanhuan Cao 1, Daxin Jiang 2, Jian Pei 3, Enhong

The effectiveness of re-ranking by thevlHMM and Baseline1 on (a) Test0 and (b) Test1.

Page 22: Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs Huanhuan Cao 1, Daxin Jiang 2, Jian Pei 3, Enhong

Examples of Re-ranking

Search for games Up the URL about game

Visit the homepage of Ask Jeeves

Up the URL which introduces the history of Ask Jeeves

Page 23: Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs Huanhuan Cao 1, Daxin Jiang 2, Jian Pei 3, Enhong

URL Recommendation

• Baseline: – Recommend the URLs with high click times following the

current query.

• Evaluation:– “Leave-one-out" method: given <(q1 ,U1)…(qT UT )>, we use

qT-1 as the test query and consider UT as the ground truth.

– Suppose the set of recommended URLs is R, the precision is |R∩UT |/|R| and the recall is |R ∩ UT |/|UT |.

Page 24: Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs Huanhuan Cao 1, Daxin Jiang 2, Jian Pei 3, Enhong

The precision and recall of the URLs recommended by the vlHMM and Baseline2.

Page 25: Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs Huanhuan Cao 1, Daxin Jiang 2, Jian Pei 3, Enhong

An Example of URL Recommendation

Search for online store about electronics

Online store about equipments

Online store about electronics

Page 26: Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs Huanhuan Cao 1, Daxin Jiang 2, Jian Pei 3, Enhong

Query Suggestion

• Baseline:– CACB, a context-aware concept based approach of query

suggestion.

• Evaluation:– The results of two approaches are comparable since they

both consider contexts.

– However, the ratio of recognizing contexts is increased by 55% by vlHMM.

Page 27: Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs Huanhuan Cao 1, Daxin Jiang 2, Jian Pei 3, Enhong

Summary

• We propose a general approach to context-aware search by learning a vlHMM from log data.

• We tackle the challenges of learning a large vlHMM with millions of states from hundreds of millions of search sessions.

• The experimental results on a large real data set clearly show that our context-aware approach is both effective and efficient.

Page 28: Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs Huanhuan Cao 1, Daxin Jiang 2, Jian Pei 3, Enhong

• Our recent works on context-aware search:• Huanhuan Cao, Derek Hao Hu, Dou Shen, Daxin Jiang, Jian-tao

Sun, Enhong Chen and Qiang Yang. Context-aware query classification. To appear in SIGIR’09.

• Huanhuan Cao, Daxin Jiang, Jian Pei, Enhong Chen and Hang Li. Towards context-aware search by learning a large variable length Hidden Markov Model from search logs. To appear in WWW’09.

• Huanhuan Cao, Daxin Jiang, Jian Pei, Qi He, Zhen Liao, Enhong Chen and Hang Li. Context-aware query suggestion by mining click-through and session data. KDD’08, pages 875-883, 2008. (This paper won the Best Application Paper Award of KDD’08)

Page 29: Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs Huanhuan Cao 1, Daxin Jiang 2, Jian Pei 3, Enhong

Thanks