named entity recognition in query jiafeng guo 1, gu xu 2, xueqi cheng 1,hang li 2 1 institute of...

Named Entity Recognition in Query

Jiafeng Guo1, Gu Xu2, Xueqi Cheng1,Hang Li2

1Institute of Computing Technology, CAS, China2Microsoft Research Asia, China

Outline

• Problem Definition• Potential Applications• Challenges• Our Approach• Experimental Results• Summary

Outline

Problem Definition

Named Entity Recognition in Query (NERQ)Identify Named Entities in Query and Assign them into Predefined Categories with Probabilities

Harry Potter

Harry Potter WalkthroughMovie Book Game

Movie Book Game

0.50.4

0.0 1.00.0

Outline

NERQ in Searching Structured Data

GamesBooks

Unstructured Queries

Structured Databases (Instant Answers, Local Search Index, Advertisements and etc)

NERQ Module

Smarter DispatchThis query prefers the results from the “Games” databaseBetter Ranking “harry potter” should be used as key to match the records in the database, and further ranked by “walkthrough”

harry potter walkthrough

Movies

NERQ in Web Search

Search results can be better if we know that “21 movie” indicates searcher wants the movie named 21

Outline

Challenges

• NER (Named Entity Recognition) – Well formed documents (e.g. news articles)– Usually a supervised learning method based on a set of

features• Context Feature: whether “Mr.” occurs before the word• Content Feature: whether the first letter of words is capitalized

• NERQ– Queries are short (2-3 words on average)

• Less context features

– Queries are not well-formed (typos, lower cased, …)• Less content features

Outline

• Problem Definition• Motivation and Potential Applications• Challenges• Our Approach• Experimental Results• Summary

Our Approach to NERQ

• Goal of NERQ becomes to find the best triple (e, t, c)* for query q satisfying

Harry Potter Walkthrough

“Harry Potter” (Named Entity) + “# Walkthrough” (Context) te“Game” Class c

ctpecpep

)(),,(

maxarg

,,maxarg *c) t,(e,

Training With Topic Model

• Ideal Training Data T = {(ei, ti, ci)}

• Real Training Data T = {(ei, ti, *)}

– Queries are ambiguous (harry potter, harry potter review)

– Training data are a relatively few

i iii ctep ,,max

e eei c i

i c iiii c ii

ictpecpep

ctpecpepctep

max,,max

Training With Topic Model (cont.)

e eei c ii

ctpecpepmax

harry potterkung fu pandairon man…………………………………………………………………………………………………………

# wallpapers# movies# walkthrough# book price……………………………………………………………………………………

# is a placeholder for name entity. # means “harry potter” here

Book……………………

Topics

Weakly Supervised Topic Model

• Introducing Supervisions– Supervisions are always better– Alignment between Implicit Topics and Explicit Classes

• Weak Supervisions– Label named entities rather than queries (doc. class labels)– Multiple class labels (Binary Indicator)

Kung Fu Panda

Movie Game Book

Distribution Over Classes

Weakly Supervised LDA (WS-LDA)

• LDA + Soft Constraints (w.r.t. Supervisions)

• Soft Constraints

,,log, yCwpyw LDA Probability Soft Constraints

ii iz y y C ,

Document Probability on the i-th Class

Document Binary Label on the i-th Class

1 1 0 0 iyTopic

System Flow ChatOnline Offline

Set of named entities with labels

Create a “context” document for each seed and train WS-LDA

Contexts

epecp ,

Find new named entities by using obtained contexts and estimate p(c|e) using WS-LDA and p(e)

Entities

Input Query

Evaluate each possible triple (e, t, c)

Results

Outline

Experimental Results

• Data Set– Query log data• Over 6 billion queries and 930 million unique queries• About 12 million unique queries

– Seed named entities• 180 named entities labeled with four classes• 120 named entities are for training and 60 for testing

Experimental Results (cont.)

• NERQ Precision

• Named Entity Retrieval and Ranking– class distribution

• Aggregation of seed context distributions (Pasca, WWW07)• p(t |c) from WS-LDA model

– q(t |e) as entity distribution – Jensen-Shannon similarity between p(t |c) and q(t |e)

• Comparison with LDA– Class Likelihood of e:

iii ecpy

Outline

Summary

• We first proposed the problem of named entity recognition in query.

• We formulized the problem into a probabilistic problem that can be solved by topic model.

• We devised weakly supervised LDA to incorporate human supervisions into training.

• The experimental results indicate that the proposed approach can accurately perform NERQ, and outperforms other baseline methods.

THANKS!

named entity recognition in query jiafeng guo 1, gu xu 2, xueqi cheng 1,hang li 2 1 institute of...

named entities

query nerqidentify

web search search results

structured data search

local search index

nerq wil help

game database

context walkthrough

Documents

jiafeng guo(ict) xueqi cheng(ict) hua-wei shen(ict) gu xu...

mi-hunt cover types - michigandnr.com · 1 1 1 1 1 1 1 1 1...

east data processing of divertor probes on east jun wang,...

greentown china holdings limited€¦ · recommendation...

adversarial immunization for improving certifiable...

fun for all - uab barcelona · 2018. 5. 19. · fun for all...

modeling diverse relevance patterns in ad-hoc retrieval ›...

is top-k sufficient for ranking? yanyan lan, shuzi niu,...

1 a probabilistic model for bursty topic discovery in...

query rewriting for extracting data behind html forms xueqi...

young artists’ platform: anna peletsis and jiafeng chen...

virome analysis for identification of novel mammalian ......

match2: a matching over matching modelfor similar question...

beatgan: anomalous rhythm detection using adversarially...

dynamic resource pooling and trading mechanism in · pdf...

more than relevance: high utility query recommendation by...

· web viewarthropoda 2 2 2 ectoprocta 2 2 2 phronida 2 2...

· pdf filels-titan miniature din switches.. sensing...

c2 2 chemistry organic - mc3cb.com · &+ 2+ + + + 2+ 2+ 2+...

molded case circuit breakers - cloud object storage | …...