Page 1: Query Biased Snippet Generation in XML Search

Yu Huang, Ziyang Liu, Yi Chen

Arizona State University

Page 2: Snippets in Text Search

SIGMOD 2008

Snippets are widely used in text search engines to help users quickly identify relevant query results.

Page 3: Fragment of an XML Search Result

Query: find the apparel retailers in Texas.

Keyword search: Texas, apparel, retailer

[Figure: tree of the XML query result. A retailer (name: Brook Brothers, product: apparel) has the stores store1 (name: Galleria, state: Texas, city: Houston) and store2 (name: West Village, state: Texas, city: Austin), among others; each store's merchandises contain clothes nodes with fitting, situation, and category attributes.]

There can be many large search results.

Good snippets help users judge the relevance of each result quickly and easily.

Page 4: A Sample Snippet

From the snippet, we know:

• The corresponding query result contains matches to all keywords.
• The retailer is “Brook Brothers”.
• This retailer has many stores in Houston.
• The clothes featured by this retailer.

It helps us differentiate this query result from those of other apparel retailers (e.g. Carter’s).

[Figure: the sample snippet, a small tree with a retailer (name: Brook Brothers, product: apparel), a store (state: Texas, city: Houston), and clothes under its merchandises (fitting: men; situation: casual; category: outwear).]

How to generate good snippets for XML search?

There is no existing work on XML snippet generation yet.

Page 5: Challenges and Our Contributions

What are the desirable properties of a good snippet?

• Identified three properties: self-contained, distinguishable, representative

What information in the query result is significant for achieving these properties?

• Designed an algorithm to generate a ranked list of significant information: IList

How to generate a snippet that maximally covers the significant information within a size bound?

• Proved the NP-hardness of this problem
• Designed an efficient and effective algorithm for snippet generation

eXtract: the first system for snippet generation in XML search

Page 6: Roadmap

Identifying desirable properties of a good snippet

• Self-contained
• Distinguishable
• Representative

Constructing an information list (IList)

• IList is a ranked list of the significant information in the query result, chosen to achieve these properties.

Building snippets based on IList within a snippet size bound

Experimental evaluation

Conclusions

Page 7: Self-contained Snippet

Snippets should be self-contained in order to be understandable.

Text search: snippets usually preserve self-contained semantic units: phrases / sentences surrounding keyword matches.

XML search: semantic units should be preserved.

Challenge: What is a semantic unit?

Page 8: Query Result Fragment (revisited)

Data contain entities and attributes.

A self-contained snippet should contain the names of the entities whose attributes appear in the snippet.

Adding keywords and their corresponding entity names to IList.

IList: Texas, apparel, retailer, store

[Figure: fragment of the query result tree: the retailer (name: Brook Brothers, product: apparel), store1 (name: Galleria, state: Texas, city: Houston), and the clothes nodes under its merchandises with their fitting, situation, and category attributes.]

Page 9: Distinguishable Snippet

Snippets should be distinguishable, so that users can easily differentiate query results

Text search: the title of the document is included.

XML search: the “key” of the result should be included.

Challenge: What is the key of an XML search result?

Page 10: Query Result Fragment

We identify two types of entities in a query result:

• Return entities
• Support entities

Inferring return entities: an entity whose name or attribute name matches a keyword; otherwise, the highest entity.

Key of a query result: the keys of its return entities. We can mine the keys of entities.

Adding the key of the query result to IList.

IList: Texas, apparel, retailer, store, Brook Brothers

[Figure: fragment of the query result tree (retailer Brook Brothers; store1 Galleria in Houston, Texas; its clothes), with nodes labeled as return entity and support entity.]
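The inference rule for return entities can be sketched in a few lines. This is only an illustration under stated assumptions: the dictionary layout (`name`, `attributes`, `depth`) and the function name `infer_return_entities` are hypothetical, and "highest entity" is read here as the entity closest to the root of the result.

```python
def infer_return_entities(entities, keywords):
    """Return entities whose name or attribute names match a keyword.

    If nothing matches, fall back to the highest entity (smallest depth).
    `entities` is a hypothetical list of dicts with keys
    "name", "attributes" (attribute names), and "depth" (0 = root).
    """
    kw = {k.lower() for k in keywords}
    matched = [
        e for e in entities
        if e["name"].lower() in kw
        or any(a.lower() in kw for a in e["attributes"])
    ]
    if matched:
        return matched
    top = min(e["depth"] for e in entities)
    return [e for e in entities if e["depth"] == top]


# Toy description of the running example (hand-written, not mined from data).
ENTITIES = [
    {"name": "retailer", "attributes": ["name", "product"], "depth": 0},
    {"name": "store", "attributes": ["name", "state", "city"], "depth": 1},
    {"name": "clothes", "attributes": ["fitting", "situation", "category"], "depth": 3},
]

names = [e["name"] for e in infer_return_entities(ENTITIES, ["Texas", "apparel", "retailer"])]
print(names)
```

With the query keywords of the running example, the keyword "retailer" matches the entity name, so the retailer is the return entity and its key (Brook Brothers) is what goes into IList.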

Page 11: Representative Snippet

Snippets should provide summaries of the query results, so that users can quickly grasp the essence of the results

Text search: active research area;

sometimes the first and/or last sentence of a paragraph is used as a summary.

XML search: include “dominant features” of query results

Challenges:

• What are features?

• What are dominant features?

Page 12: Features of Query Result

We define a feature as a triple (entity, attribute, value).

[Figure: fragment of the query result tree (retailer Brook Brothers; store1 Galleria in Houston, Texas; its clothes).]

Some feature statistics, listed by feature type (entity: attribute: value: # of occurrences):

store: city: Houston: 6; Austin: 1; Other values (3): 3

clothes: fitting: Men: 600; Women: 360; Children: 40

clothes: situation: Casual: 700; Formal: 300

clothes: category: Outwear: 220; Suit: 120; Skirt: 80; Sweaters: 70; Other values (7): 510
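Counting features can be illustrated with Python's standard `xml.etree.ElementTree`. This is a minimal sketch, not the system's implementation: it assumes that every leaf element with text is an attribute of its parent element (the entity), and the sample XML is a small hypothetical document shaped like the running example.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_features(xml_text):
    """Count (entity, attribute, value) triples in an XML string.

    Assumption: a leaf element with non-empty text is an attribute,
    and its parent element is the entity it belongs to.
    """
    root = ET.fromstring(xml_text)
    counts = Counter()
    for parent in root.iter():
        for child in parent:
            text = (child.text or "").strip()
            if len(child) == 0 and text:  # leaf element with a value
                counts[(parent.tag, child.tag, text)] += 1
    return counts


# Hypothetical sample data, much smaller than a real query result.
SAMPLE = """
<retailer>
  <name>Brook Brothers</name>
  <product>apparel</product>
  <store>
    <name>Galleria</name>
    <state>Texas</state>
    <city>Houston</city>
    <merchandises>
      <clothes><fitting>men</fitting><situation>casual</situation></clothes>
      <clothes><fitting>men</fitting><situation>formal</situation></clothes>
    </merchandises>
  </store>
</retailer>
"""

features = count_features(SAMPLE)
for feature, n in sorted(features.items()):
    print(feature, n)
```

Grouping the resulting counts by their first two components (entity, attribute) gives per-type statistics like those listed on this slide.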

Page 13: Dominant Features of Query Result

[Figure: the query result tree fragment and the feature statistics shown on Page 12.]

A feature that occurs often is likely to be dominant, but raw frequency is not always reliable.

Dominance score: (the # of occurrences of a feature) / (the average # of occurrences of features of the same type)

Dominant features: features with dominance score ≥ 1
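The dominance score can be worked through on the feature statistics from the slides. The sketch below rests on one assumption about how to read the statistics: an entry such as "Other values (3): 3" is taken to mean three further distinct values with three occurrences in total. The function name and parameters are illustrative only.

```python
def dominance_scores(value_counts, num_other_values=0, other_occurrences=0):
    """Dominance score of each listed value within one feature type.

    score = occurrences of the feature
            / average occurrences over all features of the same type.
    `num_other_values` / `other_occurrences` describe the unlisted values
    (an assumed reading of the "Other values (k): n" entries).
    """
    total = sum(value_counts.values()) + other_occurrences
    distinct = len(value_counts) + num_other_values
    average = total / distinct
    return {value: count / average for value, count in value_counts.items()}


city = dominance_scores({"Houston": 6, "Austin": 1}, 3, 3)
fitting = dominance_scores({"Men": 600, "Women": 360, "Children": 40})
situation = dominance_scores({"Casual": 700, "Formal": 300})
category = dominance_scores(
    {"Outwear": 220, "Suit": 120, "Skirt": 80, "Sweaters": 70}, 7, 510
)

all_scores = {**city, **fitting, **situation, **category}
dominant = sorted(
    (v for v, s in all_scores.items() if s >= 1),
    key=lambda v: all_scores[v],
    reverse=True,
)
print(dominant)
```

Under this reading, Houston scores 6 / 2 = 3.0, Outwear 220 / (1000 / 11) ≈ 2.42, Men 1.8, and Casual 1.4, which agrees with the order Houston, outwear, men, casual used in the slides' IList. Suit (≈ 1.32) and Women (1.08) also score at least 1 under this reading, although the IList shown in the slides does not list them.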

Page 14: Representative Snippet

[Figure: the query result tree fragment and the feature statistics shown on Page 12.]

Adding dominant features to IList in the order of their dominance scores.

IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual
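Putting the three steps together, IList can be assembled as keywords, then entity names, then the key of the result, then dominant features by descending dominance score. The helper below is a hypothetical sketch of that ordering; removing duplicates while preserving rank is an assumption made so that "retailer" (both a keyword and an entity name) appears only once.

```python
def build_ilist(keywords, entity_names, result_keys, dominant_features):
    """Assemble a ranked information list.

    `dominant_features` is a list of (value, dominance_score) pairs;
    they are appended in descending score order. Duplicates are dropped,
    keeping the first (highest-ranked) occurrence.
    """
    ranked_features = [
        value for value, _ in sorted(dominant_features, key=lambda p: p[1], reverse=True)
    ]
    ilist = []
    for item in [*keywords, *entity_names, *result_keys, *ranked_features]:
        if item not in ilist:
            ilist.append(item)
    return ilist


ilist = build_ilist(
    keywords=["Texas", "apparel", "retailer"],
    entity_names=["retailer", "store"],
    result_keys=["Brook Brothers"],
    dominant_features=[("men", 1.8), ("Houston", 3.0), ("casual", 1.4), ("outwear", 2.42)],
)
print(ilist)
```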

Page 15: Roadmap

Identifying desirable properties of a good snippet

• Self-contained
• Distinguishable
• Representative

Constructing an information list (IList)

• IList is a ranked list of the significant information in the query result, chosen to achieve these properties.

Building snippets based on IList within a snippet size bound

Experimental evaluation

Conclusions

Page 16: Roadmap

Identifying desirable properties of a good snippet

• Self-contained: related entity names
• Distinguishable: key of query result (return entities)
• Representative: dominant features

Constructing an information list (IList)

• IList is a ranked list of the significant information in the query result, chosen to achieve these properties.

Building snippets based on IList within a snippet size bound

Experimental evaluation

Conclusions

Page 18: Instance Selection Problem

[Figure: tree of the XML query result. A retailer (name: Brook Brothers, product: apparel) has the stores store1 (name: Galleria, state: Texas, city: Houston) and store2 (name: West Village, state: Texas, city: Austin), among others; each store's merchandises contain clothes nodes with fitting, situation, and category attributes.]

IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual

Input: query result R, IList, a snippet size bound B

Output: snippet S

Instance Selection Problem: How to select node instances in R to form S so that S covers as many items in IList as possible, in ranked order, within bound B?

Page 19: Instance Selection Problem

[Figure: the same query result tree, with node instance selections labeled "Good" and "Bad".]

IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual

Input: query result R, IList, a snippet size bound B

Output: snippet S

Instance Selection Problem: How to select node instances in R to form S so that S covers as many items in IList as possible, in ranked order, within bound B?

Page 20: Instance Selection Problem

Challenges:

• The cost of covering an IList item is dynamic.
• The number of IList items that can be covered is unknown until the very end.

The Instance Selection Problem is NP-hard.

We designed an efficient and effective greedy algorithm to tackle this problem.

Page 21: Instance Selection Algorithm

[Figure: the query result tree (retailer Brook Brothers; store1 Galleria in Houston, Texas; store2 West Village in Austin, Texas; their clothes).]

IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual

Weight: 1, 1, 1, 1/2, 1/4, 1/8, 1/16, 1/32, 1/64

Path-based instance selection:

• Coverage: the entities on the path and their attributes
• Benefit: the total weight of the IList items covered
• Cost: the path length
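The weights and the benefit/cost measure can be sketched as follows. The weighting rule is inferred from the weight row on this slide (the three query keywords get weight 1, and each later item gets half the weight of the previous one); the function names and the way a path is represented are assumptions made for illustration.

```python
from fractions import Fraction


def ilist_weights(ilist, num_keywords):
    """Weight 1 for the query keywords, then 1/2, 1/4, ... by rank.

    This rule is read off the slide's weight row and is an assumption.
    """
    weights = {}
    for rank, item in enumerate(ilist):
        if rank < num_keywords:
            weights[item] = Fraction(1)
        else:
            weights[item] = Fraction(1, 2 ** (rank - num_keywords + 1))
    return weights


def benefit_and_cost(path_items, path_edges, weights):
    """Benefit = total weight of covered IList items; cost = path length in edges."""
    benefit = sum(weights.get(item, 0) for item in path_items)
    return benefit, len(path_edges)


ILIST = ["Texas", "apparel", "retailer", "store", "Brook Brothers",
         "Houston", "outwear", "men", "casual"]
WEIGHTS = ilist_weights(ILIST, num_keywords=3)

# Hypothetical path to store1 that also includes its state and city attributes.
store1_edges = [("retailer", "store1"), ("store1", "state1"), ("store1", "city1")]
benefit, cost = benefit_and_cost({"store", "Texas", "Houston"}, store1_edges, WEIGHTS)
print(benefit, cost, benefit / cost)
```

Exact fractions are used so that the halving weights can be compared without rounding concerns.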

Page 22: Instance Selection Algorithm

[Figure: the query result tree (retailer Brook Brothers; store1 Galleria in Houston, Texas; store2 West Village in Austin, Texas; their clothes).]

IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual

1. For the next uncovered item in IList, choose the path with the highest benefit/cost that covers it.

2. Update the benefits and costs of the other paths.

3. Go to step 1 until the size bound is reached or the whole IList is covered.
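The greedy loop can be sketched on a toy model. This is one possible reading of the three steps, not the eXtract implementation: each candidate path is represented by a hypothetical pair (set of tree edges, set of IList items it covers), the cost of a path is the number of edges not yet in the snippet (which is why costs change as paths are chosen), and an item whose paths no longer fit within the bound is simply skipped.

```python
def greedy_select(ilist, weights, paths, bound):
    """Greedy sketch of instance selection under an edge budget `bound`."""
    snippet_edges, covered = set(), set()
    for item in ilist:                       # step 1: next uncovered item, in rank order
        if item in covered:
            continue
        best = None
        for edges, items in paths.values():
            if item not in items:
                continue
            new_edges = edges - snippet_edges    # step 2: costs are recomputed each round
            if len(snippet_edges) + len(new_edges) > bound:
                continue
            benefit = sum(weights.get(i, 0) for i in items - covered)
            key = (not new_edges, benefit / max(len(new_edges), 1))
            if best is None or key > best[0]:
                best = (key, new_edges, items)
        if best is None:
            continue                         # assumption: skip items that do not fit
        snippet_edges |= best[1]
        covered |= best[2]
    return snippet_edges, covered            # step 3 ends when IList is exhausted


ILIST = ["Texas", "apparel", "retailer", "store", "Brook Brothers",
         "Houston", "outwear", "men", "casual"]
WEIGHTS = dict(zip(ILIST, [1, 1, 1, 1/2, 1/4, 1/8, 1/16, 1/32, 1/64]))

# Hand-made toy paths; node ids and attribute edges are illustrative only.
PATHS = {
    "retailer": ({("retailer", "name"), ("retailer", "product")},
                 {"retailer", "apparel", "Brook Brothers"}),
    "store1": ({("retailer", "store1"), ("store1", "state1"), ("store1", "city1")},
               {"store", "Texas", "Houston"}),
    "store2": ({("retailer", "store2"), ("store2", "state2"), ("store2", "city2")},
               {"store", "Texas"}),
    "clothes1": ({("retailer", "store1"), ("store1", "merchandises1"),
                  ("merchandises1", "clothes1"), ("clothes1", "fitting1"),
                  ("clothes1", "situation1")},
                 {"men", "casual"}),
    "clothes3": ({("retailer", "store2"), ("store2", "merchandises2"),
                  ("merchandises2", "clothes3"), ("clothes3", "category3"),
                  ("clothes3", "situation3")},
                 {"outwear", "casual"}),
}

edges, covered = greedy_select(ILIST, WEIGHTS, PATHS, bound=10)
print(len(edges), sorted(covered))
```

In this toy run, the path through store1 is preferred over store2 for "Texas" because it also covers "Houston", and the shared edge from retailer to store1 would make a later path to clothes1 cheaper than its full length.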

Page 24: Final Snippet

[Figure: the query result tree (retailer Brook Brothers; store1 Galleria in Houston, Texas; store2 West Village in Austin, Texas; their clothes), showing the final snippet.]

IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual

Page 25: Roadmap

Identifying desirable properties of a good snippet

• Self-contained
• Distinguishable
• Representative

Constructing an information list (IList)

• IList is a ranked list of the significant information in the query result, chosen to achieve these properties.

Building snippets based on IList within a snippet size bound

Experimental evaluation

Conclusions

Page 26: Experimental Setup

Comparing the performance of:

• Greedy algorithm for instance selection (eXtract)
• Optimal (but exponential) algorithm for instance selection
• Google Desktop

Measurements:

• Search quality
• Speed
• Scalability

Data sets: Films, Retailer

Query sets: eight queries for each data set

Page 27: Search Quality: User Study

Ten users were asked to score the snippets generated by the three approaches on the same query results.

The two approaches designed specifically for XML snippet generation score much higher than Google Desktop.

The Greedy algorithm (eXtract) scores close to the Optimal algorithm.

Page 28: Search Quality: Precision & Recall

Through another user study, the ground truth for the snippets was obtained.

The snippets generated by the Greedy algorithm in eXtract have precision and recall close to those of the Optimal algorithm.

[Figure: two charts, Precision and Recall.]

Page 29: Speed

[Figure: two charts, Film Data Set and Retailer Data Set.]

The Greedy algorithm is much faster than the Optimal algorithm.

Page 30: Scalability

[Figure: two charts, Scalability on Snippet Size (number of edges) and Scalability on Query Result Size (KB).]

The Greedy algorithm scales much better than the Optimal algorithm.

Page 31: Conclusions

The first work that generates result snippets for keyword search on XML data

Identified the desirable properties for snippets:

• Self-contained
• Distinguishable
• Representative

Designed an algorithm to generate IList as a ranked list of significant items to be included in snippets

Proved that the instance selection problem is NP-hard

Designed an efficient algorithm to cover IList in building a snippet within a size bound

Experiments verified the effectiveness and efficiency

Page 32: Thank You!

Questions?

Welcome to visit the eXtract demo at VLDB 2008

http://eXtract.asu.edu/

Page 33: Architecture of eXtract

[Figure: architecture diagram of eXtract. Components: Index Builder, XML Index, Data Analyzer, Return Entity Identifier, Query Result Key Identifier, Dominant Feature Identifier, Instance Selector. Labeled data flows: Query & Result; IList, Query Result; Result Snippet.]

Page 34: Snippets Comparison

[Figure: comparison of the snippet generated by eXtract and the snippet generated by Google Desktop. The eXtract snippet is a small tree with a retailer (name: Brook Brothers, product: apparel), a store (state: Texas, city: Houston), and clothes under its merchandises (fitting: men; situation: casual; category: outwear).]