ZIYANG LIU, Peng Sun, Yi Chen Arizona State University STRUCTURED QUERY RESULT DIFFERENTIATION

ZIYANG LIU, Peng Sun, Yi Chen
Arizona State University

STRUCTURED QUERY RESULT DIFFERENTIATION

2 KEYWORD SEARCH ON STRUCTURED DATA

Effective techniques have been developed to help users find relevant results:
  Ranking: sort the results in order of estimated relevance.
  Snippet: provide a summary of each result to help users judge relevance.

50% of keyword searches are information exploration queries, which inherently have multiple relevant results.
  Users intend to investigate and compare multiple relevant results.

How can we help users compare relevant results?

[Figure: Keywords → Search Engine (over Structured Data) → Results: Relevant Data Fragments]

[Chart: Web Search queries: 50% Navigation, 50% Information Exploration (Broder, SIGIR 02)]

3 RESULTS AND SNIPPETS

Query: “Phoenix, camera, store”

Result 1
  store
    city: Phoenix
    name: BHPhoto
    merchandises
      camera
        category: DSLR
        brand: Canon
        megapixel: 12
      camera
        category: DSLR
        brand: Sony
        megapixel: 12
      ……

Result 2
  store
    city: Phoenix
    name: Adorama
    merchandises
      camera
        category: Compact
        brand: HP
        megapixel: 14
      camera
        category: Compact
        brand: Canon
        megapixel: 12
      ……

Snippet of Result 1
  store
    city: Phoenix
    name: BHPhoto
    merchandises
      camera
        brand: Canon
        megapixel: 12
      camera
        brand: Canon

Snippet of Result 2
  store
    city: Phoenix
    name: Adorama
    merchandises
      camera
        category: Compact
        brand: Canon
        megapixel: 12

Snippets are unhelpful in differentiating query results.

(Huang et al. SIGMOD 09)

4 DIFFERENTIATION FEATURE SETS (DFS)

[Figure: the two result trees for the BHPhoto and Adorama stores, each paired with its DFS]

DFS of the BHPhoto result
  Feature Type        Value
  store: name         BHPhoto
  camera: brand       Canon, Canon, Sony
  camera: category    DSLR

DFS of the Adorama result
  Feature Type        Value
  store: name         Adorama
  camera: brand       Canon, HP
  camera: category    Compact

Feature: (entity, attribute, value)

Bank websites usually allow users to compare selected credit cards, but only with a pre-defined feature set.
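As an illustration of the (entity, attribute, value) feature notion above, the following minimal Python sketch counts features in an XML result. It assumes, purely for illustration, that every leaf element is an attribute of its parent element; the slides do not say how entities and attributes are identified, and the helper name `extract_features` and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical XML for the BHPhoto result (the "......" part is omitted).
RESULT_XML = """
<store>
  <city>Phoenix</city>
  <name>BHPhoto</name>
  <merchandises>
    <camera><category>DSLR</category><brand>Canon</brand><megapixel>12</megapixel></camera>
    <camera><category>DSLR</category><brand>Sony</brand><megapixel>12</megapixel></camera>
  </merchandises>
</store>
"""

def extract_features(root):
    """Count (entity, attribute, value) triples in an XML tree.

    Assumption: a leaf element with text is an attribute of its parent,
    and the parent is the entity. This is a simplification.
    """
    features = Counter()
    for entity in root.iter():
        for child in entity:
            is_leaf = len(child) == 0
            if is_leaf and child.text and child.text.strip():
                features[(entity.tag, child.tag, child.text.strip())] += 1
    return features

features = extract_features(ET.fromstring(RESULT_XML))
for (entity, attribute, value), count in sorted(features.items()):
    print(f"{entity}: {attribute} = {value} (x{count})")
```

The occurrence counts matter later: the validity conditions on a DFS are stated in terms of how often each feature occurs in the result.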

5 CHALLENGES OF RESULT DIFFERENTIATION

How to automatically generate DFSs that highlight the differences among results?

How to measure the quality of a set of DFSs?
  DFSs should obviously maximize the difference among results. How to quantify it?

What are other desirable properties?

Can DFSs be efficiently generated from results?

6 CONTRIBUTIONS

First work on automatically differentiating structured search results
  Application domains: online shopping, employee hiring, job/institution hunting, etc.

Identifying 3 desiderata for good DFSs

Quantifying the differentiation power of a set of DFSs

Proving the NP-hardness of DFS generation

Tackling the problem using two local optimality criteria
  Single-swap / Multi-swap optimality

Implemented XRed: XML Result Differentiation

Empirically verified the effectiveness & efficiency of XRed

7 ROADMAP

Desiderata for good DFSs

Problem definition

Local optimality and algorithms

Experiments

8 DESIDERATUM 1: BEING SMALL

A small DFS is easy for users to go through and compare with other DFSs.

The size of each DFS, |D|, cannot exceed a user-specified upper bound L:

|D| ≤ L

9 DESIDERATUM 2: SUMMARIZING QUERY RESULTS

DFSs that do not summarize the results show useless & misleading differences.

[Figure: the two result trees for the BHPhoto and Adorama stores, each paired with a one-feature DFS]

  Feature Type     DFS
  camera:brand     HP

  Feature Type     DFS
  camera:brand     Canon

Callout: This store sells only a few HP cameras.

10 DESIDERATUM 2: SUMMARIZING QUERY RESULTS

DFSs that do not summarize the results show useless & misleading differences.

[Figure: the two result trees for the BHPhoto and Adorama stores, each paired with a two-feature DFS]

  Feature Type     DFS
  camera:brand     Canon
  camera:brand     HP

  Feature Type     DFS
  camera:brand     Canon
  camera:brand     HP

Callout: This store sells only a few HP cameras.

11 DESIDERATUM 2: SUMMARIZING QUERY RESULTS

A DFS is valid only if it summarizes the corresponding result.

Dominance Ordered: features of the same type should be included in order of their number of occurrences.

Distribution Preserved: the ratio of two features in the DFS should be roughly the same as in the result.
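The two validity conditions can be sketched as simple checks on one feature type. This is only an illustration under stated assumptions: "dominance ordered" is read here as "a value may appear in the DFS only if every more frequent value of the same type also appears", and "roughly the same" is modeled with a hypothetical relative tolerance `tol`. The function names are invented for this sketch, and the counts are the Store 1 brand statistics shown on the single-swap slide.

```python
from collections import Counter
from itertools import combinations

def dominance_ordered(result_counts, dfs_values):
    """Assumed reading: no value is shown while a more frequent one is left out."""
    shown = set(dfs_values)
    if not shown:
        return True
    fewest_shown = min(result_counts[v] for v in shown)
    return all(v in shown for v, c in result_counts.items() if c > fewest_shown)

def distribution_preserved(result_counts, dfs_values, tol=0.5):
    """Assumed reading: each pairwise ratio in the DFS is within a relative
    tolerance `tol` (a hypothetical parameter) of the ratio in the result."""
    shown = Counter(dfs_values)
    for a, b in combinations(shown, 2):
        ratio_in_dfs = shown[a] / shown[b]
        ratio_in_result = result_counts[a] / result_counts[b]
        if abs(ratio_in_dfs - ratio_in_result) > tol * ratio_in_result:
            return False
    return True

def valid_for_type(result_counts, dfs_values, tol=0.5):
    return (dominance_ordered(result_counts, dfs_values)
            and distribution_preserved(result_counts, dfs_values, tol))

# Brand counts of Store 1, taken from the store statistics in the slides.
store1_brands = {"Canon": 103, "Sony": 50, "Nikon": 25, "HP": 22}

print(valid_for_type(store1_brands, ["Canon", "Canon", "Sony"]))  # True
print(valid_for_type(store1_brands, ["HP"]))                      # False
```

Under this reading, a DFS that shows only HP for Store 1 is rejected because the far more frequent Canon and Sony are missing, which is the misleading case the earlier slides warn about.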

12 DESIDERATUM 3: DIFFERENTIATING QUERY RESULTS

Differentiation unit: feature type.

A feature type t in two DFSs D1 and D2 is differentiable if
  the order of the features of type t is different, or
  the ratio of two features of type t is different.

Examples:
  D1. Camera: brand: Canon
  D2. Camera: brand: HP

  D1. Camera: brand: Canon
  D2. Camera: brand: Canon; Camera: brand: HP

  D1. Camera: brand: Canon; Camera: brand: HP
  D2. Camera: brand: Canon; Camera: brand: Canon; Camera: brand: HP

13 DESIDERATUM 3: DIFFERENTIATING QUERY RESULTS

Degree of Differentiation (DoD) of two DFSs = number of differentiable feature types.

  Feature Type       DFS
  store:name         BHPhoto
  camera:brand       Canon, Canon, Sony
  camera:category    DSLR

  Feature Type       DFS
  store:name         Adorama
  camera:brand       Canon, HP
  camera:category    Compact

DoD = 3

DoD of multiple DFSs = the sum of the DoD of every pair of DFSs.
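The DoD computation can be sketched in a few lines of Python. The representation is an assumption made for this sketch: a DFS is a dict from feature type to the list of values shown (repeats allowed), and a type counts as differentiable only when both DFSs contain it and either the order of distinct values or their shares differ. The names `profile`, `differentiable`, `dod`, and `total_dod` are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

def profile(values):
    """Return (order of distinct values, exact share of each value)."""
    order = list(dict.fromkeys(values))
    counts = Counter(values)
    shares = {v: Fraction(c, len(values)) for v, c in counts.items()}
    return order, shares

def differentiable(d1, d2, ftype):
    """Assumed rule: the type is in both DFSs and its order or shares differ."""
    if ftype not in d1 or ftype not in d2:
        return False
    return profile(d1[ftype]) != profile(d2[ftype])

def dod(d1, d2):
    """Degree of Differentiation of two DFSs."""
    return sum(differentiable(d1, d2, t) for t in d1.keys() & d2.keys())

def total_dod(dfss):
    """DoD of multiple DFSs: the sum over every pair."""
    return sum(dod(a, b) for a, b in combinations(dfss, 2))

bhphoto = {
    "store:name": ["BHPhoto"],
    "camera:brand": ["Canon", "Canon", "Sony"],
    "camera:category": ["DSLR"],
}
adorama = {
    "store:name": ["Adorama"],
    "camera:brand": ["Canon", "HP"],
    "camera:category": ["Compact"],
}

print(dod(bhphoto, adorama))  # 3
```

Shares are kept as exact fractions so that, for example, Canon, HP and Canon, Canon, HP, HP compare as the same distribution.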

14 ROADMAP

Desiderata for good DFSs

Problem definition

Local optimality and algorithms

Experiments

15 DFS GENERATION PROBLEM

Given a set of results and a size limit L, generate a DFS for each result such that
  their DoD is maximized,
  every DFS is valid (a good summary), and
  every DFS’s size does not exceed L.

We proved the NP-hardness of this problem by reduction from X3C (Exact Cover by 3-Sets).
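The problem statement above can be written compactly as an optimization problem. The symbols R_1, ..., R_n (the results) and D_1, ..., D_n (their DFSs) are notation introduced here for illustration; only L, |D|, and DoD come from the slides.

```latex
\max_{D_1,\dots,D_n} \; \sum_{1 \le i < j \le n} \mathrm{DoD}(D_i, D_j)
\quad \text{subject to} \quad
|D_i| \le L \ \text{ and } \ D_i \text{ is a valid summary of } R_i,
\qquad i = 1,\dots,n.
```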

16 ROADMAP

Desiderata for good DFSs

Problem definition

Local optimality and algorithms

Experiments

17 LOCAL OPTIMALITY

To tackle this hard problem, instead of achieving global optimality, we propose two local optimality criteria:
  Single-swap Optimality
  Multi-swap Optimality

18 SINGLE SWAP

A set of DFSs is Single-Swap Optimal if adding / changing a single feature in a single DFS (subject to validity and size limit) cannot increase the DoD.

STORE 1
  # of cameras: 200
  Category: DSLR: 188, Others: 12
  Brand: Canon: 103, Sony: 50, Nikon: 25, HP: 22
  Megapixel: 12: 160, 13: 15, 14: 20

STORE 2
  # of cameras: 150
  Category: Compact: 140, Others: 10
  Brand: Canon: 80, HP: 70
  Megapixel: 12: 105, 13: 5, 14: 19

DFS of Store 1
  Feature Type         Value
  store: name          BHPhoto
  store: city          Phoenix
  camera: megapixel    12
  camera: category     DSLR

DFS of Store 2, before the change
  Feature Type         Value
  store: name          Adorama
  camera: brand        Canon, HP
  camera: megapixel    12

DoD = 1

DFS of Store 2, after changing one feature
  Feature Type         Value
  store: name          Adorama
  camera: brand        Canon, HP
  camera: category     Compact

DoD increases to 2

Achieved Single-Swap Optimal

19 ALGORITHM FOR SINGLE-SWAP OPTIMALITY

Start from a randomly generated DFS for each result.

Repeatedly add a feature / change a feature in a DFS.

Stop when the DoD no longer increases.

Does this algorithm terminate in polynomial time?

Yes:
  The maximum possible DoD for a set of DFSs is polynomial in the number of results and feature types.
  Each iteration increases the DoD by at least 1.
  Each iteration takes polynomial time.
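The three steps above can be sketched as a first-improvement hill climb. This is a simplified illustration, not the XRed implementation: it assumes each feature type of a result comes with one precomputed valid value tuple (the hypothetical `summaries`), so a move adds a whole feature type or swaps one type for another; two DFSs are treated as differing on a type when their value tuples differ; and the size limit of 4 features is chosen only for this example.

```python
from itertools import combinations

def dod(d1, d2):
    # Simplified rule: a shared type counts when the value tuples differ.
    return sum(1 for t in d1.keys() & d2.keys() if d1[t] != d2[t])

def total_dod(dfss):
    return sum(dod(a, b) for a, b in combinations(dfss, 2))

def size(dfs):
    return sum(len(values) for values in dfs.values())

def neighbors(dfs, summary, limit):
    """Yield DFSs reachable by adding one type or swapping one type for another."""
    for t in [t for t in summary if t not in dfs]:
        added = {**dfs, t: summary[t]}
        if size(added) <= limit:
            yield added
        for old in dfs:
            swapped = {k: v for k, v in dfs.items() if k != old}
            swapped[t] = summary[t]
            if size(swapped) <= limit:
                yield swapped

def single_swap(dfss, summaries, limit):
    """Accept any single move that raises the total DoD; stop when none exists."""
    dfss = [dict(d) for d in dfss]
    improved = True
    while improved:
        improved = False
        best = total_dod(dfss)
        for i, summary in enumerate(summaries):
            for cand in neighbors(dfss[i], summary, limit):
                trial = dfss[:i] + [cand] + dfss[i + 1:]
                if total_dod(trial) > best:
                    dfss, improved = trial, True
                    break
            if improved:
                break
    return dfss

store1 = {"store:name": ("BHPhoto",), "store:city": ("Phoenix",),
          "camera:megapixel": ("12",), "camera:category": ("DSLR",),
          "camera:brand": ("Canon", "Canon", "Sony")}
store2 = {"store:name": ("Adorama",), "store:city": ("Phoenix",),
          "camera:brand": ("Canon", "HP"), "camera:megapixel": ("12",),
          "camera:category": ("Compact",)}
start1 = {t: store1[t] for t in ("store:name", "store:city",
                                 "camera:megapixel", "camera:category")}
start2 = {t: store2[t] for t in ("store:name", "camera:brand", "camera:megapixel")}

result = single_swap([start1, start2], [store1, store2], limit=4)
print(total_dod([start1, start2]), "->", total_dod(result))  # 1 -> 2
```

Every accepted move raises an integer total that is bounded above, so the loop must stop, which mirrors the termination argument on the slide.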

20 MULTI-SWAP OPTIMALITY

A set of DFSs is Multi-Swap Optimal if adding / changing any number of features in a single DFS (subject to validity and size limit) cannot increase the DoD.

STORE 1
  # of cameras: 200
  Category: DSLR: 188, Others: 12
  Brand: Canon: 103, Sony: 50, Nikon: 25, HP: 22
  Megapixel: 12: 160, 13: 15, 14: 20

STORE 2
  # of cameras: 150
  Category: Compact: 140, Others: 10
  Brand: Canon: 80, HP: 70
  Megapixel: 12: 105, 13: 5, 14: 19

DFS of Store 1, before the change
  Feature Type         Value
  store:name           BHPhoto
  store:city           Phoenix
  camera:megapixel     12
  camera:category      DSLR

DFS of Store 2
  Feature Type         Value
  store:name           Adorama
  camera:brand         Canon, HP
  camera:category      Compact

DoD = 2

DFS of Store 1, after changing multiple features
  Feature Type         Value
  store:name           BHPhoto
  camera:brand         Canon, Canon, Sony
  camera:category      DSLR

DoD increases to 3

21 ALGORITHM FOR MULTI-SWAP OPTIMALITY

Start from a randomly generated DFS for each result.

Repeatedly add / change multiple features in a DFS.

Stop when the DoD no longer increases.

Implemented naively, this algorithm has exponential time complexity!

We designed a novel dynamic programming algorithm, which takes pseudo-polynomial time.
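To give some intuition for why a pseudo-polynomial dynamic program is plausible, the sketch below re-optimizes one DFS while all others stay fixed, using a 0/1 knapsack over feature types. This is not the authors' algorithm, which the slides do not describe. It assumes one precomputed valid value tuple per feature type, a cost equal to the number of features shown, a gain equal to the number of other DFSs the type would differ from, and a size limit of 5 chosen for the example. The function name `best_dfs` is hypothetical.

```python
def best_dfs(summary, others, limit):
    """Pick feature types for one result that maximize its DoD contribution.

    Classic 0/1 knapsack: best[c] holds the best (gain, chosen types)
    achievable with at most c features. Time is O(#types * limit).
    """
    items = []
    for ftype, values in summary.items():
        # Gain: how many other DFSs show this type with different values.
        gain = sum(1 for o in others if ftype in o and o[ftype] != values)
        items.append((ftype, len(values), gain))

    best = [(0, ())] * (limit + 1)
    for ftype, cost, gain in items:
        # Iterate capacities downward so each type is used at most once.
        for c in range(limit, cost - 1, -1):
            cand = (best[c - cost][0] + gain, best[c - cost][1] + (ftype,))
            if cand[0] > best[c][0]:
                best[c] = cand
    return {t: summary[t] for t in best[limit][1]}

store1 = {"store:name": ("BHPhoto",), "store:city": ("Phoenix",),
          "camera:megapixel": ("12",), "camera:category": ("DSLR",),
          "camera:brand": ("Canon", "Canon", "Sony")}
adorama_dfs = {"store:name": ("Adorama",), "camera:brand": ("Canon", "HP"),
               "camera:category": ("Compact",)}

new_dfs = best_dfs(store1, [adorama_dfs], limit=5)
print(sorted(new_dfs))  # ['camera:brand', 'camera:category', 'store:name']
```

On the Store 1 example this drops store:city and camera:megapixel in favor of camera:brand in one step, the kind of multi-feature change that a single swap cannot make.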

22 EVALUATION

We have implemented XRed (XML Result Differentiation) and evaluated it empirically.

Data sets:
  Film (http://infolab.stanford.edu/pub/movies)
  Camera Retailer (synthetic)

Result generation: XSeek (http://xseek.asu.edu/)

DFS size limit: 10% of # of feature types

Metrics:
  Quality (DoD)
  Efficiency

Comparison system: an exponential algorithm that generates the optimal solution.

23 DFS QUALITY

[Figure: two bar charts, titled Film and Camera Retailer, plotting DoD per query (QC1 to QC8 and QF1 to QF8) for Single-Swap, Multi-Swap, and Optimal]

24 EFFICIENCY

[Figure: two bar charts, titled Film and Camera Retailer, plotting Time (s) per query (QC1 to QC8 and QF1 to QF8) for Single-Swap, Multi-Swap, and Optimal]

Result Size: 1KB ~ 9KB

# of Results: 2 ~ 52

25 SCALABILITY

[Figure: three plots for Single-Swap and Multi-Swap: Time (s) vs. Query Result Size (KB); Time (s) vs. DFS Size Limit; and a third plot vs. # of Query Results]

26 CONCLUSIONS

We initiate the problem of automatically differentiating structured query results, which is useful for information exploration queries.

We define a Differentiation Feature Set (DFS) for each result, and identify three desiderata for DFSs.

We formalize the DFS generation problem, and prove its NP-hardness.

We propose two local optimality criteria: single-swap and multi-swap, and design algorithms to efficiently achieve them.

We implemented the XRed system, and verified its effectiveness and efficiency through experiments.

27 FUTURE WORK

Result differentiation is a new area and opens opportunities for new research topics.
  Is there a better way of selecting feature types, e.g., by considering users’ interests?
  Is there a better way of measuring the quality of DFSs besides DoD?
  Are there approximation / randomized algorithms for the DFS generation problem that achieve better quality / efficiency?
