
THE WATER FILLING MODEL AND THE CUBE TEST: Multi-Dimensional Evaluation for Professional Search

Jiyun Luo1 Christopher Wing1 Grace Hui Yang1 Marti A. Hearst2

1Department of Computer Science Georgetown University Washington, DC, USA {jl1749, cpw26}@georgetown.edu [email protected] CIKM 2013

2School of Information University of California, Berkeley Berkeley, CA, USA [email protected]


INTRODUCTION

• Complicated search has recently received much attention
• Professional search activities are usually complicated search tasks
  - Examples: medical record search, legal search, patent prior art search
• Evaluation metrics need to reflect this complexity
  - U-measure for whole-session evaluation [Sakai et al., SIGIR'13]
  - Time-biased gain [Smucker and Clarke, SIGIR'12]
  - α-nDCG for diversity and novelty [Clarke et al., SIGIR'08]
  - PRES for recall-oriented search tasks [Magdy and Jones, SIGIR'10]

PROFESSIONAL SEARCH

• Rich information needs
  - Multiple aspects or subtopics
• Time-sensitive
  - It is not true that professional searchers, e.g., lawyers, are evil and want to read irrelevant documents because they are paid for their time and care only about recall
• Novelty
  - Once one relevant document has been examined, subsequent relevant documents are perceived as less relevant
• Stopping criteria
  - Once a sub-information-need has been fulfilled, further relevant documents about it contribute little
• A mix of unranked and ranked retrieval
  - Boolean search and proximity search are still popular

Patent Prior Art Search

Fenestration Segment Stent-Graft and Fenestration Method, US 20090259290 A1

ABSTRACT
A method includes deploying a fenestration segment stent-graft into a main vessel such that a fenestration section …

Claims
1. A fenestration segment stent-graft comprising: a proximal section comprising a woven graft cloth; …
2. The fenestration segment stent-graft of claim 1 wherein said proximal section comprises a proximal end and a distal end, …
3. The fenestration segment stent-graft of claim 2 wherein said attachment means comprises stitching.
…
20. A fenestration segment stent-graft comprising: a proximal section; a distal section; …
21. The fenestration segment stent-graft of claim 20 wherein said fenestration section comprises: graft material comprising loose woven fibers…

Looking for published literature that can be used to "say no" to a patent application. A granted patent should be novel and non-trivial.
• Time constraint: less than 6 hours

Claim types labeled on the slide: Independent, Dependent

We draw an analogy between professional search and filling water into a cube:

Professional Search | The Water-Filling Model
Information need with multiple subtopics | A cube with multiple segments
Goal: fulfill the information need with relevant documents as soon as possible | Goal: fill up the cube with water as soon as possible
A document can cover different subtopics | "Document water" can flow into different segments
Stop looking for more relevant documents for a subtopic or for the entire information need | Once a segment reaches its cap, no more water can go there

How do we judge whether a search system is good?
• We assume the searcher wants the subtopics of a task to be fulfilled as quickly as possible and as fully as possible

The Task Cube

• The Cube, with unit length, represents the entire information need
• Each cuboid in the Cube represents a subtopic
• The top of the Cube is the cap that limits the maximum amount of relevant information needed
  - Stopping criterion
• The bottom is segmented into different areas
  - The size of each area indicates the importance of the corresponding subtopic
  - E.g., in prior art search, independent claims are assigned more weight than dependent claims

An empty task cube for a search task with 6 subtopics

The Water Filling Model

• A newly examined relevant document raises the water level in every subtopic it is relevant to
• The height increment is the relevance gain from that document with regard to that subtopic
• The total height of the water in one cuboid represents the accumulated relevance gain for a subtopic
• The total volume of water in the task Cube is the total gain

The Cube Test

• Based on the water-filling model, we design a new multi-dimensional evaluation metric for professional search: the Cube Test (CT)
• CT calculates the rate at which a search system fills up the task cube, as much as possible
• It is a speed function

The Gain Function

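$$\mathrm{Gain}(Q, d_j) = \sum_i \mathrm{area}_i \times \mathrm{height}_{i,j} \times \mathrm{KeepFilling}_i$$

• Document $d_j$'s gain is calculated as the volume of relevant "document water" that it adds across all the subtopics it matches in the task cube.
• A more concrete equation:

$$\mathrm{Gain}(Q, d_j) = \sum_i \Gamma \, \theta_i \, \mathrm{rel}(d_j, c_i) \, \mathrm{I}\Bigl(\sum_{k=1}^{j-1} \mathrm{rel}(d_k, c_i) < \mathrm{MaxHeight}\Bigr)$$

where
- $\Gamma$ is a discounting factor for subtopic novelty, $\Gamma = \gamma^{\mathrm{nrel}(c_i, j-1)}$, where $\mathrm{nrel}(c_i, j-1)$ is the number of relevant documents for subtopic $c_i$ among the previously examined documents ($d_1$ to $d_{j-1}$),
- $\theta_i$ is the importance of the $i$-th subtopic, with $\sum_i \theta_i = 1$,
- $\mathrm{rel}(d_j, c_i)$ is the water height, i.e., document $d_j$'s relevance grade towards subtopic $c_i$,
- $\mathrm{I}$ is the indicator function,
- $\mathrm{MaxHeight}$ is the cap for subtopic relevance (set to 1).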
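The terms defined above can be combined into a small function. The following is a minimal sketch, not the authors' implementation: the dictionary representation of documents, the name `doc_gain`, and the default `gamma=0.5` are illustrative assumptions. It also assumes that the indicator checks whether the relevance accumulated for a subtopic before the current document is still below MaxHeight.

```python
from typing import Dict, List

# A document is modeled as {subtopic id: relevance grade}; this
# representation is an assumption made for illustration.
Doc = Dict[str, float]


def doc_gain(doc: Doc, previous: List[Doc], theta: Dict[str, float],
             gamma: float = 0.5, max_height: float = 1.0) -> float:
    """Gain of one document, given the documents examined before it."""
    gain = 0.0
    for subtopic, weight in theta.items():
        rel = doc.get(subtopic, 0.0)
        if rel <= 0.0:
            continue  # the document adds no water to this cuboid
        # nrel: number of earlier documents relevant to this subtopic
        nrel = sum(1 for d in previous if d.get(subtopic, 0.0) > 0.0)
        # Water height accumulated before this document (assumed form
        # of the indicator's argument)
        height_so_far = sum(d.get(subtopic, 0.0) for d in previous)
        keep_filling = 1.0 if height_so_far < max_height else 0.0
        gain += weight * (gamma ** nrel) * rel * keep_filling
    return gain


theta = {"c1": 0.6, "c2": 0.4}      # subtopic importance, sums to 1
d1 = {"c1": 0.5, "c2": 1.0}
d2 = {"c1": 0.5}
d3 = {"c2": 1.0}

first = doc_gain(d1, [], theta)     # 0.6 * 0.5 + 0.4 * 1.0
second = doc_gain(d2, [d1], theta)  # 0.6 * 0.5 * 0.5, discounted once
third = doc_gain(d3, [d1], theta)   # c2 already reached the cap
```

In this toy case the third document contributes nothing, because subtopic c2 was filled to MaxHeight by the first document.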

The Total Gain Function

• Total gain for the list of documents that have been examined
• Note that it does not assume any traversal order
• It does not even assume ranked retrieval
• This allows us to support both ranked and unranked retrieval, or a mix of them
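Written out, the total gain is presumably the sum of the per-document gains over the documents examined. The sequence notation $D$ and the index range below are assumptions made for illustration, not taken from the slide:

```latex
\mathrm{Gain}(Q, D) = \sum_{j=1}^{|D|} \mathrm{Gain}(Q, d_j),
\qquad D = (d_1, \dots, d_{|D|}) \text{ in the order examined}
```

Under this reading, $D$ is simply the sequence in which documents were examined, so it can come from a ranked list, from a Boolean result set read in any order, or from a mix of the two.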

The Cube Test - Recap

• It is a speed function
• The time function is the amount of time taken from the beginning up to the t-th document. It can be:
  - actual reading time
  - a formulation similar to TBG [Smucker & Clarke, SIGIR'12] that takes document length into account: $\sum_{j=1}^{t} \bigl(4.4 + r_i \times (0.018\, l_j + 7.8)\bigr)$
  - or simply the number of documents examined so far
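The time options above can be sketched as follows. This is a hedged illustration, not the authors' code. The function names are hypothetical, the factor written r on the slide is treated as a per-document value, document length is assumed to be in words, and CT is assumed to be total gain divided by time, following the description of CT as a speed function.

```python
from typing import Sequence


def tbg_style_time(lengths: Sequence[float], r: Sequence[float]) -> float:
    """Estimated seconds spent on the examined documents.

    Each document costs 4.4 + r * (0.018 * length + 7.8), the form
    shown on the slide; r is assumed to be given per document.
    """
    if len(lengths) != len(r):
        raise ValueError("lengths and r need one entry per document")
    return sum(4.4 + r_j * (0.018 * l_j + 7.8)
               for l_j, r_j in zip(lengths, r))


def doc_count_time(num_docs: int) -> float:
    """Simplest option: time is the number of documents examined."""
    return float(num_docs)


def cube_test(total_gain: float, time_spent: float) -> float:
    """CT as a speed: gain accumulated per unit of time (assumed form)."""
    if time_spent <= 0:
        raise ValueError("time_spent must be positive")
    return total_gain / time_spent


one_doc = tbg_style_time([1000], [1.0])             # 4.4 + 18 + 7.8
two_docs = tbg_style_time([1000, 500], [1.0, 0.0])  # second adds 4.4 only
ct = cube_test(0.7, doc_count_time(2))
```

Swapping the time function changes only the denominator, so the same gain computation can be reused with any of the three options.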

EXPERIMENTS

Datasets

USPTO
• It consists of three million US patent applications and publications from 2001 to 2013, in XML with images removed.
• We created 33 runs for 49 prior art finding tasks.
• Office actions written by US patent examiners are parsed, and the ground truth is extracted automatically from them (PublicPair).

CLEF-IP 2012
• XML patent documents from the European Patent Office (EPO) prior to 2002 and 400,000+ documents published by the World Intellectual Property Organization (WIPO).
• We evaluate the 31 official runs from 5 teams that participated in CLEF-IP 2012.

Discriminative Power

• We compare the new metric with a few well-known metrics:
  - Recall
  - I-rec [Sakai et al., EVIA'10]
  - nDCG
  - α-nDCG [Clarke et al., SIGIR'08]
  - PRES [Magdy and Jones, SIGIR'10]
  - MAP
  - TBG [Smucker & Clarke, SIGIR'12]
  - nERR-IA [Sakai & Song, SIGIR'11]
• We evaluate the evaluation metrics by their discriminative power [Sakai, SIGIR'06]
• We test a few variations of CT
• On the CLEF-IP dataset, all CT metrics show high discriminative power.
• On the USPTO dataset, Recall and I-rec show the best discriminative power; the CT metrics show good discriminative power.
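Discriminative power is commonly estimated by testing every pair of runs for a significant difference under a metric and reporting the fraction of pairs that differ. The sketch below follows that idea with a paired bootstrap test over per-topic scores. The function names, the number of resamples, and the 0.05 threshold are illustrative assumptions, not the settings used in these experiments.

```python
import math
import random
from itertools import combinations
from typing import Dict, List, Optional


def _t_stat(z: List[float]) -> float:
    """Studentized mean of per-topic differences (needs >= 2 topics)."""
    n = len(z)
    mean = sum(z) / n
    var = sum((v - mean) ** 2 for v in z) / (n - 1)
    if var == 0.0:
        return 0.0 if mean == 0.0 else math.copysign(math.inf, mean)
    return mean / math.sqrt(var / n)


def bootstrap_asl(x: List[float], y: List[float], resamples: int = 1000,
                  rng: Optional[random.Random] = None) -> float:
    """Achieved significance level of a paired bootstrap test."""
    rng = rng or random.Random(0)
    z = [a - b for a, b in zip(x, y)]
    t_obs = abs(_t_stat(z))
    if t_obs == 0.0:
        return 1.0  # no observed difference at all
    mean = sum(z) / len(z)
    centered = [v - mean for v in z]  # enforce the null hypothesis
    hits = 0
    for _ in range(resamples):
        sample = [rng.choice(centered) for _ in centered]
        if abs(_t_stat(sample)) >= t_obs:
            hits += 1
    return hits / resamples


def discriminative_power(scores: Dict[str, List[float]], alpha: float = 0.05,
                         resamples: int = 1000, seed: int = 0) -> float:
    """Fraction of run pairs that differ significantly at level alpha."""
    pairs = list(combinations(sorted(scores), 2))
    if not pairs:
        raise ValueError("need at least two runs")
    rng = random.Random(seed)
    significant = sum(
        1 for a, b in pairs
        if bootstrap_asl(scores[a], scores[b], resamples, rng) < alpha
    )
    return significant / len(pairs)


# Hypothetical per-topic metric scores for three runs over ten topics.
runs = {
    "A": [0.9, 0.8, 0.85, 0.95, 0.9, 0.7, 0.88, 0.92, 0.81, 0.86],
    "B": [0.1, 0.2, 0.15, 0.12, 0.3, 0.25, 0.18, 0.11, 0.2, 0.22],
    "C": [0.1, 0.2, 0.15, 0.12, 0.3, 0.25, 0.18, 0.11, 0.2, 0.22],
}
power = discriminative_power(runs)
```

A metric with higher discriminative power separates more run pairs at the same significance level, which is the sense in which the metrics are compared here.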

Tradeoff between coverage and single relevance

Ø  CT is able to adjust its bias between recall-oriented tasks and precision-oriented tasks

• We create two artificial runs
  - Coverage run: arranges the relevant documents of each subtopic in a round-robin fashion.
  - Single relevance run: puts all relevant documents for one subtopic first, ordered by rel(d, ci), then those for the next subtopic.

CT vs. γ for the coverage run

CT vs. γ for the single relevance run

The novelty discount base γ ranges over [0.1, 0.9]. When γ is small, CT applies a large novelty discount, is biased towards coverage, and gives more reward to runs that spread relevant documents across different subtopics. When γ is large, CT is biased towards precision and gives more reward to runs that return highly relevant documents early.
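The effect of γ can be reproduced on a toy example. The sketch below builds a coverage run and a single relevance run from the same documents and scores the first three documents of each. All data values, the function names, the cutoff of three documents, and the use of document count as the time function are assumptions chosen for illustration; they are not the runs used in the experiments.

```python
from itertools import chain, zip_longest
from typing import Dict, List

Doc = Dict[str, float]  # {subtopic id: relevance grade}, illustrative


def ct_at(run: List[Doc], theta: Dict[str, float], gamma: float,
          t: int, max_height: float = 1.0) -> float:
    """CT over the first t documents, with time = documents examined."""
    height = {c: 0.0 for c in theta}  # accumulated relevance per subtopic
    nrel = {c: 0 for c in theta}      # relevant documents seen per subtopic
    gain = 0.0
    examined = run[:t]
    for doc in examined:
        for c, rel in doc.items():
            if rel <= 0.0:
                continue
            if height[c] < max_height:  # cuboid not yet full
                gain += theta[c] * gamma ** nrel[c] * rel
            height[c] += rel
            nrel[c] += 1
    return gain / len(examined) if examined else 0.0


def coverage_run(by_subtopic: Dict[str, List[Doc]]) -> List[Doc]:
    """Round-robin over subtopics."""
    rounds = zip_longest(*by_subtopic.values())
    return [d for rnd in rounds for d in rnd if d is not None]


def single_relevance_run(by_subtopic: Dict[str, List[Doc]]) -> List[Doc]:
    """All documents of one subtopic (already sorted by rel), then the next."""
    return list(chain.from_iterable(by_subtopic.values()))


THETA = {"c1": 0.6, "c2": 0.2, "c3": 0.2}
BY_SUBTOPIC = {
    "c1": [{"c1": 0.4}, {"c1": 0.3}, {"c1": 0.3}],
    "c2": [{"c2": 0.3}, {"c2": 0.2}, {"c2": 0.1}],
    "c3": [{"c3": 0.3}, {"c3": 0.2}, {"c3": 0.1}],
}

cov = coverage_run(BY_SUBTOPIC)
single = single_relevance_run(BY_SUBTOPIC)
for gamma in (0.1, 0.5, 0.9):
    print(gamma, ct_at(cov, THETA, gamma, 3), ct_at(single, THETA, gamma, 3))
```

In this toy setup the coverage run scores higher at γ = 0.1 and the single relevance run scores higher at γ = 0.9, which matches the bias described above.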


Conclusions

• This paper presents a novel evaluation metric (the Cube Test), based on a novel utility model (the water-filling model)
• It addresses several important dimensions in professional search, and in complicated search in general
  - Covers different aspects or subtopics
  - Subtopics need not be equally important
  - Allows a single document to cover several subtopics
  - Is time-sensitive
  - Handles the stopping criterion: adding more relevant documents to a subtopic that has reached its cap does not improve the overall gain
  - Expresses the tradeoff between time, quality of documents, and diverse coverage of subtopics

Acknowledgments: Portions of this work were conducted to explore new concepts under the umbrella of a larger project at the US Patent and Trademark Office.

THANK YOU


CT Variations
