Large Scale Findability Analysis Shariq Bashir PhD-Candidate Department of Software Technology and Interactive Systems

Post on 20-Dec-2015


Large Scale Findability Analysis

Shariq Bashir, PhD Candidate

Department of Software Technology and Interactive Systems

Agenda

Large Scale Findability Experiments
• Millions of patents used for indexing
• Findability is analyzed using all possible queries of a patent

Findability Analysis
• With different query classes
• Impact of frequent terms on Findability

Issues in Queries used for Findability Analysis

Introduction

Patent retrieval is a recall-oriented domain.
The Findability of each and every patent in the collection is considered an important factor.
There is a need to analyze how many patents are hard or easy to find in the collection.

Findability Measurement:
• The analysis is based on the Findability measurement.
• Findability is a measure in IR used for analyzing how easily a document can be found in a collection.
• It can identify Low and High Findable subsets.
• It can compare retrieval systems and show which one is better at finding patents.
• It can identify system bias, for example whether a system prefers shorter documents over longer ones, or longer over shorter.

Large Scale Patents Findability Experiments

In related Findability experiments, the analysis is usually performed on a random set of queries.

For example, a random set of 200 queries of 2, 3, or 4 terms is taken from each patent.

However, this leaves it unclear whether we are testing the query generation approach or the retrieval system.

Large Scale Experiments:
• Rather than taking random queries, the experiments are performed using all possible queries of a patent.
• We considered all possible 3-term queries (using the AND operator).
• 1 million patents are used for indexing (with full text).
• The TFIDF retrieval model is used for ranking patents against queries.
• A rank cutoff factor of c = 100 is used for the analysis.
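The setup above can be sketched in a few lines. This is only a minimal illustration, assuming a toy in-memory collection and a simple TF-IDF sum; the slides do not specify the exact weighting, and the names `DOCS`, `score`, `retrieve`, and `findability` are hypothetical.

```python
import math
from itertools import combinations

# Toy collection (hypothetical data): document id -> list of terms.
DOCS = {
    "p1": ["valve", "pump", "seal", "rotor"],
    "p2": ["valve", "pump", "seal", "gear"],
    "p3": ["valve", "pump", "seal", "seal", "seal"],
}

def idf(term):
    # Inverse document frequency over the toy collection.
    df = sum(1 for terms in DOCS.values() if term in terms)
    return math.log(1 + len(DOCS) / df)

def score(doc_id, query):
    # Simple TF-IDF sum (an assumption, not the weighting used in the slides).
    terms = DOCS[doc_id]
    return sum(terms.count(t) * idf(t) for t in query)

def retrieve(query):
    # Boolean AND: keep documents containing every query term, then rank them.
    hits = [d for d, terms in DOCS.items() if all(t in terms for t in query)]
    return sorted(hits, key=lambda d: (-score(d, query), d))

def findability(doc_id, c=100, query_len=3):
    # Build every query_len-term query from the document's unique terms and
    # count the queries that rank the document within the top c results.
    unique_terms = sorted(set(DOCS[doc_id]))
    queries = list(combinations(unique_terms, query_len))
    found = sum(1 for q in queries if doc_id in retrieve(q)[:c])
    return found, len(queries)
```

Calling `findability("p1", c=1)` returns the number of queries that find the document together with the total number of queries, so the ratio of the two gives a Findability Percentage.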

Large Scale Patents Findability Experiments

Since the space of all possible queries is very large, we could process only a small number of patents.

A set of Low and High Findable patents is used for the large-scale analysis.
• We take these patents from our previous experiments, which were based on a small random set of queries.

Motivation:
• We want to check whether Low Findable patents are really Low Findable, or whether there is a fault in the query generation approach.

Patents

#    Patent ID (Low Findable)    Patent ID (High Findable)
1    US-4299687-A                US-4175011-A
2    US-4318912-A                US-4085890-A
3    US-4127578-A                US-4002425-A
4    US-4136079-A                US-4154025-A
5    US-4031023-A                US-4052415-A
6    US-4034106-A                US-4110128-A
7    US-4034087-A                US-4166008-A
8    US-4229478-A                US-4067813-A
9    US-4087551-A                US-4009156-A
10   US-4082851-A                US-4147736-A

Findability Results Analysis (Percentage in all Queries)

Limitation of the Numeric Score
• The numeric score alone does not provide an accurate analysis. For example, consider the two patents in the table below.
• By numeric score, Patent A has a larger Findability score than Patent B, but its Findability Percentage is very poor.
• So, in the next slides, the analysis is based on the Findability Percentage over all queries of a patent.
• Moreover, for clarity, the analysis is divided into four query classes. What is the Findability Percentage in those queries
  • which can retrieve < 500 patents.
  • which can retrieve >= 500 & <= 1000 patents.
  • which can retrieve > 1000 & <= 1500 patents.
  • which can retrieve > 1500 patents.

Patent   # Unique Terms   Total Queries   Findability Percentage / Total Queries   Findability Numeric Score
A        578              32 million      1%                                       320,000
B        60               34,220          95%                                      32,509
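The totals in the table agree with the number of 3-term combinations that can be drawn from each patent's unique terms, which makes the contrast easy to check. A small sketch of that arithmetic, where the function names are illustrative only:

```python
from math import comb

def total_queries(unique_terms, query_len=3):
    # Number of distinct query_len-term queries over the unique terms.
    return comb(unique_terms, query_len)

def findability_percentage(numeric_score, unique_terms, query_len=3):
    # Share of all possible queries from which the patent was findable.
    return 100.0 * numeric_score / total_queries(unique_terms, query_len)

# (unique terms, Findability numeric score) taken from the table above.
patents = {"A": (578, 320_000), "B": (60, 32_509)}

for name, (n_terms, numeric_score) in patents.items():
    pct = findability_percentage(numeric_score, n_terms)
    print(name, total_queries(n_terms), round(pct, 1))
```

Patent A's raw score is roughly ten times that of Patent B, yet it is found by about 1 in 100 of its own queries, while Patent B is found by 95 in 100.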

Queries Distribution

[Figure: bar chart. X-axis: # of patents containing all query terms. Y-axis: percentage of all queries. Series: Low Findable Patents, High Findable Patents.]

# of Patents Containing All Query Terms   Low Findable Patents   High Findable Patents
< 500                                      7%                     14%
>= 500 & <= 1000                           8.5%                   13%
> 1000 & <= 1500                           5.5%                   8%
> 1500                                     79%                    65%

A large percentage of queries in both sets (Low and High Findable) can retrieve more than 1500 patents:
• 79% for Low Findable patents.
• 65% for High Findable patents.
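The four query classes used throughout the analysis can be expressed as a simple binning step. The sketch below assumes that the number of patents containing all query terms is already known for each query; `query_class` and `class_distribution` are hypothetical helper names.

```python
from collections import Counter

CLASSES = ["< 500", ">= 500 & <= 1000", "> 1000 & <= 1500", "> 1500"]

def query_class(match_count):
    # Map the number of patents matching all query terms to one of four classes.
    if match_count < 500:
        return CLASSES[0]
    if match_count <= 1000:
        return CLASSES[1]
    if match_count <= 1500:
        return CLASSES[2]
    return CLASSES[3]

def class_distribution(match_counts):
    # Percentage of queries falling into each class.
    counts = Counter(query_class(m) for m in match_counts)
    total = len(match_counts)
    return {c: 100.0 * counts[c] / total for c in CLASSES}
```

For example, `class_distribution([10, 500, 1000, 1001, 1500, 1501, 2000, 90000])` puts one query in the first class, two in each of the middle classes, and three in the last.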

Findability Percentage

[Figure: bar chart of Findability Percentage per patent. X-axis: patents 1 to 20 (Low Findable, then High Findable). Y-axis: Findability Percentage.]

Low Findable patents: average = 3.9%. Out of every 100 queries, a patent is findable from only about 4.

High Findable patents: average = 53.7%. Out of every 100 queries, a patent is findable from about 54.

Findability Distribution in Queries

In which type of queries do patents have a higher Findability Percentage?
• For Low Findable patents, queries that retrieve < 500 patents account for a larger share of the findable queries than the other classes, even though only 7% of all queries fall in this class and 79% of queries match > 1500 patents.

Based on these results, we can draw two important findings:
• First, Low Findable patents have a very poor Findability Percentage (3.9%).
• Second, within that 3.9% of queries, most can retrieve < 500 patents.

[Figure, "Based on Average": bar chart. X-axis: # of patents containing all query terms, in classes (< 500), (>= 500 & <= 1000), (> 1000 & <= 1500), (> 1500). Y-axis: percentage of all findable queries. Series: Low Findable Patents, High Findable Patents. Bar labels as extracted: 18%, 9%, 16%, 57%; 8%, 13%, 28%, 52%.]

[Figure, "Based on Individual Patents": stacked bars for patents 1 to 20 (Low Findable, then High Findable). Y-axis: percentage of all findable queries. Legend: < 500; >= 500 & <= 1000; > 1000 & <= 1500; > 1500.]

Findability Distribution in Queries

For High Findable patents, queries that retrieve > 1500 patents account for a larger share of the findable queries than the other classes (65% of their queries match > 1500 patents).

[Figure: the "Based on Average" and "Based on Individual Patents" charts from the previous slide, repeated.]

Findability Distribution in Queries

Low Findable Patent: US-4299687-A

High Findable Patent: US-4085890-A

Findability Percentage in Different Queries
Queries which can retrieve > 1500 patents.

[Figure: per-patent bars for patents 1 to 20 (Low Findable, then High Findable). Y-axis: percentage. Series: Findability Percentage, Queries Percentage.]

Low Findable patents:
• On average, 79% of queries can retrieve > 1500 patents.
• Among these queries, patents are findable from only 1.1%.
• Out of 100 such queries, a patent is findable from about one.

High Findable patents:
• On average, 65% of queries can retrieve > 1500 patents.
• Among these queries, patents are findable from 49%.
• Out of 100 such queries, a patent is findable from about 49.

Findability Percentage in Different Queries
Queries which can retrieve > 1000 & <= 1500 patents.

[Figure: per-patent bars for patents 1 to 20 (Low Findable, then High Findable). Y-axis: percentage. Series: Findability Percentage, Queries Percentage.]

Low Findable patents:
• On average, 5.5% of queries can retrieve > 1000 & <= 1500 patents.
• Among these queries, patents are findable from only 5.3%.
• Out of 100 such queries, a patent is findable from about 5.

High Findable patents:
• On average, 8% of queries can retrieve > 1000 & <= 1500 patents.
• Among these queries, patents are findable from 67%.
• Out of 100 such queries, a patent is findable from about 67.

Findability Percentage in Different Queries
Queries which can retrieve >= 500 & <= 1000 patents.

[Figure: per-patent bars for patents 1 to 20 (Low Findable, then High Findable). Y-axis: percentage. Series: Findability Percentage, Queries Percentage.]

Low Findable patents:
• On average, 8.5% of queries can retrieve >= 500 & <= 1000 patents.
• Among these queries, patents are findable from only 7%.
• Out of 100 such queries, a patent is findable from about 7.

High Findable patents:
• On average, 13% of queries can retrieve >= 500 & <= 1000 patents.
• Among these queries, patents are findable from 52%.
• Out of 100 such queries, a patent is findable from about 52.

Findability Percentage in Different Queries
Queries which can retrieve < 500 patents.

[Figure: per-patent bars for patents 1 to 20 (Low Findable, then High Findable). Y-axis: percentage. Series: Findability Percentage, Queries Percentage.]

Low Findable patents:
• On average, 7% of queries can retrieve < 500 patents.
• Among these queries, patents are findable from only 16%.
• Out of 100 such queries, a patent is findable from about 16.

High Findable patents:
• On average, 14% of queries can retrieve < 500 patents.
• Among these queries, patents are findable from 65%.
• Out of 100 such queries, a patent is findable from about 65.

Effect of Individual Terms on Findability

A patent contains many unique terms.

Are patents findable from most of their terms, or does a small number of terms have the major impact on Findability?

This factor analyzes the effect of a patent's individual terms on its Findability score.
• Does removing a small percentage of frequent terms from the queries remove a large percentage of the Findability score?
• What is the effect of this factor on Low Findable and High Findable patents?

Effect of Individual Terms on Findability

For Low Findable patents, removing a small percentage of frequent terms decreases Findability much more quickly than for High Findable patents.

[Figure: line chart. X-axis: % Frequent Terms Removed (0% to 60%). Y-axis: Findability Score Decreased. Series: Low Findable Patents, High Findable Patents.]
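One way to read this experiment is sketched below. It rests on assumptions the slides do not confirm: "frequent" is taken to mean a per-term frequency supplied by the caller, and a query stops counting toward the score once any of its terms is removed. The names `score_after_removal` and `percent_decrease` are hypothetical.

```python
def score_after_removal(findable_queries, term_freq, removed_percent):
    # Rank terms from most to least frequent (ties broken alphabetically).
    ranked = sorted(term_freq, key=lambda t: (-term_freq[t], t))
    n_removed = len(ranked) * removed_percent // 100
    removed = set(ranked[:n_removed])
    # Keep only the findable queries that use none of the removed terms.
    kept = [q for q in findable_queries if not removed.intersection(q)]
    return len(kept)

def percent_decrease(findable_queries, term_freq, removed_percent):
    # Drop in the Findability score relative to the score with no terms removed.
    base = len(findable_queries)
    after = score_after_removal(findable_queries, term_freq, removed_percent)
    return 100.0 * (base - after) / base

# Hypothetical example: five terms and four queries that found the patent.
term_freq = {"a": 10, "b": 5, "c": 3, "d": 2, "e": 1}
findable = [("a", "b", "c"), ("a", "d", "e"), ("b", "c", "d"), ("c", "d", "e")]
print(percent_decrease(findable, term_freq, 20))
```

Plotting `percent_decrease` against `removed_percent` for each patent would give a curve of the kind described on this slide.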

Issues in Queries used for Findability Analysis

Analyzing Findability using all possible queries of a patent is very time consuming.
• What about other combinations: 4 terms, 5 terms, 6 terms, ...?
• What about other Boolean operators (OR, NOT)?

How can we prune irrelevant queries?
• Query Performance Prediction, such as the clarity score, may help us in pruning irrelevant queries.
• A query log can help us in building simulated queries.
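To give a sense of why longer queries are a concern, the size of the query space can be computed directly. The sketch below uses the 578 unique terms of Patent A from the earlier example; `query_space` is an illustrative name.

```python
from math import comb

def query_space(unique_terms, max_len=6):
    # Number of distinct k-term queries for k = 3 .. max_len.
    return {k: comb(unique_terms, k) for k in range(3, max_len + 1)}

for k, n in query_space(578).items():
    print(f"{k}-term queries: {n:,}")
```

Each extra query term multiplies the number of queries by a factor in the hundreds for a patent of this size, which is why some form of pruning or sampling would be needed beyond 3-term queries.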