rules of thumb for information acquisition from large and redundant data

Post on 25-Feb-2016

21 Views

Category:

Documents

1 Downloads

Preview:

Click to see full reader

DESCRIPTION

Version April 21, 2011. Rules of Thumb for Information Acquisition from Large and Redundant Data. Wolfgang Gatterbauer. 33 rd European Conference on Information Retrieval (ECIR'11). Database group University of Washington. http://UniqueRecall.com. - PowerPoint PPT Presentation

TRANSCRIPT

Rules of Thumb for Information Acquisition

from Large and Redundant Data

Wolfgang Gatterbauer

http://UniqueRecall.comDatabase groupUniversity of Washington

Version April 21, 2011

33rd European Conference on Information Retrieval (ECIR'11)

Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com

Information Acquisition from Redundant DataPareto principle (80-20 rule) 20% causes 80% effect

e.g. business clients salese.g. software bugs errorse.g. health care patients HC resources

Information acquisition ? 20% data(instances)

? 80% information(concepts)

e.g. words in a corpus all words different wordse.g. used first names individual names different names

e.g. web harvesting web data web information

Motivating question:Can we learn 80% of the information,by looking at only 20% of the data?

Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 3

Information Acquisition from Redundant Data

Au A B Bu"Unique"

informationAvailable

DataRetrieved and extracted data

Acquired information

InformationIntegration

Information Retrieval, Information Extraction

InformationDissemination

Information Acquisition

Redundancydistribution k Recall r Expected sample

distribution k

Expected unique recall ru

Three assumptions• no disambiguity in data• random sampling w/o replacement• very large data sets

Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 4

Outline

• A horizontal sampling model• The role of power-laws• Real data & Discussion

Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com

A Simple Balls-and-Urn Sampling Model

Redundancy Distribution kk=(6,3,3,2,1)

Information i

Redundancy ki

Data

1 2 3 4 5

1

5

6

4

3

2

frequency of i-th most often appearing information (color)

(# colors: au=5)

(# balls: a=15)

Redundancy ki

1 2 3 4 5

1

5

6

4

3

2

Sampled Data(# balls: b=3)

Sampled Information i(# Colors: bu=2)

Recall rr=3/15=0.2

Unique recall ru=2/5=0.4Sample Redundancy

distribution k=(2,1)

5

Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com

A model for sampling in the Limit of large data sets

1 2 3 4 51

56

432

6

1 3 5 7 91

56

432

2 4 6 8 10k=(6,3,3,2,1) k=(6,6,3,3,3,3,2,2,1,1)

1 2 3 4 51

56

432

a=(0.2,0.2,0.4,0,0,0.2)

a=15 a∞a=30

k2-3=3

a3=0.4

vertical perspective horizontal perspectivek3-6=3

a3=0.4 a3=0.4

...

Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com

The Intuition for constant redundancy k

0 1

ru = 1-(1-0.5)3=0.875

ru

1

2

3

k=const=3

7

Unique recall

2

Expected sample distribution

2=(3 choose 2)0.52(1-0.5)1=0.375 Binomial distribution

Indep. sampling with p=r

lima∞

r=0.5

Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com

The Intuition for arbitrary redundancy distributions

a3=0.4

a6= 0.2

a2= 0.2

a1= 0.2

1

2

3

5

6

4

Redundancy k

0.2 0.60 0.8 1

8

ru= a6[ 1-(1-r)6 ] + a3[ 1-(1-r)3 ] + a3[ 1-(1-r)2 ] + a1[ 1-(1-r)1 ]

Stratified sampling

lima∞

r=0.5

2

Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com

A horizontal Perspective for Sampling

9

au=5 au=20

au=100 au=1000

Horizontal layer of redundancy 1=0.8

Expected sampleredundancy k1=3

r=0.5

bu=800

Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 10

Unique Recall for arbitrary redundancy distributions

10

Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 11

Outline

• A horizontal sampling model• The role of power-laws• Real data & Discussion

Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 12

Three formulations of Power law distributions

12

Stumpf et al. [PNAS’05]

Mitzenmacher [Internet Math.’04]

Adamic [TR’00]

Mitzenmacher [IM’04]

redundancy kiredundancy frequency ak

complementarycumulativefrequency k

Zipf-Mandelbrot ParetoPower-law probability Zipf [1932]

Commonly assumed to be different represen-tations of the same distribution

Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com

Unique Recall with Power laws

13 13

For =1log-log plot

Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com

Unique Recall with Power laws

14 14

Rule of Thumb 1: When sampling 20% of data from a Power-law distribution, we expect to learn less than 40% of the information

For =1

Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com

Invariants under Sampling

15 15

Given our model: Which redundancy distribution remains invariant under sampling?

Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 16

Invariant is a power-law! Hence, the tail of allpower laws remains invariant under sampling

16

ruk=k/k

Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com

Also, the power law tail breaks in

17 17

Rule of Thumb 2: When sampling data from a Power-law then the core of the sample distribution follows

Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 18

Outline

• A horizontal sampling model• The role of power-laws• Real data & Discussion

Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 19

Sampling from Real DataIncoming links for one domain

Tag distribution on delicious.com

Rule of Thumb 1:not good!

Rule of Thumb 1:perfect!

Rule of Thumb 2:works, but only for small area

Rule of Thumb 2:works!

Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com

Theory on Sampling from Power-Laws

Stumpf et al. [PNAS’05]

"Here, we show that random subnets sampled from scale-free networks are not themselves scale-free."

This paper

"Here, we show that there is one power-law family that is invariant under sampling, and the core of other power-law function remains invariant under sampling too.

Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com

Some other related workPopulation sampling

Downey et al. [IJCAI’05]

Ipeirotis et al. [Sigmod’06]

Stumpf et al. [PNAS’05]

• urn model to estimate the probability that extracted information is correct.

• random sampling with replacement

• show that random subnets sampled from scale-free networks are not scale-free

• decision framework to search or to crawl• random sampling without replacement with

known population sizes

• goal: estimate size of population• e.g. mark-recapture animal sampling• sampling of small fraction w/ replacement

Bar-Yossef, Gurevich [WWW’06]

• biased methods to sample from search engine's index to estimate index size

Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 22

Summary 1/2 Au A B Bu

Information AvailableData

Retrieved Data

Acquired information

Inf. IntegrationIR & IEInf. Dissemination

Recall rUnique recall ru

• A simple model ofinformation acquisition from redundant data

• Full analytic solution

Inf. Acquisition

- no disambiguity- random sampling

w/o replacementRedundancy

distribution k

- large data

Normalized Redundancy

distribution a

ru(r)

ruk(r)

ruk=k/k

Sampledistribution k

Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 23

Summary 2/2

• Rule of thumb 1:

• Rule of thumb 2:

- 80/20 40/20

- power-law coreremains invariant

Unique recall for power-laws3 different power-laws

Sampling from power-lawsInvariant distribution

ruk r

- sensitive to exact power-law root

http://uniqueRecall.com

24

backup

Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com

Geometric interpretation of k(, k, r)

25

BACKUP

Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 26

Information Acquisition from Redundant Data3 pieces of data, containing 2 pieces of (“unique”) information*

Capurro, Hjørland [ARIST’03]

e.g. words of a corpus:word appearances / vocabulary

e.g. used first names in a groupindividual names / different names

e.g. web harvesting:web data / web information

Motivating question:Can we learn 80% of the information,by looking at only 20% of the data?

Data(instances)

Information(concepts)

*data interpreted as redundant representation of information

top related