rules of thumb for information acquisition from large and redundant data

Rules of Thumb for Information Acquisition

from Large and Redundant Data

Wolfgang Gatterbauer

http://UniqueRecall.comDatabase groupUniversity of Washington

Version April 21, 2011

33rd European Conference on Information Retrieval (ECIR'11)

Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com

Information Acquisition from Redundant DataPareto principle (80-20 rule) 20% causes 80% effect

e.g. business clients salese.g. software bugs errorse.g. health care patients HC resources

Information acquisition ? 20% data(instances)

? 80% information(concepts)

e.g. words in a corpus all words different wordse.g. used first names individual names different names

e.g. web harvesting web data web information

Motivating question:Can we learn 80% of the information,by looking at only 20% of the data?

Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 3

Information Acquisition from Redundant Data

Au A B Bu"Unique"

informationAvailable

DataRetrieved and extracted data

Acquired information

InformationIntegration

Information Retrieval, Information Extraction

InformationDissemination

Information Acquisition

Redundancydistribution k Recall r Expected sample

distribution k

Expected unique recall ru

Three assumptions• no disambiguity in data• random sampling w/o replacement• very large data sets

Outline

• A horizontal sampling model• The role of power-laws• Real data & Discussion

A Simple Balls-and-Urn Sampling Model

Redundancy Distribution kk=(6,3,3,2,1)

Information i

Redundancy ki

1 2 3 4 5

frequency of i-th most often appearing information (color)

(# colors: au=5)

(# balls: a=15)

Redundancy ki

1 2 3 4 5

Sampled Data(# balls: b=3)

Sampled Information i(# Colors: bu=2)

Recall rr=3/15=0.2

Unique recall ru=2/5=0.4Sample Redundancy

distribution k=(2,1)

A model for sampling in the Limit of large data sets

1 2 3 4 51

1 3 5 7 91

2 4 6 8 10k=(6,3,3,2,1) k=(6,6,3,3,3,3,2,2,1,1)

1 2 3 4 51

a=(0.2,0.2,0.4,0,0,0.2)

a=15 a∞a=30

k2-3=3

a3=0.4

vertical perspective horizontal perspectivek3-6=3

a3=0.4 a3=0.4

The Intuition for constant redundancy k

ru = 1-(1-0.5)3=0.875

k=const=3

Unique recall

Expected sample distribution

2=(3 choose 2)0.52(1-0.5)1=0.375 Binomial distribution

Indep. sampling with p=r

lima∞

The Intuition for arbitrary redundancy distributions

a3=0.4

a6= 0.2

a2= 0.2

a1= 0.2

Redundancy k

0.2 0.60 0.8 1

ru= a6[ 1-(1-r)6 ] + a3[ 1-(1-r)3 ] + a3[ 1-(1-r)2 ] + a1[ 1-(1-r)1 ]

Stratified sampling

lima∞

A horizontal Perspective for Sampling

au=5 au=20

au=100 au=1000

Horizontal layer of redundancy 1=0.8

Expected sampleredundancy k1=3

bu=800

Unique Recall for arbitrary redundancy distributions

Outline

Three formulations of Power law distributions

Stumpf et al. [PNAS’05]

Mitzenmacher [Internet Math.’04]

Adamic [TR’00]

Mitzenmacher [IM’04]

redundancy kiredundancy frequency ak

complementarycumulativefrequency k

Zipf-Mandelbrot ParetoPower-law probability Zipf [1932]

Commonly assumed to be different represen-tations of the same distribution

Unique Recall with Power laws

For =1log-log plot

Unique Recall with Power laws

Rule of Thumb 1: When sampling 20% of data from a Power-law distribution, we expect to learn less than 40% of the information

For =1

Invariants under Sampling

Given our model: Which redundancy distribution remains invariant under sampling?

Invariant is a power-law! Hence, the tail of allpower laws remains invariant under sampling

ruk=k/k

Also, the power law tail breaks in

Rule of Thumb 2: When sampling data from a Power-law then the core of the sample distribution follows

Outline

Sampling from Real DataIncoming links for one domain

Tag distribution on delicious.com

Rule of Thumb 1:not good!

Rule of Thumb 1:perfect!

Rule of Thumb 2:works, but only for small area

Rule of Thumb 2:works!

Theory on Sampling from Power-Laws

"Here, we show that random subnets sampled from scale-free networks are not themselves scale-free."

This paper

"Here, we show that there is one power-law family that is invariant under sampling, and the core of other power-law function remains invariant under sampling too.

Some other related workPopulation sampling

Downey et al. [IJCAI’05]

Ipeirotis et al. [Sigmod’06]

• urn model to estimate the probability that extracted information is correct.

• random sampling with replacement

• show that random subnets sampled from scale-free networks are not scale-free

• decision framework to search or to crawl• random sampling without replacement with

known population sizes

• goal: estimate size of population• e.g. mark-recapture animal sampling• sampling of small fraction w/ replacement

Bar-Yossef, Gurevich [WWW’06]

• biased methods to sample from search engine's index to estimate index size

Summary 1/2 Au A B Bu

Information AvailableData

Retrieved Data

Acquired information

Inf. IntegrationIR & IEInf. Dissemination

Recall rUnique recall ru

• A simple model ofinformation acquisition from redundant data

• Full analytic solution

Inf. Acquisition

- no disambiguity- random sampling

w/o replacementRedundancy

distribution k

- large data

Normalized Redundancy

distribution a

ruk(r)

ruk=k/k

Sampledistribution k

Summary 2/2

• Rule of thumb 1:

• Rule of thumb 2:

- 80/20 40/20

- power-law coreremains invariant

Unique recall for power-laws3 different power-laws

Sampling from power-lawsInvariant distribution

- sensitive to exact power-law root

http://uniqueRecall.com

backup

Geometric interpretation of k(, k, r)

BACKUP

Information Acquisition from Redundant Data3 pieces of data, containing 2 pieces of (“unique”) information*

Capurro, Hjørland [ARIST’03]

e.g. words of a corpus:word appearances / vocabulary

e.g. used first names in a groupindividual names / different names

e.g. web harvesting:web data / web information

Motivating question:Can we learn 80% of the information,by looking at only 20% of the data?

Data(instances)

Information(concepts)

*data interpreted as redundant representation of information

rules of thumb for information acquisition from large and redundant data

redundant data rules

redundant data http

information color

com3information acquisition

sampled information

redundant datapareto

sample redundancy distribution

horizontal layer of

Documents

parallel redundant katalog - deutsche messe...

the acquisition and teaching of the spanish subjunctive...

redundant - manual

guidelines for capturing palmprints and supplementals ·...

thumb rules1

thumb rule

thumb tape

thumb game

redundant trusses

green thumb

thumb injuries jointsjoints metacarpophalangeal...

rules of thumb for information acquisition from large and...

picture books thumb thumb books online catalogue.pdfpicture...

thumb hypoplesia

exploiting sparsity in time-of-ﬂight range acquisition...

optimization of industrial freeze drying cycle - two case...

thumb - approach - dorsoulnar to mcp joint of thumb - ao...

thumb nailing

thumb drive

left thumb over right thumb jjc