Rules of Thumb for Information Acquisition from Large and Redundant Data
DESCRIPTION
Version April 21, 2011. Rules of Thumb for Information Acquisition from Large and Redundant Data. Wolfgang Gatterbauer. 33rd European Conference on Information Retrieval (ECIR'11). Database group, University of Washington. http://UniqueRecall.com
![Page 1: Rules of Thumb for Information Acquisition from Large and Redundant Data](https://reader035.vdocuments.us/reader035/viewer/2022070501/5681693d550346895de0b535/html5/thumbnails/1.jpg)
Rules of Thumb for Information Acquisition
from Large and Redundant Data
Wolfgang Gatterbauer
http://UniqueRecall.com
Database group
University of Washington
Version April 21, 2011
33rd European Conference on Information Retrieval (ECIR'11)
![Page 2: Rules of Thumb for Information Acquisition from Large and Redundant Data](https://reader035.vdocuments.us/reader035/viewer/2022070501/5681693d550346895de0b535/html5/thumbnails/2.jpg)
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com
Information Acquisition from Redundant Data
Pareto principle (80-20 rule): 20% of causes → 80% of effects
e.g. business: clients → sales
e.g. software: bugs → errors
e.g. health care: patients → HC resources
Information acquisition: 20% of data (instances) → ? 80% of information (concepts)
e.g. words in a corpus: all words → different words
e.g. used first names: individual names → different names
e.g. web harvesting: web data → web information
Motivating question: Can we learn 80% of the information by looking at only 20% of the data?
![Page 3: Rules of Thumb for Information Acquisition from Large and Redundant Data](https://reader035.vdocuments.us/reader035/viewer/2022070501/5681693d550346895de0b535/html5/thumbnails/3.jpg)
Information Acquisition from Redundant Data
[Figure: pipeline $A_u$ → $A$ → $B$ → $B_u$. $A_u$: "unique" information; $A$: available data; $B$: retrieved and extracted data; $B_u$: acquired information. Stage labels: information dissemination; information retrieval and information extraction; information integration. The overall process is labeled information acquisition.]
Quantities: redundancy distribution $k$, recall $r$, expected sample distribution $k$, expected unique recall $r_u$
Three assumptions
• no ambiguity in the data
• random sampling w/o replacement
• very large data sets
![Page 4: Rules of Thumb for Information Acquisition from Large and Redundant Data](https://reader035.vdocuments.us/reader035/viewer/2022070501/5681693d550346895de0b535/html5/thumbnails/4.jpg)
Outline
• A horizontal sampling model
• The role of power-laws
• Real data & Discussion
![Page 5: Rules of Thumb for Information Acquisition from Large and Redundant Data](https://reader035.vdocuments.us/reader035/viewer/2022070501/5681693d550346895de0b535/html5/thumbnails/5.jpg)
A Simple Balls-and-Urn Sampling Model
[Figure: bar charts of redundancy $k_i$ over information $i$, for the full data and for the sample.]
Redundancy distribution: $k = (6, 3, 3, 2, 1)$
Redundancy $k_i$: frequency of the $i$-th most often appearing information (color)
Information $i$ (# colors: $a_u = 5$)
Data (# balls: $a = 15$)
Sampled data (# balls: $b = 3$)
Sampled information (# colors: $b_u = 2$)
Recall: $r = 3/15 = 0.2$
Unique recall: $r_u = 2/5 = 0.4$
Sample redundancy distribution: $k = (2, 1)$
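The numbers on this slide can be reproduced with a small sketch. The helper names `sample_stats` and `expected_unique_recall` and the concrete draw `[1, 1, 2]` are illustrative assumptions, not taken from the slides; the second helper uses the standard hypergeometric argument for drawing without replacement.

```python
from collections import Counter
from math import comb

def sample_stats(urn, sample):
    """Recall, unique recall, and sample redundancy distribution of one sample."""
    recall = len(sample) / len(urn)
    unique_recall = len(set(sample)) / len(set(urn))
    sample_redundancy = tuple(sorted(Counter(sample).values(), reverse=True))
    return recall, unique_recall, sample_redundancy

def expected_unique_recall(k, b):
    """Exact expected unique recall when drawing b balls without replacement."""
    a = sum(k)
    # A color with k_i balls is missed with probability C(a - k_i, b) / C(a, b).
    missed = sum(comb(a - ki, b) / comb(a, b) for ki in k)
    return 1 - missed / len(k)

k = (6, 3, 3, 2, 1)
# Build the urn: color i appears k_i times.
urn = [color for color, ki in enumerate(k, start=1) for _ in range(ki)]
sample = [1, 1, 2]  # hypothetical draw: two balls of color 1, one of color 2

print(sample_stats(urn, sample))       # recall, unique recall, sample distribution
print(expected_unique_recall(k, b=3))  # average over all possible draws of 3 balls
```

The single draw gives the slide's values; the expectation over all draws of three balls is somewhat higher than 0.4 because this particular draw happens to hit one color twice.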
![Page 6: Rules of Thumb for Information Acquisition from Large and Redundant Data](https://reader035.vdocuments.us/reader035/viewer/2022070501/5681693d550346895de0b535/html5/thumbnails/6.jpg)
A model for sampling in the Limit of large data sets
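[Figure: redundancy $k_i$ over information $i$ for growing data sets, shown from a vertical and a horizontal perspective.]
$a = 15$: $k = (6, 3, 3, 2, 1)$
$a = 30$: $k = (6, 6, 3, 3, 3, 3, 2, 2, 1, 1)$
$a \to \infty$: normalized redundancy distribution $\alpha = (0.2, 0.2, 0.4, 0, 0, 0.2)$
Vertical perspective: $k_{2-3} = 3$ for $a = 15$, $k_{3-6} = 3$ for $a = 30$
Horizontal perspective: $\alpha_3 = 0.4$ in each case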
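A minimal sketch of the normalization step, assuming that entry $j$ of the normalized distribution is the fraction of information (colors) with redundancy $j$; the function name `normalized_redundancy` is hypothetical.

```python
from collections import Counter

def normalized_redundancy(k):
    """Return [alpha_1, ..., alpha_kmax]: fraction of information with redundancy j."""
    counts = Counter(k)
    return [counts[j] / len(k) for j in range(1, max(k) + 1)]

small = (6, 3, 3, 2, 1)                     # a = 15
doubled = (6, 6, 3, 3, 3, 3, 2, 2, 1, 1)    # a = 30

print(normalized_redundancy(small))
print(normalized_redundancy(doubled))
```

Both redundancy distributions map to the same normalized distribution, which is why the horizontal view stays meaningful as the data set grows.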
![Page 7: Rules of Thumb for Information Acquisition from Large and Redundant Data](https://reader035.vdocuments.us/reader035/viewer/2022070501/5681693d550346895de0b535/html5/thumbnails/7.jpg)
The Intuition for constant redundancy k
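[Figure: sampling from information with constant redundancy $k = \text{const} = 3$ at recall $r = 0.5$.]
In the limit $a \to \infty$: independent sampling with $p = r$
Unique recall: $r_u = 1 - (1 - 0.5)^3 = 0.875$
Expected sample distribution (binomial distribution): fraction with sample redundancy 2 $= \binom{3}{2}\, 0.5^2\, (1 - 0.5)^1 = 0.375$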
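The constant-redundancy case can be checked with a few lines, assuming (as the slide states for the large-data limit) that each of the $k$ copies is sampled independently with probability $p = r$. The function name is illustrative.

```python
from math import comb

def sample_distribution(k, r):
    """Binomial probabilities that j of the k copies end up in the sample, j = 0..k."""
    return [comb(k, j) * r**j * (1 - r) ** (k - j) for j in range(k + 1)]

dist = sample_distribution(k=3, r=0.5)
unique_recall = 1 - dist[0]  # at least one of the k copies was sampled

print(dist)           # probabilities for sample redundancy 0, 1, 2, 3
print(unique_recall)
```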
![Page 8: Rules of Thumb for Information Acquisition from Large and Redundant Data](https://reader035.vdocuments.us/reader035/viewer/2022070501/5681693d550346895de0b535/html5/thumbnails/8.jpg)
The Intuition for arbitrary redundancy distributions
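[Figure: horizontal layers of the normalized redundancy distribution, with redundancy $k$ from 1 to 6 on one axis and the fraction of information from 0 to 1 on the other.]
$\alpha_6 = 0.2$, $\alpha_3 = 0.4$, $\alpha_2 = 0.2$, $\alpha_1 = 0.2$
Stratified sampling in the limit $a \to \infty$, with $r = 0.5$:
$r_u = \alpha_6 [1 - (1-r)^6] + \alpha_3 [1 - (1-r)^3] + \alpha_2 [1 - (1-r)^2] + \alpha_1 [1 - (1-r)^1]$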
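The stratified sum can be evaluated directly. This is a sketch under the slide's limit assumption (independent sampling with probability $r$ per copy); the list layout, with index $k-1$ holding the fraction of information with redundancy $k$, is a choice made here for illustration.

```python
def unique_recall(alpha, r):
    """Expected unique recall; alpha[k-1] is the fraction of information with redundancy k."""
    # Each stratum of redundancy k is seen with probability 1 - (1 - r)**k.
    return sum(a_k * (1 - (1 - r) ** k) for k, a_k in enumerate(alpha, start=1))

alpha = [0.2, 0.2, 0.4, 0.0, 0.0, 0.2]  # fractions for redundancy 1..6
print(unique_recall(alpha, r=0.5))
```

For the example distribution and $r = 0.5$ this evaluates to about 0.797, so roughly 80% of the information is expected in the sample.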
![Page 9: Rules of Thumb for Information Acquisition from Large and Redundant Data](https://reader035.vdocuments.us/reader035/viewer/2022070501/5681693d550346895de0b535/html5/thumbnails/9.jpg)
A horizontal Perspective for Sampling
[Figure: the same redundancy distribution shown for $a_u = 5$, $a_u = 20$, $a_u = 100$, and $a_u = 1000$.]
$r = 0.5$
Horizontal layer of redundancy 1 = 0.8
Expected sample redundancy $k_1 = 3$
$b_u = 800$
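To see how quickly finite data sets approach the limit, the following sketch compares the exact expectation for sampling without replacement with the limit value. It assumes the pattern $(6, 3, 3, 2, 1)$ is simply repeated to reach each $a_u$; the case $a_u = 5$ is skipped because $a = 15$ cannot be halved exactly. Function names are hypothetical.

```python
from math import comb

def exact_unique_recall(k, b):
    """Exact expected unique recall for b draws without replacement."""
    a = sum(k)
    total = comb(a, b)
    return 1 - sum(comb(a - ki, b) / total for ki in k) / len(k)

def limit_unique_recall(k, r):
    """Large-data limit: every copy is sampled independently with probability r."""
    return 1 - sum((1 - r) ** ki for ki in k) / len(k)

base = (6, 3, 3, 2, 1)
for copies in (4, 20, 200):  # a_u = 20, 100, 1000
    k = base * copies
    a_u, a = len(k), sum(k)
    print(a_u, exact_unique_recall(k, b=a // 2))
print("limit", limit_unique_recall(base, r=0.5))
```

With $a_u = 1000$ the expected number of distinct pieces of information is close to 800, in line with the value on the slide.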
![Page 10: Rules of Thumb for Information Acquisition from Large and Redundant Data](https://reader035.vdocuments.us/reader035/viewer/2022070501/5681693d550346895de0b535/html5/thumbnails/10.jpg)
Unique Recall for arbitrary redundancy distributions
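Generalizing the stratified-sampling sum from the worked example, the limit formula can be sketched as follows, assuming $\alpha_k$ denotes the fraction of information with redundancy $k$ and $k_{\max}$ (a name introduced here) the largest redundancy in the data:

$$
r_u(r) \;=\; \sum_{k=1}^{k_{\max}} \alpha_k \left[ 1 - (1-r)^k \right],
\qquad \sum_{k=1}^{k_{\max}} \alpha_k = 1 .
$$

Each term is the share of information with redundancy $k$ times the probability that at least one of its $k$ copies is sampled.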
![Page 11: Rules of Thumb for Information Acquisition from Large and Redundant Data](https://reader035.vdocuments.us/reader035/viewer/2022070501/5681693d550346895de0b535/html5/thumbnails/11.jpg)
Outline
• A horizontal sampling model
• The role of power-laws
• Real data & Discussion
![Page 12: Rules of Thumb for Information Acquisition from Large and Redundant Data](https://reader035.vdocuments.us/reader035/viewer/2022070501/5681693d550346895de0b535/html5/thumbnails/12.jpg)
Three formulations of Power law distributions
Three formulations, commonly assumed to be different representations of the same distribution:
• Zipf-Mandelbrot: redundancy $k_i$
• Power-law probability: redundancy frequency $\alpha_k$
• Pareto: complementary cumulative frequency
References on the slide: Stumpf et al. [PNAS’05]; Mitzenmacher [Internet Math.’04]; Adamic [TR’00]; Mitzenmacher [IM’04]; Zipf [1932]
![Page 13: Rules of Thumb for Information Acquisition from Large and Redundant Data](https://reader035.vdocuments.us/reader035/viewer/2022070501/5681693d550346895de0b535/html5/thumbnails/13.jpg)
Unique Recall with Power laws
For =1
[Figure: log-log plot]
![Page 14: Rules of Thumb for Information Acquisition from Large and Redundant Data](https://reader035.vdocuments.us/reader035/viewer/2022070501/5681693d550346895de0b535/html5/thumbnails/14.jpg)
Unique Recall with Power laws
Rule of Thumb 1: When sampling 20% of data from a Power-law distribution, we expect to learn less than 40% of the information
For =1
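A rough numerical check of Rule of Thumb 1 is sketched below. It assumes a redundancy frequency $\alpha_k \propto k^{-\beta}$ with $\beta = 2$, truncated at `k_max`, and independent sampling with probability $r$ per copy. The exponent, the truncation, and the choice of this particular power-law formulation are illustrative assumptions and are not taken from the slides.

```python
def power_law_alpha(beta, k_max):
    """Redundancy frequencies alpha_k proportional to k**(-beta), truncated at k_max."""
    weights = [k ** -beta for k in range(1, k_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def unique_recall(alpha, r):
    """Expected unique recall; alpha[k-1] is the fraction of information with redundancy k."""
    return sum(a_k * (1 - (1 - r) ** k) for k, a_k in enumerate(alpha, start=1))

alpha = power_law_alpha(beta=2.0, k_max=100_000)  # assumed exponent and cutoff
print(unique_recall(alpha, r=0.2))
```

Under these assumptions, sampling 20% of the data yields roughly 35% of the information, which is consistent with the "less than 40%" rule. Other exponents or formulations give different values.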
![Page 15: Rules of Thumb for Information Acquisition from Large and Redundant Data](https://reader035.vdocuments.us/reader035/viewer/2022070501/5681693d550346895de0b535/html5/thumbnails/15.jpg)
Invariants under Sampling
Given our model: Which redundancy distribution remains invariant under sampling?
![Page 16: Rules of Thumb for Information Acquisition from Large and Redundant Data](https://reader035.vdocuments.us/reader035/viewer/2022070501/5681693d550346895de0b535/html5/thumbnails/16.jpg)
Invariant is a power-law! Hence, the tail of all power laws remains invariant under sampling
ruk=k/k
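The tail claim can be probed numerically. The sketch below assumes a redundancy frequency $\alpha_k \propto k^{-\beta}$ with $\beta = 2$ and independent sampling with probability $r$ per copy, so that the sample redundancy of a piece of information with $k$ copies is binomial. It does not reproduce the invariant distribution from the slides; all names and parameter values are hypothetical.

```python
from math import exp, lgamma, log

def log_binom_pmf(k, j, r):
    """Log-probability that j of k copies are sampled, each independently with probability r."""
    return (lgamma(k + 1) - lgamma(j + 1) - lgamma(k - j + 1)
            + j * log(r) + (k - j) * log(1 - r))

def sample_frequency(j, r, beta, k_max):
    """Unnormalized share of information with sample redundancy j, for alpha_k ~ k**(-beta)."""
    return sum(k ** -beta * exp(log_binom_pmf(k, j, r)) for k in range(j, k_max + 1))

r, beta, k_max = 0.5, 2.0, 5000  # assumed recall, exponent, and cutoff
ratio = sample_frequency(50, r, beta, k_max) / sample_frequency(100, r, beta, k_max)
print(ratio, 2 ** beta)
```

If the sample distribution keeps the exponent $\beta$ in its tail, doubling the sample redundancy should reduce the frequency by about a factor of $2^\beta$; the computed ratio is close to that value for these parameters.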
![Page 17: Rules of Thumb for Information Acquisition from Large and Redundant Data](https://reader035.vdocuments.us/reader035/viewer/2022070501/5681693d550346895de0b535/html5/thumbnails/17.jpg)
Also, the power law tail breaks in
Rule of Thumb 2: When sampling data from a Power-law then the core of the sample distribution follows
![Page 18: Rules of Thumb for Information Acquisition from Large and Redundant Data](https://reader035.vdocuments.us/reader035/viewer/2022070501/5681693d550346895de0b535/html5/thumbnails/18.jpg)
Outline
• A horizontal sampling model
• The role of power-laws
• Real data & Discussion
![Page 19: Rules of Thumb for Information Acquisition from Large and Redundant Data](https://reader035.vdocuments.us/reader035/viewer/2022070501/5681693d550346895de0b535/html5/thumbnails/19.jpg)
Sampling from Real Data
Incoming links for one domain
Tag distribution on delicious.com
Rule of Thumb 1: not good!
Rule of Thumb 1: perfect!
Rule of Thumb 2: works, but only for small area
Rule of Thumb 2: works!
![Page 20: Rules of Thumb for Information Acquisition from Large and Redundant Data](https://reader035.vdocuments.us/reader035/viewer/2022070501/5681693d550346895de0b535/html5/thumbnails/20.jpg)
Theory on Sampling from Power-Laws
Stumpf et al. [PNAS’05]
"Here, we show that random subnets sampled from scale-free networks are not themselves scale-free."
This paper
"Here, we show that there is one power-law family that is invariant under sampling, and the core of other power-law functions remains invariant under sampling too."
![Page 21: Rules of Thumb for Information Acquisition from Large and Redundant Data](https://reader035.vdocuments.us/reader035/viewer/2022070501/5681693d550346895de0b535/html5/thumbnails/21.jpg)
Some other related work
Population sampling
• goal: estimate size of population
• e.g. mark-recapture animal sampling
• sampling of small fraction w/ replacement
Downey et al. [IJCAI’05]
• urn model to estimate the probability that extracted information is correct
• random sampling with replacement
Ipeirotis et al. [Sigmod’06]
• decision framework to search or to crawl
• random sampling without replacement with known population sizes
Stumpf et al. [PNAS’05]
• show that random subnets sampled from scale-free networks are not scale-free
Bar-Yossef, Gurevich [WWW’06]
• biased methods to sample from search engine's index to estimate index size
![Page 22: Rules of Thumb for Information Acquisition from Large and Redundant Data](https://reader035.vdocuments.us/reader035/viewer/2022070501/5681693d550346895de0b535/html5/thumbnails/22.jpg)
Summary 1/2
• A simple model of information acquisition from redundant data
- no ambiguity
- random sampling w/o replacement
- large data
• Full analytic solution
[Figure: pipeline $A_u$ → $A$ → $B$ → $B_u$. $A_u$: information; $A$: available data; $B$: retrieved data; $B_u$: acquired information. Stage labels: inf. dissemination; IR & IE; inf. integration. The overall process is inf. acquisition, with recall $r$ and unique recall $r_u$.]
[Figure: redundancy distribution $k$, normalized redundancy distribution $\alpha$, and sample distribution $k$, with labels $r_u(r)$, $r_{uk}(r)$, and ruk=k/k.]
![Page 23: Rules of Thumb for Information Acquisition from Large and Redundant Data](https://reader035.vdocuments.us/reader035/viewer/2022070501/5681693d550346895de0b535/html5/thumbnails/23.jpg)
Summary 2/2
Unique recall for power-laws (3 different power-laws)
• Rule of thumb 1:
- 80/20 → 40/20
- sensitive to exact power-law root
Sampling from power-laws (invariant distribution)
• Rule of thumb 2:
- power-law core remains invariant
ruk r
http://uniqueRecall.com
![Page 24: Rules of Thumb for Information Acquisition from Large and Redundant Data](https://reader035.vdocuments.us/reader035/viewer/2022070501/5681693d550346895de0b535/html5/thumbnails/24.jpg)
backup
![Page 25: Rules of Thumb for Information Acquisition from Large and Redundant Data](https://reader035.vdocuments.us/reader035/viewer/2022070501/5681693d550346895de0b535/html5/thumbnails/25.jpg)
Geometric interpretation of k(, k, r)
BACKUP
![Page 26: Rules of Thumb for Information Acquisition from Large and Redundant Data](https://reader035.vdocuments.us/reader035/viewer/2022070501/5681693d550346895de0b535/html5/thumbnails/26.jpg)
Information Acquisition from Redundant Data
3 pieces of data, containing 2 pieces of (“unique”) information*
Data (instances) / Information (concepts)
e.g. words of a corpus: word appearances / vocabulary
e.g. used first names in a group: individual names / different names
e.g. web harvesting: web data / web information
Motivating question: Can we learn 80% of the information by looking at only 20% of the data?
*data interpreted as redundant representation of information
Capurro, Hjørland [ARIST’03]