Profiling Web Archive Coverage for Top-Level Domain & Content Language. Ahmed AlSum, Michele C. Weigle, Michael L. Nelson, and Herbert Van de Sompel. International Conference on Theory and Practice of Digital Libraries, September 22-26, 2013, Valletta, Malta.


Page 1:

Profiling Web Archive Coverage for Top-Level Domain & Content Language

Ahmed AlSum, Michele C. Weigle, Michael L. Nelson, and Herbert Van de Sompel

International Conference on Theory and Practice of Digital Libraries
September 22-26, 2013
Valletta, Malta


Page 3:

Where can you find archived copies of this page?

http://www.google.com/


Page 5:

Where can you find archived copies of this page?

http://www.japantimes.co.jp/


Page 7:

Research Question

Problem
• We need to profile the web archives around the world by these characteristics:
  o Age
  o Top-level domains
  o Languages
  o Growth rate

Goal
• To optimize query routing for the Memento Aggregator.
• To determine the missing parts of the web.

Page 8:

Web Archives under this Experiment

Archive                        Full text   URI-lookup
Internet Archive                               √
Library of Congress                            √
Icelandic Web Archive                          √
Library and Archives Canada       √            √
British Library                   √            √
UK National Library               √            √
Portuguese Web Archive            √            √
Web Archive of Catalonia          √            √
Croatian Web Archive              √            √
Archive of the Czech Web          √            √
National Taiwan University        √            √
Archive It                        √            √

Page 9:

Experiment
• Sampling from different sources
• Retrieve the TimeMap from each archive
• Analyze the TimeMaps
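
The "analyze the TimeMaps" step can be sketched as follows. This is a minimal illustration, not the authors' tooling: it assumes the TimeMap is serialized in the Memento link format, and the sample TimeMap, its URIs, and the function names are all made up for the example.

```python
import re
from collections import Counter
from email.utils import parsedate_to_datetime

# Hypothetical TimeMap in link format; URIs and datetimes are invented.
SAMPLE_TIMEMAP = (
    '<http://example.com/>; rel="original",\n'
    '<http://archive.example.org/20090101000000/http://example.com/>; '
    'rel="first memento"; datetime="Thu, 01 Jan 2009 00:00:00 GMT",\n'
    '<http://archive.example.org/20120601000000/http://example.com/>; '
    'rel="last memento"; datetime="Fri, 01 Jun 2012 00:00:00 GMT"\n'
)

# One link entry: <uri> followed by zero or more ; key="value" attributes.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_timemap(text):
    """Return one dict per link entry, with the URI and its attributes."""
    entries = []
    for uri, attr_text in LINK_RE.findall(text):
        attrs = dict(ATTR_RE.findall(attr_text))
        entries.append({"uri": uri, **attrs})
    return entries


def mementos_per_year(entries):
    """Count entries whose rel contains 'memento', grouped by capture year."""
    years = Counter()
    for entry in entries:
        is_memento = "memento" in entry.get("rel", "").split()
        if is_memento and "datetime" in entry:
            years[parsedate_to_datetime(entry["datetime"]).year] += 1
    return years


entries = parse_timemap(SAMPLE_TIMEMAP)
print(len(entries), dict(mementos_per_year(entries)))
```

A URI counts as covered by an archive when its TimeMap holds at least one memento; the per-year counts are one possible input for age and growth-rate profiles.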

Page 10:

URIs Samples Sources

Web
1. DMOZ – Random sample
2. DMOZ – TLD: 2% of each TLD from DMOZ (.com, .org, .jp, etc.; 52 TLDs)
3. DMOZ – Languages: 100 URIs for each language (24 languages)

Web Archives
4. Top 1-grams from Bing
5. Top 1000 query terms by Yahoo in 9 languages

User requests
6. IA Wayback Machine log files
7. Memento aggregator log files

* We used hostnames only
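
Reducing each sampled URI to its hostname, and reading the top-level domain off that hostname, might look like the sketch below. The function names are hypothetical, and the TLD is taken simply as the last dot-separated label of the hostname, which is an assumption about how the samples were grouped.

```python
from urllib.parse import urlsplit


def hostname_of(uri):
    """Return the lowercased hostname of a URI, or None if it has none."""
    return urlsplit(uri).hostname


def tld_of(uri):
    """Return the last label of the hostname (e.g. 'jp'), or None."""
    host = hostname_of(uri)
    if not host:
        return None
    host = host.rstrip(".")  # tolerate a trailing root dot
    if "." not in host:
        return None
    return host.rsplit(".", 1)[-1]


print(hostname_of("http://www.japantimes.co.jp/news/"), tld_of("http://www.google.com/"))
```

Note that `urlsplit` only finds a hostname when the URI carries a scheme such as `http://`, so bare hostnames from log files would need the scheme added first.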

Page 11:

URIs Samples Sources: Web

1. DMOZ – Random sample
   o 10,000 URIs randomly sampled from the DMOZ directory (~5M URIs).

2. DMOZ – TLD: 2% of each TLD from DMOZ or 100 URIs, whichever is greater
   o 52 TLDs: (com 23,470), (de 6,332), (org 4,025), (uk 3,309), (net 2,073), (it 1,775), (jp 1,379), (ru 1,244), (fr 1,154), (pl 1,062), (au 764), (ca 642), (at 438), (edu 390), (cz 385), (tr 334), (info 319), (cn 278), (us 266), (nz 265), (es 238), (ar 213), (no 150), (br 149), (tw 141), (za 118), (fi 113), and 100 URIs for each of [ae, cat, cl, cu, eg, gov, id, in, ir, is, ke, kr, ma, mt, mx, my, na, pe, pk, pt, sa, to, uy, zw]

3. DMOZ – Languages: 100 URIs for each language
   o 24 languages: Icelandic, Portuguese, Catalan, Afrikaans, Arabic, Indonesian, Chinese (Simplified), Chinese (Traditional), Dutch, Spanish, French, Greek, Hindi, Italian, Japanese, Korean, Norwegian, Persian, Polish, Russian, Turkish, Ukrainian
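
The "2% or 100 URIs, whichever is greater" rule for the TLD sample can be written as a small helper. This is a sketch under assumptions: the slides do not say how fractional sizes were rounded (rounding up is assumed here) or what happens when a TLD has fewer than 100 URIs (the sample is capped at the population size here). The function names are hypothetical.

```python
import random


def tld_sample_size(population, percent=2, minimum=100):
    """Size of the sample for one TLD: percent of the population or minimum,
    whichever is greater, never more than the population itself."""
    by_percent = (population * percent + 99) // 100  # integer ceiling (assumed rounding)
    return min(population, max(minimum, by_percent))


def sample_tld(uris, seed=None):
    """Draw the sample for one TLD without replacement."""
    rng = random.Random(seed)
    return rng.sample(uris, tld_sample_size(len(uris)))


print(tld_sample_size(10000), tld_sample_size(3000), tld_sample_size(40))
```

Under this rule, any TLD with 5,000 or fewer URIs in the directory falls back to the 100-URI minimum, which matches the long list of TLDs sampled at exactly 100 URIs.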

Page 12:

URIs Samples Sources: Web Archive

• Query the full-text search interface of the web archives with two sets of query terms.

4. Top 1-grams from Bing
   o Most of them are English

5. Top 1000 query terms by Yahoo in 9 languages
   o We excluded general keywords such as Obama and Facebook.

Page 13:

URIs Samples Sources: Web Archive

Archive with full-text search

Archive  Chinese  English  French  German  Italian  Japanese  Korean  Portuguese  Spanish  Total  Top 1-Gram
AIT           26     2066    3512    3837     3321       119       2        2434     2141  12617        3953
BL           163     2354    2350    2240     2068       225     131        1940     2056   6430        3187
CAN           49      800     804     646      601        77     113         580      514   1351        1107
CR            54      706     697     703      701        74      19         599      600   1599        1201
CZ           363     1782    1578    1695     1519       577     114        1310     1278   6081        3360
CAT           28     2775    2496    2448     2280       209     129        2164     2429   8996        4241
PO            91     2460    3603    3081     3113        53      69        3267     3177  14126        5004
TW           357      178     176     165      157       106       7         198      119   1004         354
UK             0     2698    2009    2049     2046         0       0        1903     1871   8261        3431

Page 14:

URIs Samples Sources: User requests

• Sampling from user requests for archived web material

6. Sample from IA Wayback Machine log files
   o 1,000 URIs randomly sampled from Feb 22, 2012 to Feb 26, 2012.

7. Sample from Memento aggregator log files
   o 100 URIs randomly sampled from the LANL Memento Aggregator between 2011 and 2013.

Page 15:

General Coverage

Page 16:

Web Archive Growth Rate

Page 17:

TLD Sample Coverage

Page 18:

TLD per archive (TLD Sample)

Page 19:

TLD per archive (Fulltext search)

Page 20:

TLD across archives

Page 21:

Language distribution per archive

Page 22:

Query Routing Evaluation

Page 23:

Conclusions
• A new automatic technique to profile web archives using their available interfaces.
• The Internet Archive provides broad coverage.
• National archives have good coverage of their own domains.
• The evaluation showed that we can retrieve the full TimeMap in 84% of cases using only 3 archives.
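
To illustrate how a profile could drive query routing, the sketch below ranks archives by their coverage of a URI's top-level domain and queries only the top few. It is a hedged illustration of the general idea, not the routing method evaluated in the talk: the archive names, the coverage fractions in `PROFILE`, and the `route` function are all invented for the example.

```python
from urllib.parse import urlsplit

# Hypothetical profile: archive -> TLD -> fraction of sampled URIs found.
# These values are invented for illustration, not results from the talk.
PROFILE = {
    "archive_a": {"com": 0.9, "jp": 0.8, "uk": 0.8},
    "archive_b": {"uk": 0.7, "com": 0.1},
    "archive_c": {"jp": 0.3},
    "archive_d": {"com": 0.2, "uk": 0.2},
}


def route(uri, profile, k=3):
    """Return the k archives with the highest profiled coverage for the URI's TLD."""
    host = urlsplit(uri).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    ranked = sorted(profile, key=lambda name: profile[name].get(tld, 0.0), reverse=True)
    return ranked[:k]


print(route("http://www.bbc.co.uk/", PROFILE))
```

Sending the lookup only to the returned archives trades a small chance of missing mementos for fewer requests, which is the kind of trade-off the 84% figure above describes.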