aleph archives iipc presentation

WEB ARCHIVING BUCKET IIPC — Crawl Meeting Aleph Archives Toronto October 3rd, 2012 Marco Roy Project & Channel Manager [email protected] 1 Tuesday, October 2, 12

TRANSCRIPT

Page 1: Aleph Archives iipc presentation

WEB ARCHIVING BUCKET
IIPC — Crawl Meeting

Aleph Archives

Toronto
October 3rd, 2012

Marco Roy
Project & Channel Manager
[email protected]

Tuesday, October 2, 12

Page 2: Aleph Archives iipc presentation

Web Archiving Bucket
A set of tools to simplify Web Archiving

© ALEPH ARCHIVES 2012

RELEASES TIMELINE

Sep. 5th, 2012: WSDK 1.0.0

Dec. 24th, 2012: Cobalt 1.0.0

Page 3: Aleph Archives iipc presentation

Web Archiving Bucket

Software characteristics

User-Friendly

Up-To-Date

Highly Optimized

No Compromise

Cross-Platform (Windows, Linux, etc.): 32/64-bit

Adapted from Aleph’s Production Code

Page 4: Aleph Archives iipc presentation

WARC Software Development Kit

API for building Web Archiving software

Page 5: Aleph Archives iipc presentation

Web Archiving Bucket

WSDK

Key Benefits

Time and Resource saving

Ready for Cloud development

Build Highly Scalable software

Page 6: Aleph Archives iipc presentation

Web Archiving Bucket

WSDK

Technical Specifications

Multi-Core aware

Robust Networking Stack

Terseness (~300 KB): fast lexers, parsers, etc.

Carefully designed algorithms

Language   #mod   LoC
Erlang     17     2832
C          2      623
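To make the "fast lexers, parsers" item concrete, here is a minimal sketch of parsing one WARC record header block. It is written in Python for illustration only; WSDK itself is Erlang and C according to the slide, and the function name `parse_warc_header` and the sample record are assumptions, not part of WSDK. The sketch assumes CRLF line endings and does not handle folded continuation lines.

```python
from typing import Dict, Tuple


def parse_warc_header(block: bytes) -> Tuple[str, Dict[str, str]]:
    """Parse one WARC record header block into (version, fields).

    Hypothetical helper for illustration; not the WSDK API.
    """
    # The header block ends at the first blank line (CRLF CRLF).
    head, _, _ = block.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    if not version.startswith("WARC/"):
        raise ValueError(f"not a WARC record: {version!r}")
    fields: Dict[str, str] = {}
    for line in lines[1:]:
        # Split on the first colon only, so URLs in values stay intact.
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"malformed header line: {line!r}")
        # Field names are case-insensitive, so store them lowercased.
        fields[name.strip().lower()] = value.strip()
    return version, fields


# Made-up sample record, for demonstration only.
SAMPLE = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello\r\n\r\n"
)

version, fields = parse_warc_header(SAMPLE)
print(version, fields["warc-type"], fields["content-length"])
```

Reading `Content-Length` from the parsed fields is what lets a reader skip over the content block without scanning it.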

Page 7: Aleph Archives iipc presentation

Web Archiving Bucket

Multi-Core Speed Test

             1-Core       4-Core
Count Time   103.7 sec.   69.9 sec.

Architecture: x86_64
CPU op-mode: 32-bit, 64-bit
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 23
Stepping: 7
CPU MHz: 2499.876
BogoMIPS: 4999.75
Virtualisation: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K

4 WARCs: http://archive.org/details/testWARCfiles

WIDE-20110225183219005-04371-13730~crawl301.us.archive.org~9443.warc.gz 970MB

WIDE-20110225184020081-04372-13730~crawl301.us.archive.org~9443.warc.gz 957MB

WIDE-20110225210142891-04382-13730~crawl301.us.archive.org~9443.warc.gz 956MB

WIDE-20110225221304846-04388-13730~crawl301.us.archive.org~9443.warc.gz 969MB
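The timings and file sizes on this slide allow a rough speedup and throughput estimate. The sketch below assumes that both runs processed the same four WARC files and that the listed sizes are the compressed on-disk sizes; neither point is stated explicitly on the slide, and the variable names are illustrative.

```python
# Figures copied from the slide.
sizes_mb = [970, 957, 956, 969]   # the four .warc.gz files
time_1_core = 103.7               # seconds
time_4_core = 69.9                # seconds
cores = 4

total_mb = sum(sizes_mb)
speedup = time_1_core / time_4_core
efficiency = speedup / cores      # fraction of ideal linear scaling
throughput_1 = total_mb / time_1_core
throughput_4 = total_mb / time_4_core

print(f"total input:      {total_mb} MB")
print(f"speedup:          {speedup:.2f}x")
print(f"efficiency:       {efficiency:.2f}")
print(f"1-core rate:      {throughput_1:.2f} MB/s")
print(f"4-core rate:      {throughput_4:.2f} MB/s")
```

Under those assumptions the four-core run is about 1.48 times faster than the single-core run, which corresponds to roughly 37 MB/s versus 55 MB/s of compressed input.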

Linux Server

VIDEO

Page 8: Aleph Archives iipc presentation

Web Archiving Bucket

WSDK

Next release

Remote WARC manipulation API (REST)

WARC Writing Proxy

Page 9: Aleph Archives iipc presentation

COBALT
Web Archives Playback Cluster

Diagram: cobalt load balancer; nodes cobalt 01, cobalt 02, cobalt 03, ..., cobalt 08, cobalt 09, cobalt 10, ...

Page 10: Aleph Archives iipc presentation

Web Archiving Bucket

COBALT

Key Benefits

No Configuration

No Single Point of Failure (SPOF)

Fast Web Archives Access
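The slides do not say how Cobalt distributes requests or avoids a single point of failure. As one generic illustration of the idea, the sketch below shows round-robin selection that skips nodes marked as down. The class `RoundRobinPool` and its methods are hypothetical and are not Cobalt's API; only the node names come from the cluster diagram.

```python
from typing import Iterable, List, Set


class RoundRobinPool:
    """Round-robin node selection that skips unhealthy nodes.

    Generic illustration only; not how Cobalt is known to work.
    """

    def __init__(self, nodes: Iterable[str]) -> None:
        self.nodes: List[str] = list(nodes)
        self.healthy: Set[str] = set(self.nodes)
        self._next = 0

    def mark_down(self, node: str) -> None:
        self.healthy.discard(node)

    def mark_up(self, node: str) -> None:
        if node in self.nodes:
            self.healthy.add(node)

    def pick(self) -> str:
        # Try each node at most once per call, starting from the cursor.
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next]
            self._next = (self._next + 1) % len(self.nodes)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy node available")


pool = RoundRobinPool(["cobalt 01", "cobalt 02", "cobalt 03"])
pool.mark_down("cobalt 02")
picks = [pool.pick() for _ in range(4)]
print(picks)
```

With one node marked down, requests keep alternating between the remaining two, so the loss of a playback node does not interrupt access.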

Page 11: Aleph Archives iipc presentation

Web Archiving Bucket

COBALT

Technical Specifications

Clustered architecture

Automatic Resources Discovery

Playback Proxy by default

Modern and fast WARCs indexer
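The slides give no detail on Cobalt's indexer. As a rough sketch of what a WARC index generally records, the Python code below walks an uncompressed WARC byte string and notes the target URI, byte offset, and length of each record, so a playback component could later seek straight to a record. The names `index_warc` and `make_record` are hypothetical. Real `.warc.gz` files are usually compressed one record per gzip member, in which case the offsets would refer to the compressed members instead.

```python
from typing import List, Tuple


def index_warc(data: bytes) -> List[Tuple[str, int, int]]:
    """Return (target_uri, offset, length) for each record in `data`.

    Hypothetical helper; assumes uncompressed records with CRLF line endings.
    """
    entries = []
    pos = 0
    while pos < len(data):
        header_end = data.find(b"\r\n\r\n", pos)
        if header_end == -1:
            raise ValueError(f"truncated record header at offset {pos}")
        fields = {}
        # Skip the first line (the WARC version) and parse "Name: value" lines.
        for line in data[pos:header_end].decode("utf-8").split("\r\n")[1:]:
            name, _, value = line.partition(":")
            fields[name.strip().lower()] = value.strip()
        content_length = int(fields["content-length"])
        # Header block + blank line + content block + the two CRLFs ending a record.
        end = header_end + 4 + content_length + 4
        entries.append((fields.get("warc-target-uri", ""), pos, end - pos))
        pos = end
    return entries


def make_record(uri: str, body: bytes) -> bytes:
    """Build a tiny made-up WARC record for the demonstration."""
    header = (
        "WARC/1.0\r\n"
        "WARC-Type: resource\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    )
    return header.encode("utf-8") + body + b"\r\n\r\n"


first = make_record("http://example.com/a", b"first")
second = make_record("http://example.com/b", b"second!")
archive = first + second

index = index_warc(archive)
for uri, offset, length in index:
    print(uri, offset, length)
```

Because each record declares its own `Content-Length`, the indexer can jump from record to record without inspecting the content blocks.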

Page 12: Aleph Archives iipc presentation

aleph-archives.com

☛ webarchivingbucket.com
