![Page 1: Archive-It Architecture Introduction](https://reader035.vdocuments.us/reader035/viewer/2022062722/56813ad5550346895da30c9d/html5/thumbnails/1.jpg)
Archive-It Architecture Introduction
April 3, 2006
Dan Avery
Internet Archive
![Page 2: Archive-It Architecture Introduction](https://reader035.vdocuments.us/reader035/viewer/2022062722/56813ad5550346895da30c9d/html5/thumbnails/2.jpg)
Archive-It Components
•Crawling
•User Interface
•Storage
•Playback
•Text Indexing
•Integration
![Page 3: Archive-It Architecture Introduction](https://reader035.vdocuments.us/reader035/viewer/2022062722/56813ad5550346895da30c9d/html5/thumbnails/3.jpg)
Component Integration
![Page 4: Archive-It Architecture Introduction](https://reader035.vdocuments.us/reader035/viewer/2022062722/56813ad5550346895da30c9d/html5/thumbnails/4.jpg)
Crawling
•Heritrix ( http://crawler.archive.org/ )
•Java application
•Open source (LGPL)
•Crawls for completeness/depth
•Highly configurable
![Page 5: Archive-It Architecture Introduction](https://reader035.vdocuments.us/reader035/viewer/2022062722/56813ad5550346895da30c9d/html5/thumbnails/5.jpg)
Crawling - Distributed Crawling
•Heritrix Cluster Controller
•Java component - open source - developed by IA
•http://crawler.archive.org/hcc
•Provides proxy access to pool of Heritrix instances through JMX interface
•Provides crawler control and status
•Currently controlling 33 crawler instances on three commodity dual-Opteron machines; upper bound unknown
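As a rough illustration of the pooling idea on this slide, the sketch below models a controller that hands a crawl job to any idle crawler in a pool and reports status. It is plain Python, not the HCC or JMX API; `Crawler`, `CrawlerPool`, and the `host:port`-style names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Crawler:
    """Stand-in for one remote crawler instance (hypothetical, not a Heritrix class)."""
    name: str
    current_job: Optional[str] = None

    @property
    def idle(self) -> bool:
        return self.current_job is None


@dataclass
class CrawlerPool:
    """Single point of contact for a pool of crawlers."""
    crawlers: list = field(default_factory=list)

    def start_job(self, job_id: str) -> Crawler:
        # Hand the job to the first idle crawler; fail loudly if none is free.
        for crawler in self.crawlers:
            if crawler.idle:
                crawler.current_job = job_id
                return crawler
        raise RuntimeError("no idle crawler available")

    def finish_job(self, job_id: str) -> None:
        # Release whichever crawler is running the given job.
        for crawler in self.crawlers:
            if crawler.current_job == job_id:
                crawler.current_job = None
                return
        raise KeyError(job_id)

    def status(self) -> dict:
        # Map crawler name -> job id (None when idle).
        return {c.name: c.current_job for c in self.crawlers}


# 33 instances spread over three hosts, mirroring the numbers on the slide.
pool = CrawlerPool([Crawler(f"host{h}:{p}") for h in range(3) for p in range(11)])
assigned = pool.start_job("job-1")
```

Callers only ever talk to the pool, so adding or removing crawler instances does not change the calling code.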
![Page 6: Archive-It Architecture Introduction](https://reader035.vdocuments.us/reader035/viewer/2022062722/56813ad5550346895da30c9d/html5/thumbnails/6.jpg)
Archive-It Web Application
• User Interface and Crawl Scheduling
• Gets seed URLs and crawl parameters from users
• Schedules new periodic crawls
• Talks to crawler pool through HCC
• Provides access, search, and crawl history UI
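The periodic scheduling described above can be sketched as a check for crawls whose next run time has passed. This is a minimal illustration, not the Archive-It web application's code; the frequency names, the `schedule` record layout, and the collection names are all assumptions.

```python
from datetime import datetime, timedelta

# Hypothetical frequency names; the slide does not list the real options.
FREQUENCIES = {"daily": timedelta(days=1), "weekly": timedelta(weeks=1)}


def next_run(last_run: datetime, frequency: str) -> datetime:
    """Time at which a periodic crawl should run again."""
    return last_run + FREQUENCIES[frequency]


def due_crawls(schedule: list, now: datetime) -> list:
    """Return the names of the scheduled crawls whose next run time has passed."""
    return [
        entry["name"]
        for entry in schedule
        if next_run(entry["last_run"], entry["frequency"]) <= now
    ]


# Each entry pairs user-supplied seed URLs with a crawl frequency.
schedule = [
    {"name": "state-agencies", "seeds": ["http://example.org/"],
     "frequency": "weekly", "last_run": datetime(2006, 3, 20)},
    {"name": "news", "seeds": ["http://example.com/"],
     "frequency": "daily", "last_run": datetime(2006, 4, 3)},
]
due = due_crawls(schedule, now=datetime(2006, 4, 3, 12, 0))
```

In a setup like the one on this slide, each due crawl would then be handed to the crawler pool through HCC.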
![Page 7: Archive-It Architecture Introduction](https://reader035.vdocuments.us/reader035/viewer/2022062722/56813ad5550346895da30c9d/html5/thumbnails/7.jpg)
Storage
•archive.org ARC repository
•custom Perl system
•simple storage on primary/backup pairs
•monthly MD5 digest verification
•robust, non-proprietary file format
•Alexandria (Egypt)/Amsterdam
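The digest check on a primary/backup pair can be sketched with Python's standard `hashlib`. The actual repository is a custom Perl system, so this only illustrates the idea; `md5_of`, `verify_pair`, and the notion of a separately recorded digest are assumptions.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pair(primary: Path, backup: Path, recorded_md5: str) -> dict:
    """Compare both copies of a file against the digest recorded at ingest."""
    return {
        "primary_ok": md5_of(primary) == recorded_md5,
        "backup_ok": md5_of(backup) == recorded_md5,
    }
```

If exactly one copy fails the check, the other copy can be used to repair it; if both fail, the file needs manual attention.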
![Page 8: Archive-It Architecture Introduction](https://reader035.vdocuments.us/reader035/viewer/2022062722/56813ad5550346895da30c9d/html5/thumbnails/8.jpg)
Access
• Internet Archive Wayback Machine
• Replaying archived web pages since 2001
• Current IA version written in Perl and C, with components distributed across various machines
• Not open source, but open source beta (in Java) available now
![Page 9: Archive-It Architecture Introduction](https://reader035.vdocuments.us/reader035/viewer/2022062722/56813ad5550346895da30c9d/html5/thumbnails/9.jpg)
Full-Text Indexing
•Nutch (http://nutch.org)
•NutchWAX (http://archive-access.sf.net) additions create and search indexes of stored ARC files
•Standard text search plus link analysis
•Can sort results by date instead of relevance, which is useful for individual archives
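The difference between relevance ordering and date ordering can be shown with a small sketch. It does not use the Nutch or NutchWAX API; the `hits` records, their field names, and the `rank` helper are hypothetical.

```python
from datetime import date

# Hypothetical search hits: each has a capture date and a relevance score.
hits = [
    {"url": "http://example.org/a", "captured": date(2006, 1, 10), "score": 0.42},
    {"url": "http://example.org/b", "captured": date(2006, 3, 2), "score": 0.17},
    {"url": "http://example.org/c", "captured": date(2005, 11, 5), "score": 0.88},
]


def rank(results: list, by: str = "relevance") -> list:
    """Order hits by relevance score (default) or by capture date, newest first."""
    key = "captured" if by == "date" else "score"
    return sorted(results, key=lambda hit: hit[key], reverse=True)


by_relevance = [hit["url"] for hit in rank(hits)]
by_date = [hit["url"] for hit in rank(hits, by="date")]
```

Within a single collection, date ordering lets a user follow how a topic was captured over time rather than seeing only the highest-scoring pages.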
![Page 10: Archive-It Architecture Introduction](https://reader035.vdocuments.us/reader035/viewer/2022062722/56813ad5550346895da30c9d/html5/thumbnails/10.jpg)
Text Indexing Challenges
•Some parts are distributable, some are not
•Incremental indexing - goal of new crawls in index within 72 hours
•Working on a map/reduce version usable by Archive-It, targeted for July
•In the meantime, a lot of workarounds
![Page 11: Archive-It Architecture Introduction](https://reader035.vdocuments.us/reader035/viewer/2022062722/56813ad5550346895da30c9d/html5/thumbnails/11.jpg)
Integration
•Group of Perl and bash scripts - planning more complex than the execution
•Most components available individually
•Decentralized control, centralized monitoring
•Each component operates almost entirely independently
![Page 12: Archive-It Architecture Introduction](https://reader035.vdocuments.us/reader035/viewer/2022062722/56813ad5550346895da30c9d/html5/thumbnails/12.jpg)
The Big Picture
![Page 13: Archive-It Architecture Introduction](https://reader035.vdocuments.us/reader035/viewer/2022062722/56813ad5550346895da30c9d/html5/thumbnails/13.jpg)
Future Challenges
•Crawler trap detection
•Scalability
•Current setup can accommodate 300 partners at current crawling rates
•During pilot we crawled/indexed/stored just over 100,000,000 documents (~4TB) in eight weeks
•More machines can be easily added to storage and crawling clusters
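The pilot figures above imply some useful averages. The arithmetic below assumes decimal units (1 TB = 10^12 bytes) and treats "just over 100,000,000 documents" and "~4TB" as exact, so the results are rough estimates only.

```python
documents = 100_000_000      # documents crawled, indexed, and stored in the pilot
total_bytes = 4 * 10**12     # ~4 TB, assuming decimal terabytes
days = 8 * 7                 # eight weeks

avg_doc_kb = total_bytes / documents / 1000   # average document size in KB
docs_per_day = documents / days               # sustained documents per day
docs_per_second = docs_per_day / 86_400       # sustained documents per second
gb_per_day = total_bytes / days / 10**9       # storage growth per day in GB

print(f"{avg_doc_kb:.0f} KB/doc, {docs_per_day:,.0f} docs/day, "
      f"{docs_per_second:.1f} docs/s, {gb_per_day:.1f} GB/day")
```

That works out to roughly 40 KB per document and about 1.8 million documents (around 71 GB) per day across the whole pilot.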
![Page 14: Archive-It Architecture Introduction](https://reader035.vdocuments.us/reader035/viewer/2022062722/56813ad5550346895da30c9d/html5/thumbnails/14.jpg)
Scalability
•Current Nutch is between versions
•Old version has some non-distributable pieces
•New version is much more distributable and scalable (map/reduce - Hadoop), but not ready for incremental indexing
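To show why a map/reduce formulation distributes well, here is a toy inverted-index build in plain Python. It is not the Hadoop or Nutch API; `map_phase`, `reduce_phase`, and the sample documents are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(doc_id: str, text: str):
    """Emit one (term, doc_id) pair per distinct term in a document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def reduce_phase(pairs) -> dict:
    """Group the emitted pairs by term into sorted posting lists."""
    index = defaultdict(list)
    for term, doc_id in pairs:
        index[term].append(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = {
    "doc1": "web archive crawl",
    "doc2": "archive storage",
}

# Each document is mapped independently, so this step could run on many machines.
pairs = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
index = reduce_phase(pairs)
```

The map step needs no shared state between documents, which is what makes it easy to spread across a cluster; only the grouping in the reduce step requires bringing data for the same term together.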
![Page 15: Archive-It Architecture Introduction](https://reader035.vdocuments.us/reader035/viewer/2022062722/56813ad5550346895da30c9d/html5/thumbnails/15.jpg)
Looking ahead
•After basic UI/archiving/indexing...
•Time-based search UI
•Analyzing archives for research and ongoing collection improvement
•Content classification
•Rate of change
•New site suggestions
![Page 17: Archive-It Architecture Introduction](https://reader035.vdocuments.us/reader035/viewer/2022062722/56813ad5550346895da30c9d/html5/thumbnails/17.jpg)
RLG’s Web Archiving Program
•Collaborative collection development.
•Descriptive metadata for web archives.
•Usability/user studies
•Intellectual property concerns
•Web Archiving 101
•Web archiving services and software