classical model: web harvesting

10

Upload: abiola

Post on 14-Jan-2016

35 views

Category:

Documents


0 download

DESCRIPTION

Classical Model: Web Harvesting. Pull from queue. GET /. W/ARC -. text/css image/gif image/jpg video JavaScript. HTTP/1.0 200 OK. Differences Between a Crawler and a Browser. Browsers grab all embedded resources as soon as possible - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Classical Model:  Web Harvesting
Page 2: Classical Model:  Web Harvesting

Classical Model: Web Harvesting

W/ARC - GET /

HTTP/1.0 200 OK

text/cssimage/gifimage/jpgvideoJavaScript

Pull from queue

Page 3: Classical Model:  Web Harvesting

Differences Between a Crawler and a Browser

Browsers grab all embedded resources as soon as possible Typical behavior is a burst of traffic followed by long

pauses. Crawlers have to play by different rules

Typical behavior is sustained traffic. Can quickly overwhelm a website Must apply intentional delays

Must obey robots.txt rules

Page 4: Classical Model:  Web Harvesting

In Your Browser: Behind the Scenes

Page 5: Classical Model:  Web Harvesting

Browser Experiments

Integrate a browser into link extraction pipeline: Log all requests and queue in Heritrix

Headless browsers Smaller & Faster

No need to render the page visually

PhantomJS - built on webkit engine (Safari, iOS, Chrome, Android) Auto-QA, snapshot generation, scripted navigation…

Page 6: Classical Model:  Web Harvesting

A Different Approach

Merging browsers & crawlers.…

Page 7: Classical Model:  Web Harvesting

Recent Use Cases & Implementations

Open Planets (browser extractor module for H3):

https://www.github.com/openplanets/wap

INA (browser w/inline caching proxy; simulates user actions):

https://github.com/davidrapin/fantomas

NDIIPP/NDSA (a different hybrid approach…):

https://github.com/adam-miller/ExternalBrowserExtractorHTML

https://github.com/adam-miller/phantomBrowserExtractor (PhantomJS behavior scripts)

Page 8: Classical Model:  Web Harvesting

How much do we gain?

Traditional Link Extraction: Baseline Test 7444 URIs (200 response)

795 URIs (404 response

Browser only (full instance or scripted headless): ~30% less content

PhantomJS (WITH traditional link extractor): +24%

+ Significant improvement in unique URI detection

- Additional processing overhead

…but can distribute load to dedicated browser nodes

+ Browser downloads in a separate workflow, asynchronous from Heritrix

+ JavaScript analytics

Page 9: Classical Model:  Web Harvesting

Other Strategies/Implementations

Data Mining & Analytics Pre-Crawl Seed & Link Analysis

Link/Script Analysis during an Active Crawl

Post Crawl Link/Script Analysis, Patching & Auto QA

Native Feeds & Alternate Capture Methods Data format and context is as important as the content

E.g.

Snapshot Generation & Recording

Page 10: Classical Model:  Web Harvesting

This is NOT Your Grandfather’s Web

Kris Carpenter NegulescuDirector, Web

Internet Archive

kcarpenter (at) archive (dot) org