WebRecorder.io Building a new archiving service for everyone!

Upload: rose-potter

Post on 19-Dec-2015

TRANSCRIPT

Page 1: WebRecorder.io Building a new archiving service for everyone!

WebRecorder.io

Building a new archiving service for everyone!

Page 5: WebRecorder.io Building a new archiving service for everyone!

What is WebRecorder.io?

On-Demand Archiving through the browser.

What you see is what you archive (WYSIWYA)

Available to anyone!

“Quality over Quantity” - High-Fidelity Replay of Web Content

Page 12: WebRecorder.io Building a new archiving service for everyone!

Current Service

Proof-of-Concept

Users can record a page and browse

Users can download the WARC after browsing

Users can upload any WARC and replay it

No content stored; WARCs are deleted after 30 mins.

Created last year as an experiment

You can use it at: https://webrecorder.io

Page 17: WebRecorder.io Building a new archiving service for everyone!

New WebRecorder.io Service

First new version up at beta.webrecorder.io for demo

Initially invite only to monitor capacity

User registration, login, individual collections

Collections available at beta.webrecorder.io/<user>/<coll>

Collections can be private, public, or shared privately (coming soon).

Page 18: WebRecorder.io Building a new archiving service for everyone!

Live Demo!

Page 24: WebRecorder.io Building a new archiving service for everyone!

Privacy Concerns

User responsible for their own archive, has full control

Collections private by default, but users may choose what to make public

For now, WARCs downloadable only by owner, though may change.

May have additional access levels: share read-only, share for recording, etc...

Cookies: Cookies are recorded, but not replayed

Looking for ideas/better ways to address privacy. Suggestions welcome!
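
One way to read "cookies are recorded, but not replayed" is that the stored response keeps its Set-Cookie headers while the replayed response drops them. The sketch below assumes that reading; the function name `replay_headers` and the header-list shape are hypothetical, not taken from the service.

```python
def replay_headers(recorded_headers):
    """Return recorded response headers with Set-Cookie entries removed.

    `recorded_headers` is a list of (name, value) pairs as stored at
    record time; the stored list itself is left untouched.
    """
    return [
        (name, value)
        for name, value in recorded_headers
        if name.lower() != "set-cookie"  # header names are case-insensitive
    ]


recorded = [
    ("Content-Type", "text/html"),
    ("Set-Cookie", "session=abc123; HttpOnly"),
]
replayed = replay_headers(recorded)
```

The recording still holds the cookie, so the owner's WARC is complete, but a viewer of the replay never receives it.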

Page 30: WebRecorder.io Building a new archiving service for everyone!

Goals/Features

Provide a flexible archiving service for high-fidelity web archiving.

Customizable UI, metadata and annotation support.

On-Demand Full-Text Search.

Multiple privacy options, custom sharing settings.

Multiple backends for storage.

A version that can also be hosted on custom hardware, not in “the cloud”

Page 31: WebRecorder.io Building a new archiving service for everyone!

Tools Used in WebRecorder.io

Built with open-source tools

pywb (python wayback) – https://github.com/ikreymer/pywb – Embedded in the web app as the front-end web service; handles URL rewriting with custom rules, WARC reading, and live web fetching.

warcprox – https://github.com/internetarchive/warcprox – Created by Noah Levitt of IA; an HTTP/S proxy that records HTTP traffic to WARCs.

Page 34: WebRecorder.io Building a new archiving service for everyone!

Help Wanted!

Looking for collaborators, developers, UI designers, archivists

If you ever wanted to participate in building an archiving service, here is your chance.

Sign up for the mailing list at webrecorder.io or request an invite at beta.webrecorder.io

You can also email [email protected]

Page 35: WebRecorder.io Building a new archiving service for everyone!

Addendum: How It Works

Symmetrical Archiving – server- and client-side URL rewriting for record and replay follow the same path

Easy part: HTML URL rewriting

Hard part: JavaScript

Attempt to emulate the original JS environment as much as possible, with customizable client-side hooks

Far from foolproof; Flash and Java applets are still problematic.
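
The "easy part", HTML URL rewriting, can be illustrated with a minimal sketch. This is not pywb's rewriter: the `/replay/` prefix, the function name and the regex are assumptions for illustration, and a real rewriter would use an HTML parser and cover many more attributes.

```python
import re
from urllib.parse import urljoin

# Hypothetical replay prefix; the real service's URL layout may differ.
REPLAY_PREFIX = "/replay/"

# Matches href="..." or src="..." with either quote style.
ATTR_RE = re.compile(r'''\b(href|src)=(["'])(.*?)\2''', re.IGNORECASE)


def rewrite_html(html, page_url, prefix=REPLAY_PREFIX):
    """Point every href/src in `html` back at the archive."""
    def repl(match):
        attr, quote, url = match.groups()
        if url.startswith(("#", "data:", "javascript:", "mailto:")):
            return match.group(0)  # nothing to fetch, leave untouched
        absolute = urljoin(page_url, url)  # resolve relative links
        return f"{attr}={quote}{prefix}{absolute}{quote}"
    return ATTR_RE.sub(repl, html)


page = '<a href="/about">About</a> <img src=\'pic.png\'> <a href="#top">Top</a>'
rewritten = rewrite_html(page, "http://example.com/index.html")
```

URLs that JavaScript builds at runtime never pass through a rewriter like this, which is why the slide calls JavaScript the hard part.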

Page 37: WebRecorder.io Building a new archiving service for everyone!

“Symmetrical Archiving”

User browses page through /record/ path → Page is recorded to WARC and indexed

User browses page through /replay/ path → Page is replayed from WARC using index

Attempt symmetry in capture and replay as much as possible.

Assumption: Dynamic content generated for /record/ = Dynamic content generated for /replay/
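
The record/replay symmetry can be sketched with a toy in-memory archive. Everything here is hypothetical: `ToyArchive`, the list standing in for a WARC, and the injected `fetch_live` callable standing in for the live web. The point is only that both paths go through the same handler and differ in where the bytes come from.

```python
class ToyArchive:
    """In-memory stand-in for a WARC file plus its index."""

    def __init__(self, fetch_live):
        self.fetch_live = fetch_live  # callable: url -> body (the "live web")
        self.records = []             # stand-in for the WARC: appended records
        self.index = {}               # url -> position of the latest record

    def handle(self, path):
        """Serve a /record/<url> or /replay/<url> request."""
        mode, _, url = path.lstrip("/").partition("/")
        if mode == "record":
            body = self.fetch_live(url)
            self.records.append((url, body))
            self.index[url] = len(self.records) - 1  # indexed immediately
            return body
        if mode == "replay":
            if url not in self.index:
                raise KeyError(f"not archived: {url}")
            return self.records[self.index[url]][1]
        raise ValueError(f"unknown mode: {mode!r}")


live_web = {"http://example.com/": "<h1>hello</h1>"}
archive = ToyArchive(live_web.__getitem__)

seen_while_recording = archive.handle("/record/http://example.com/")
live_web.clear()  # the live page disappears
seen_on_replay = archive.handle("/replay/http://example.com/")
```

Because the index is updated as part of recording, the page is replayable as soon as it has been recorded.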

Page 38: WebRecorder.io Building a new archiving service for everyone!

“Symmetrical Archiving”

/<coll>/record/ path ↔ url rewriting system ↔ fetch HTTP data ↔ recording proxy writes WARCs ↔ live web

/<coll>/ path ↔ url rewriting system ↔ fetch HTTP data ↔ read from WARC

Attempt symmetry in capture and replay as much as possible. Recorded content is instantly replayable.

Page 39: WebRecorder.io Building a new archiving service for everyone!

“Symmetrical Archiving”

/<coll>/record/ path ↔ url rewriting system ↔ fetch HTTP data ↔ recording proxy writes WARCs ↔ live web

/<coll>/ path ↔ url rewriting system ↔ fetch HTTP data ↔ read from WARC

URL rewriting is the hard part! Actually more like “emulating original page context” when running through a proxy/recording.

Page 46: WebRecorder.io Building a new archiving service for everyone!

“When symmetry breaks”

URLs change based on timestamp or date, e.g. ?_=<timestamp>

Possible Solution: Override Date(), server-side “fuzzy matching” ignoring certain query params

Flash video in a custom Flash SWF

Possible Solution: may be able to force HTML5; otherwise youtube-dl may download the Flash version, and the player is replaced with a custom one (FlowPlayer)

General black-box Flash content with hard-coded links

Possible Solution: No good one so far! Maybe shumway.js, a JavaScript Flash player from Mozilla?
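
Server-side “fuzzy matching” can be sketched as a lookup key that ignores cache-busting query parameters. The parameter names in `IGNORED_PARAMS` and the function `fuzzy_key` are assumptions for illustration; pywb's own rules are configured differently and are not reproduced here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical cache-busting parameters to ignore when matching.
IGNORED_PARAMS = frozenset({"_", "cb", "timestamp"})


def fuzzy_key(url, ignored=IGNORED_PARAMS):
    """Build a lookup key that drops ignored params and sorts the rest."""
    parts = urlsplit(url)
    kept = sorted(
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in ignored
    )
    # The fragment is dropped too: it is never sent to the server.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


recorded_key = fuzzy_key("http://example.com/api?_=1450000000&id=7")
replayed_key = fuzzy_key("http://example.com/api?id=7&_=1450999999")
other_key = fuzzy_key("http://example.com/api?id=8&_=1450999999")
```

At replay time the request made with a fresh timestamp maps to the same key as the recorded one, so the archived response can still be found.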

Page 47: WebRecorder.io Building a new archiving service for everyone!

“When symmetry breaks”

JavaScript-generated content “leaks” to the live web

Possible Solution: Extensive client-side URL rewriting

Checks for window.location or window.top

Possible Solution: Rewrite window.location → WB_wombat_location, window.top → WB_wombat_top
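
The server-side half of that rename can be sketched as a text substitution over the JavaScript source. This is a simplification under stated assumptions: the regex and `rewrite_js` are illustrative rather than pywb's actual rules, the `window.` qualifier is kept, and string literals and comments are not treated specially.

```python
import re

# Matches `window.location` and `window.top` as whole member accesses,
# so names like `window.topLevel` or `mywindow.location` are left alone.
MEMBER_RE = re.compile(r"\bwindow\.(location|top)\b")


def rewrite_js(source):
    """Rename window.location / window.top to their WB_wombat_ twins."""
    return MEMBER_RE.sub(lambda m: "window.WB_wombat_" + m.group(1), source)


original = "if (window.top != window) { window.location.href = url; }"
rewritten_js = rewrite_js(original)
```

The renamed properties only work if a client-side library defines them on the page, which is the role of the overrides listed on the wombat.js slide.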

Page 48: WebRecorder.io Building a new archiving service for everyone!

wombat.js rewriting library

The following are some of the possible overrides by wombat.js:

AJAX (XMLHttpRequest.open)

window.open

History.pushState / replaceState

Object.defineProperty() overrides on: document.domain, document.cookie

WB_wombat_location emulates window.location with rewriting (with server-side rewriting)

WB_wombat_top emulates window.top but hides container frame (with server-side rewriting)

Window postMessage()

Date() constructor

Seed Math.random with capture time

document.write()

setAttribute() / or mutation observers

appendChild() / replaceChild() / insertChild()

Page 49: WebRecorder.io Building a new archiving service for everyone!

pywb

wombat.js is part of pywb, a new open-source Python “wayback machine” implementation

Optional custom rules can be specified for any site by prefix or regex, specified in a YAML file.

Fuzzy matching rules: Specify significant query params

No config file required!

Out-of-the-box simple collection management tools for running an archive

More details at: https://github.com/ikreymer/pywb

Future updates will include improvements to rule customization.