webrecorder.io building a new archiving service for everyone!

Post on 19-Dec-2015

WebRecorder.io

Building a new archiving service for everyone!

What is WebRecorder.io?

On-Demand Archiving through the browser.

What you see is what you archive (WYSIWYA)

Available to anyone!

“Quality over Quantity” - High-Fidelity Replay of Web Content

Current Service

Proof-of-Concept

Users can record a page and browse

Users can download the WARC after browsing

Users can upload a WARC and replay it back

No content stored, WARCs deleted after 30 mins.

Created last year as an experiment

You can use it at: https://webrecorder.io

New WebRecorder.io Service

First new version up at beta.webrecorder.io for demo

Initially invite only to monitor capacity

User registration, login, individual collections

Collections available at beta.webrecorder.io/<user>/<coll>

Collections can be private, public, or shared privately (coming soon).

Live Demo!

Privacy Concerns

User responsible for their own archive, has full control

Collections private by default, but users may choose what to make public

For now, WARCs downloadable only by owner, though may change.

May have additional access levels: share read-only, share for recording, etc...

Cookies: Cookies are recorded, but not replayed

Looking for ideas/better ways to address privacy. Suggestions welcome!

Goals/Features

Provide a flexible archiving service for high-fidelity web archiving.

Customizable UI, metadata and annotation support.

On-Demand Full-Text Search.

Multiple privacy options, custom sharing settings.

Multiple backends for storage.

A version that can also be hosted on custom hardware, not in “the cloud”

Tools Used in WebRecorder.io

Built with open-source tools

pywb – https://github.com/ikreymer/pywb - Python wayback, embedded in the web app as the front-end web service; handles URL rewriting with custom rules, WARC reading, and live web fetching.

warcprox – https://github.com/internetarchive/warcprox - Created by Noah Levitt of IA; an HTTP/S proxy which records HTTP traffic to WARCs.

Help Wanted!

Looking for collaborators, developers, UI designers, archivists

If you ever wanted to participate in building an archiving service, here is your chance.

Sign up for the mailing list on webrecorder.io or request an invite at beta.webrecorder.io

You can also email info@webrecorder.io

Addendum: How It Works

Symmetrical Archiving – server and client side URL rewriting for record and replay follow the same path

Easy part: HTML URL rewriting

Hard part: JavaScript

Attempt to emulate the original JS environment as much as possible, with customizable client-side hooks

Far from foolproof; Flash and Java applets are still problematic.

“Symmetrical Archiving”

User browses page through /record/ path → Page is recorded to WARC and indexed

User browses page through /replay/ path → Page is replayed from WARC using index

Attempt symmetry in capture and replay as much as possible.

Assumption: Dynamic content generated for /record/ = Dynamic content generated for /replay/

“Symmetrical Archiving”

/<coll>/record/ path ↔ url rewriting system ↔ fetch HTTP data ↔ recording proxy writes WARCs ↔ live web

/<coll>/ path ↔ url rewriting system ↔ fetch HTTP data ↔ read from WARC

Attempt symmetry in capture and replay as much as possible. Recorded content is instantly replayable.

Url rewriting is the hard part! Actually more like “emulating original page context” when running through a proxy/recording.
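
The “easy part” of the symmetry, HTML URL rewriting, might look roughly like the sketch below. It is only an illustration, not pywb's actual rewriter: the function names, the regex, and the exact prefix layout (`/<coll>/record/<url>` for recording, `/<coll>/<url>` for replay) are assumptions based on the paths shown on the slide.

```python
import re
from urllib.parse import urljoin

# Matches href="..." or src="..." attributes (hypothetical, deliberately simple;
# a real rewriter would use an HTML parser and handle many more attributes).
ATTR_RE = re.compile(r'(\b(?:href|src)=)(["\'])(.*?)\2', re.IGNORECASE)


def prefix_for(coll, mode):
    """Return the assumed path prefix for a collection in 'record' or 'replay' mode."""
    if mode == "record":
        return "/%s/record/" % coll
    if mode == "replay":
        return "/%s/" % coll
    raise ValueError("unknown mode: %r" % mode)


def rewrite_html(html, page_url, coll, mode):
    """Point every href/src back at the archive, using the same logic in both modes."""
    prefix = prefix_for(coll, mode)

    def repl(match):
        # Resolve relative links against the original page URL first.
        absolute = urljoin(page_url, match.group(3))
        return match.group(1) + match.group(2) + prefix + absolute + match.group(2)

    return ATTR_RE.sub(repl, html)


page = '<a href="/about">About</a><img src="logo.png">'
recorded = rewrite_html(page, "http://example.com/dir/page.html", "mycoll", "record")
replayed = rewrite_html(page, "http://example.com/dir/page.html", "mycoll", "replay")
```

Because both modes go through the same function and differ only in the prefix, a link followed during recording resolves to the same original URL that is looked up during replay.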

“When symmetry breaks”

Urls change based on timestamp or date, e.g. ?_=<timestamp>

Possible Solution: Override Date(), server-side “fuzzy matching” ignoring certain query params

Flash video in a custom Flash SWF

Possible Solution: may be able to force HTML5; otherwise youtube-dl may download the Flash version, to be replaced with a custom player (FlowPlayer)

General black-box Flash content with hard-coded links.

Possible Solution: No good one so far! Maybe shumway.js, a JavaScript Flash player from Mozilla?
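
Server-side “fuzzy matching” for timestamped URLs could be sketched as follows. This is a minimal illustration of the idea, not pywb's rule engine: the `fuzzy_key` helper and the `IGNORED_PARAMS` set are hypothetical, and `_` is used only because it is the example parameter on the slide.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical set of query params that change on every request
# and should not affect which capture is matched.
IGNORED_PARAMS = {"_"}


def fuzzy_key(url, ignored=IGNORED_PARAMS):
    """Build a lookup key that drops ignored params and sorts the rest."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    )
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


# At record time the page requested ?_=1418000000; at replay time the
# script generates a different timestamp, but both map to the same key.
index = {fuzzy_key("http://example.com/api?id=5&_=1418000000"): "capture-1"}
hit = index.get(fuzzy_key("http://example.com/api?_=1450000000&id=5"))
```

Here `hit` is the recorded capture even though the replay-time URL carries a different `_` value.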

“When symmetry breaks”

JavaScript generated content, “leaks” to live web

Possible Solution: Extensive client side url-rewriting

Checks for window.location or window.top

Possible Solution: Rewrite window.location → WB_wombat_location, window.top → WB_wombat_top
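
The server-side half of that rewrite can be pictured as a text substitution over the JavaScript source. The sketch below is an assumption-laden illustration rather than pywb's real rule: the regex and the `rewrite_js` name are hypothetical, and it keeps the `window.` prefix and renames only the property, which is one possible reading of the mapping on the slide.

```python
import re

# Hypothetical rule: rename window.location / window.top so that scripts
# read the emulated objects instead of the real (rewritten) page location.
JS_RE = re.compile(r"\bwindow\.(location|top)\b")


def rewrite_js(source):
    """Rename window.location and window.top references in JavaScript text."""
    return JS_RE.sub(r"window.WB_wombat_\1", source)


original = "if (window.top != window) { window.top.location = window.location.href; }"
rewritten = rewrite_js(original)
```

A plain regex like this misses aliased or dynamically built references (for example `var w = window; w.location`), which is one reason the client-side overrides in wombat.js are needed as well.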

wombat.js rewriting library

The following are some of the possible overrides by wombat.js:

AJAX (XMLHttpRequest.open)

window.open

History.pushState / replaceState

Object.defineProperty() overrides on: document.domain, document.cookie

WB_wombat_location emulates window.location with rewriting (with server-side rewriting)

WB_wombat_top emulates window.top but hides container frame (with server-side rewriting)

Window postMessage()

Date() constructor

Seed Math.random with capture time

document.write()

setAttribute() / or mutation observers

appendChild() / replaceChild() / insertChild()
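
The Date() and Math.random overrides exist to make script-generated URLs deterministic at replay time. wombat.js does this in the browser in JavaScript; the Python sketch below only illustrates the principle, and the `ReplayClock` class and `timestamped_url` helper are hypothetical names.

```python
import random


class ReplayClock:
    """Pin 'now' to the capture time and seed randomness from it (illustrative only)."""

    def __init__(self, capture_time):
        self.capture_time = capture_time
        # Same seed on every replay gives the same pseudo-random sequence.
        self._rng = random.Random(capture_time)

    def now(self):
        # Always report the capture time instead of the wall clock.
        return self.capture_time

    def rand(self):
        return self._rng.random()


def timestamped_url(clock):
    """Mimic a script that appends a cache-busting ?_=<timestamp> param."""
    return "http://example.com/api?_=%d" % clock.now()


first_replay = ReplayClock(1418000000)
second_replay = ReplayClock(1418000000)
same_url = timestamped_url(first_replay) == timestamped_url(second_replay)
```

With the clock and the random seed fixed to the capture time, every replay generates the same request URLs, so they can be found in the index without fuzzy matching.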

pywb

wombat.js is part of pywb, a new open source python “wayback machine” implementation

Optional custom rules can be specified for any site by prefix or regex, specified in yaml file.

Fuzzy matching rules: Specify significant query params

No config file required! Out-of-the-box simple collection management tools for running an archive

More details at: https://github.com/ikreymer/pywb

Future updates will include improvements to rule customization.
