July 25, 2012. Arlington, Virginia. Digital Preservation 2012. warcreate.com
WARCreate: Create Wayback-Consumable WARC Files from Any Webpage
Mat Kelly, Michele C. Weigle, Michael L. Nelson
{mkelly,mweigle,mln}@cs.odu.edu
Old Dominion University; Norfolk, VA
What is WARCreate?
• Google Chrome extension
• Creates WARC files
• Enables preservation by users from their browser
• First steps in bringing Institutional Archiving facilities to the PC
Target Content
• Unreachable by web crawlers
  – Behind authentication
  – Not listed in search engines (Deep Web)
• Private
  – We don’t want our bank statements in Wayback
• Non-pertinent to public
  – Others have little interest in our Facebook comments
Preserving More!
• Much digital information is needlessly lost
• User chooses what they deem important
• Compatible with standard archiving tools.
WYSIWYG
Facebook-Supplied Data Dump
Archive created from WARCreate in Wayback
WYSIWYG
Using Scraping Tools (e.g. wget)
Archive created from WARCreate in Wayback
WYSIWYG
A Crawler Has No Context
Archive created from WARCreate in Wayback
WYSIWYG
IA/HERITRIX OBEY ROBOTS
Archive created from WARCreate in Wayback
Goals
• Make it easy to use (GUI-based, no cmd line)
• Make it useful (fill the need)
• Demonstrate novelty of browser-instigated preservation
• Show value of WARC format for Personal Web preservation
• Bring WARC format to Personal Digital Archiving
Creating a WARC
I’ve Made a WARC. Now what?
• What you do with the archive is up to you.
  – Install it in your local Wayback instance
• Who has their own Wayback Instance!?
  – Wayback is free & open source
• That seems like a lot of work!
  – One additional reason for users NOT to preserve what they would like archived
WARC Creation & Replay
1. User visits a website using their browser
2. WARCreate captures the HTTP headers
3. User selects the “Generate WARC” button in WARCreate
4. WARC generated, saved locally to a directory accessible to the local Wayback
5. Local Wayback instance indexes WARC
6. User accesses local Wayback to view preserved content
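The capture and generation steps above can be sketched in a few lines. This is not WARCreate's own code (the extension runs inside Chrome); it is a minimal Python illustration of wrapping captured HTTP response headers and a page body in a WARC/1.0 `response` record. The function names, the example URL, and the output filename are assumptions made for illustration.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(warc_type, block, extra_headers=None, when=None):
    """Serialize one WARC/1.0 record: version line, named fields, blank line, block, two CRLFs."""
    when = when or datetime.now(timezone.utc)
    headers = [
        ("WARC-Type", warc_type),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date", when.strftime("%Y-%m-%dT%H:%M:%SZ")),
    ]
    headers.extend(extra_headers or [])
    # Content-Length counts the bytes of the record block only.
    headers.append(("Content-Length", str(len(block))))
    head = "WARC/1.0\r\n" + "".join("%s: %s\r\n" % h for h in headers)
    return head.encode("utf-8") + b"\r\n" + block + b"\r\n\r\n"


def build_response_record(uri, http_headers, body, when=None):
    """Wrap captured HTTP headers (status line included, no trailing blank line) and a body."""
    block = http_headers.encode("iso-8859-1") + b"\r\n\r\n" + body
    extra = [
        ("WARC-Target-URI", uri),
        ("Content-Type", "application/http; msgtype=response"),
    ]
    return build_warc_record("response", block, extra, when)


def write_warc(path, records):
    """Save records to a local file, e.g. in a directory the local Wayback can index."""
    with open(path, "wb") as f:
        for record in records:
            f.write(record)
```

A call such as `write_warc("example.warc", [build_response_record("http://example.com/", "HTTP/1.1 200 OK\r\nContent-Type: text/html", b"<html></html>")])` would produce a one-record file; a real capture would also carry a `warcinfo` record and one record per embedded resource.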
Suite Installation & Interaction
• Drag & drop .zip to hard drive
• Start relevant services using GUI
• Execute WARCreate process
• View Archive at http://localhost/wayback
Replay of Preserved Twitter page
And My Bank Statements?
• Preserved content:
  – never leaves WARC files
  – never leaves local machine
• WARCreate provides preliminary encoding/encryption support
• Wayback instance is hosted on your own machine – no external access by default
Why Use a Client-Side Server?
• Server scripts do what JS can’t
• Can reside on your machine!
• Controls are GUI based
• Resource fetching w/o XSS issues
[Diagram] XAMPP-Based Personal Web Archiving Suite: Local Wayback Instance, WARCreate Server-Side Support, Memento Proxy
[Diagram] Built On: … Tomcat, Apache
Extras: Memento Support
• Suite includes a tailored Timegate
• Memento abstraction is beyond WARC
• Point MementoFox (or other Memento tools) to localhost
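As a rough illustration of what a Memento client sends when pointed at localhost, the sketch below builds (but does not send) a Timegate request carrying the desired date in an `Accept-Datetime` header. The Timegate path `http://localhost/timegate/` and the function name are hypothetical; the slides only state that the Timegate runs on localhost.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# Hypothetical location of the suite's Timegate; the slides only say "localhost".
TIMEGATE = "http://localhost/timegate/"


def timegate_request(original_url, desired):
    """Build (without sending) a Timegate request for original_url near `desired`."""
    # Memento datetime negotiation carries the desired date in Accept-Datetime,
    # formatted as an HTTP date in GMT.
    http_date = format_datetime(desired.astimezone(timezone.utc), usegmt=True)
    return Request(TIMEGATE + original_url, headers={"Accept-Datetime": http_date})
```

Passing the returned object to `urllib.request.urlopen` would perform the negotiation against a running Timegate, which then selects the memento closest to the requested date.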
How it All Relates
[Diagram] Browser Extensions (in BROWSER): WARCreate, MementoFox
[Diagram] WARCreate generates WARC file:
WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2012-07-15T22:15:59.485Z
WARC-Filename: 2220471175c820fee3fec986040ebd1f.warc
[Diagram] Local Wayback Instance: Index WARCs
[Diagram] Local Timegate: Send Desired Date; Memento negotiated & returned
Personal Archives Accessible at localhost
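The `warcinfo` header in the diagram is plain text, so an indexer can read it line by line. The sketch below is an illustrative parser, not part of WARCreate or Wayback; `parse_warc_headers` and `SAMPLE` are names introduced here, and the sample text is the header shown on the slide.

```python
def parse_warc_headers(raw):
    """Split a WARC header block into its version line and a dict of named fields."""
    lines = raw.replace("\r\n", "\n").strip().split("\n")
    version, fields = lines[0], {}
    for line in lines[1:]:
        if not line:
            break  # a blank line ends the header block
        # Split on the first colon only, so timestamps keep their own colons.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return version, fields


# The warcinfo header from the slide, with CRLF line endings restored.
SAMPLE = (
    "WARC/1.0\r\n"
    "WARC-Type: warcinfo\r\n"
    "WARC-Date: 2012-07-15T22:15:59.485Z\r\n"
    "WARC-Filename: 2220471175c820fee3fec986040ebd1f.warc\r\n"
)

version, fields = parse_warc_headers(SAMPLE)
```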
Contribution of Work
• Facilitate browser-based Personal Web Archiving
• Determine feasibility of fully Client-Side Preservation
• Integrate with existing tools for establishing use cases
WARCreate: Create Wayback-Consumable WARC Files from Any Webpage
http://WARCreate.com
Backup Slides
Future Work
• Decouple from “server”
• Refine Memento integration
• Reference full WARC spec
• Built-in WARC validation
• Built-in replay
• Compression
• Optimization (removing duplicates)
• …many more
Extras: Configuration Sanity Check
• Server scripts make up for JavaScript shortcomings
• The server can reside on your machine!
• Setup, Start, Stop are GUI-based
In WARCreate:
✗ WARC Validation
✗ AJAX XSS Circumvention
✗ HTML5 Sandbox Escaping
✗ Memento Support
✗ Local Wayback Instance
Extras: Configuration Sanity Check
+ Apache allows generated WARCs to be validated
+ JavaScript cannot write to disk, server-side scripts can
+ Server prevents hot-linking & has security
= Content better preserved using server techs
In WARCreate:
✓ WARC Validation
✓ AJAX XSS Circumvention
✓ HTML5 Sandbox Escaping
? Memento Support
✗ Local Wayback Instance
Extras: Configuration Sanity Check
• Memento requires Wayback
  Wayback requires Tomcat
  ∴ Memento requires Tomcat
• Memento Timegate req’s Python + modules (pre-packaged + included)
In WARCreate:
✓ WARC Validation
✓ AJAX XSS Circumvention
✓ HTML5 Sandbox Escaping
✓ Memento Support
✓ Local Wayback Instance
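The dependency reasoning on this slide (Memento requires Wayback, Wayback requires Tomcat, so Memento requires Tomcat) is a transitive closure. A small sketch, assuming the `REQUIRES` table below as an illustrative encoding of the slide's statements rather than any configuration file shipped with the suite:

```python
# Direct requirements as stated on the slide (illustrative encoding).
REQUIRES = {
    "Memento": {"Wayback"},
    "Wayback": {"Tomcat"},
    "Memento Timegate": {"Python"},
}


def all_requirements(component, requires=REQUIRES):
    """Return every direct and indirect requirement of a component."""
    seen = set()
    stack = [component]
    while stack:
        for dep in requires.get(stack.pop(), ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen
```

Here `all_requirements("Memento")` contains both Wayback and Tomcat, which is the slide's conclusion.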