![Page 1: Archive What I See Now](https://reader035.vdocuments.us/reader035/viewer/2022062520/56816326550346895dd39f2d/html5/thumbnails/1.jpg)
Archive What I See Now
Mat Kelly, Michael L. Nelson, Michele C. Weigle
Old Dominion University
{mkelly,mln,mweigle}@cs.odu.edu
Web Science and Digital Libraries Research Group
ws-dl.blogspot.com
![Page 2: Archive What I See Now](https://reader035.vdocuments.us/reader035/viewer/2022062520/56816326550346895dd39f2d/html5/thumbnails/2.jpg)
What’s the Problem?
• Web archives capture a lot but not everything
• Individuals’ interests may not be captured
• Timely capture is important
• Capture capability must be enabled for all
November 12, 2013, Salt Lake City, Utah | 2013 Archive-It Partner Meeting
![Page 3: Archive What I See Now](https://reader035.vdocuments.us/reader035/viewer/2022062520/56816326550346895dd39f2d/html5/thumbnails/3.jpg)
Use Case: Capturing Breaking Stories
• Calls for seed URIs are reactive
• Not quick enough for rapidly evolving events
![Page 4: Archive What I See Now](https://reader035.vdocuments.us/reader035/viewer/2022062520/56816326550346895dd39f2d/html5/thumbnails/4.jpg)
Use Case: Capturing Breaking Stories
• Intermediate mementos missed
• The story is incomplete
![Page 5: Archive What I See Now](https://reader035.vdocuments.us/reader035/viewer/2022062520/56816326550346895dd39f2d/html5/thumbnails/5.jpg)
Use Case: Capturing Breaking Stories
![Page 6: Archive What I See Now](https://reader035.vdocuments.us/reader035/viewer/2022062520/56816326550346895dd39f2d/html5/thumbnails/6.jpg)
The Amateur Archivist’s Approach
• Users take ad hoc approaches
  – Screenshots of Pages
• Why? Tools are hard.
  – Build more accessible tools
  – Appeal to standards (e.g., WARC, ISO 28500:2009)
  – Make interoperable
![Page 7: Archive What I See Now](https://reader035.vdocuments.us/reader035/viewer/2022062520/56816326550346895dd39f2d/html5/thumbnails/7.jpg)
The Institutional Dilemma
• Safety of Archives Requires $
• No $, No Institution
• Users’ Hard Drives Fail
  – No Access to Save-As files and Screenshots
• A hybrid approach is needed to leverage institutional safety, formats, and tech while still allowing direct user deposits
![Page 8: Archive What I See Now](https://reader035.vdocuments.us/reader035/viewer/2022062520/56816326550346895dd39f2d/html5/thumbnails/8.jpg)
Video Here
• Show use case where other tools cannot capture
  – e.g., behind authentication
  – Juxtapose to Archive.is, WebCite, Save Webpage As
![Page 9: Archive What I See Now](https://reader035.vdocuments.us/reader035/viewer/2022062520/56816326550346895dd39f2d/html5/thumbnails/9.jpg)
So we built it!
WARCreate – Google Chrome extension
• Create web archives from browser
• Capture personalized content
• Preserve on a whim
1. Mat Kelly and Michele C. Weigle, "WARCreate - Create Wayback-Consumable WARC Files from Any Webpage," In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2012). Washington, DC, June 2012, pp. 437-438.
2. Mat Kelly, Michele C. Weigle, Michael Nelson. "WARCreate - Create Wayback-Consumable WARC Files from Any Webpage," Digital Preservation 2012, Tools Demo Session: Web Archiving; 2012 Jul 25; Washington, DC.
![Page 10: Archive What I See Now](https://reader035.vdocuments.us/reader035/viewer/2022062520/56816326550346895dd39f2d/html5/thumbnails/10.jpg)
WARCreate – How it Works
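The diagram for this slide did not survive in the transcript. As a rough illustration of what a WARC `response` record looks like on disk, here is a minimal Python sketch. It is not WARCreate's code (WARCreate is a browser extension); the function name and the sample HTTP response are assumptions made for illustration, and only the record layout follows the WARC 1.0 format.

```python
import uuid
from datetime import datetime, timezone

CRLF = "\r\n"


def build_warc_response_record(target_uri: str, http_response: bytes) -> bytes:
    """Wrap raw HTTP response bytes (status line, headers, body) in a
    WARC 1.0 'response' record. Hypothetical helper, for illustration only."""
    warc_headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {target_uri}",
        # WARC-Date is a UTC timestamp in ISO 8601 form.
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/http; msgtype=response",
        # Content-Length counts the bytes of the record block (the HTTP message).
        f"Content-Length: {len(http_response)}",
    ]
    header_block = (CRLF.join(warc_headers) + CRLF + CRLF).encode("utf-8")
    # Each record ends with two CRLFs after the block.
    return header_block + http_response + (CRLF + CRLF).encode("utf-8")


# Made-up HTTP response standing in for what the browser received.
http_response = (
    b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hello</html>"
)
record = build_warc_response_record("http://example.com/", http_response)
```

A real archive would typically also carry a `warcinfo` record and `request` records; this sketch shows a single record only.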
![Page 11: Archive What I See Now](https://reader035.vdocuments.us/reader035/viewer/2022062520/56816326550346895dd39f2d/html5/thumbnails/11.jpg)
Preserving the Original Context
Liberated Data Doesn’t Give The Whole Picture
Facebook-Supplied Data Dump
- versus -
Archive created from WARCreate in Wayback
![Page 12: Archive What I See Now](https://reader035.vdocuments.us/reader035/viewer/2022062520/56816326550346895dd39f2d/html5/thumbnails/12.jpg)
Preserving the Original Context
The Target Controls What is Allowed
Using Scraping Tools (e.g., wget)
- versus -
Archive created from WARCreate in Wayback
![Page 13: Archive What I See Now](https://reader035.vdocuments.us/reader035/viewer/2022062520/56816326550346895dd39f2d/html5/thumbnails/13.jpg)
Preserving the Original Context
No Credentials, No Entry, No Archiving
A Crawler Has No Context
- versus -
Archive created from WARCreate in Wayback
![Page 14: Archive What I See Now](https://reader035.vdocuments.us/reader035/viewer/2022062520/56816326550346895dd39f2d/html5/thumbnails/14.jpg)
Preserving the Original Context
No Means No, if They Say and You Obey
IA/Heritrix Obey Robots
- versus -
Archive created from WARCreate in Wayback
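To make the robots point concrete, the sketch below uses Python's standard `urllib.robotparser` to show how a crawler that honors `robots.txt` decides to skip a page. The `robots.txt` content, the URLs, and the `ia_archiver` user-agent string are illustrative assumptions, not taken from the slides.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt for a site that blocks every crawler from /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def crawler_may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a robots-obeying crawler would be allowed to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return bool(parser.can_fetch(user_agent, url))


blocked = crawler_may_fetch(
    ROBOTS_TXT, "ia_archiver", "https://example.com/private/page"
)
allowed = crawler_may_fetch(
    ROBOTS_TXT, "ia_archiver", "https://example.com/public/page"
)
```

A browser-based capture is not subject to this check, because it archives what the user's browser has already loaded.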
![Page 15: Archive What I See Now](https://reader035.vdocuments.us/reader035/viewer/2022062520/56816326550346895dd39f2d/html5/thumbnails/15.jpg)
PROBLEM: Users don’t know what to do with WARCs
![Page 16: Archive What I See Now](https://reader035.vdocuments.us/reader035/viewer/2022062520/56816326550346895dd39f2d/html5/thumbnails/16.jpg)
So, again, we built it!
Web Archiving Integration Layer (WAIL)
• Heritrix, Wayback, etc. packaged for PC
• GUI front-end allows “One-Click Preservation”
• Provides means to replay WARCs
1. Mat Kelly, Michele C. Weigle, Michael Nelson. "Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving," Personal Digital Archiving 2013, Poster Session; 2013 Feb 21; College Park, MD.
2. Mat Kelly, Michael Nelson and Michele C. Weigle. "WARCreate and WAIL: WARC, Wayback and Heritrix Made Easy," Digital Preservation 2013, Workshops and Sessions: Web Archiving; 2013 Jul 24; Alexandria, VA.
![Page 17: Archive What I See Now](https://reader035.vdocuments.us/reader035/viewer/2022062520/56816326550346895dd39f2d/html5/thumbnails/17.jpg)
PROBLEM: Users want to preserve but store at institutions for safe keeping
PROBLEM: Even with replay, not everyone wants to use Chrome
![Page 18: Archive What I See Now](https://reader035.vdocuments.us/reader035/viewer/2022062520/56816326550346895dd39f2d/html5/thumbnails/18.jpg)
The Plan
1. Port
2. Add functionality in: …to upload WARCs to:
3. Implement Sequential Archiving
![Page 19: Archive What I See Now](https://reader035.vdocuments.us/reader035/viewer/2022062520/56816326550346895dd39f2d/html5/thumbnails/19.jpg)
Porting WARCreate to Firefox
• Disjoint extension/add-on APIs
  – Little logic can be re-used
• Problems with HTTP header capture in Chrome are trivial in Firefox
  – Chrome = highly asynchronous fetching
• JavaScript code to save to local file system from Chrome for WARCreate is re-usable
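One way to picture the asynchronous header-capture problem: header events for many resources arrive interleaved, so request and response headers have to be matched up by a request identifier before they can be written out together. The Python sketch below simulates that bookkeeping. It is an illustration under assumptions, not WARCreate's code; the event tuples, IDs, and function name are hypothetical.

```python
from collections import defaultdict


def pair_headers_by_request(events):
    """Group interleaved (request_id, kind, headers) events.

    Returns (completed, unfinished): completed maps request_id to a dict
    holding both 'request' and 'response' headers; unfinished holds
    requests for which only one side has been seen so far.
    """
    pending = defaultdict(dict)
    completed = {}
    for request_id, kind, headers in events:
        pending[request_id][kind] = headers
        entry = pending[request_id]
        if "request" in entry and "response" in entry:
            # Both halves have arrived; the pair is ready to be archived.
            completed[request_id] = pending.pop(request_id)
    return completed, dict(pending)


# Simulated event stream: responses do not arrive in request order.
events = [
    ("r1", "request", {"Host": "example.com"}),
    ("r2", "request", {"Host": "cdn.example.com"}),
    ("r2", "response", {"Content-Type": "image/png"}),
    ("r1", "response", {"Content-Type": "text/html"}),
    ("r3", "request", {"Host": "example.org"}),
]
completed, unfinished = pair_headers_by_request(events)
```

Here `r1` and `r2` end up paired even though their responses arrive out of order, while `r3` stays unfinished until its response is seen.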
![Page 20: Archive What I See Now](https://reader035.vdocuments.us/reader035/viewer/2022062520/56816326550346895dd39f2d/html5/thumbnails/20.jpg)
The Plan
1. Port ✓ In βeta now!
2. Add functionality in: …to upload WARCs to:
3. Implement Sequential Archiving
![Page 22: Archive What I See Now](https://reader035.vdocuments.us/reader035/viewer/2022062520/56816326550346895dd39f2d/html5/thumbnails/22.jpg)
Uploading WARCs: An Open Question
• Working with Archive-It to determine feasibility of user-provided WARCs
• Consideration of data integrity
• Should data be merged with Archive-It crawled WARCs?
  – How do we account for your www.facebook.com vs. my www.facebook.com?
• Privacy?
![Page 24: Archive What I See Now](https://reader035.vdocuments.us/reader035/viewer/2022062520/56816326550346895dd39f2d/html5/thumbnails/24.jpg)
Sequential Archiving?
• Like focused crawl but URIs defined on per-site basis to be comprehensive
  – Similar to Archive Facebook but generalized
• Implement into WARCreate
• Utilize per-site specification to keep tools from breaking ★

| Category | Facebook | Google+ | Twitter |
|---|---|---|---|
| personal stream | wall | posts | my tweets |
| global stream | news feed | streams | followees’ tweets |
| multimedia-photos | photos | photos | N/A |
| multimedia-videos | videos | videos | N/A |
| photo collection | albums | N/A | N/A |
| posts | notes | N/A | N/A |
| friends | friends | circles | following |

★ The Digital Libraries Approach
- versus -
Discovery & Scraping: The Information Retrieval Approach
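The per-site specification idea can be sketched as a lookup table that maps generic categories to each site's own section names. The Python below is a minimal illustration built from the rows on this slide. The site names used as keys (Facebook, Google+, Twitter) are inferred from the section names, and the structure and function name are assumptions, not the format of the actual specification.

```python
# Generic category -> site-specific section name; None marks "N/A".
SITE_SPEC = {
    "Facebook": {
        "personal stream": "wall",
        "global stream": "news feed",
        "multimedia-photos": "photos",
        "multimedia-videos": "videos",
        "photo collection": "albums",
        "posts": "notes",
        "friends": "friends",
    },
    "Google+": {
        "personal stream": "posts",
        "global stream": "streams",
        "multimedia-photos": "photos",
        "multimedia-videos": "videos",
        "photo collection": None,
        "posts": None,
        "friends": "circles",
    },
    "Twitter": {
        "personal stream": "my tweets",
        "global stream": "followees' tweets",
        "multimedia-photos": None,
        "multimedia-videos": None,
        "photo collection": None,
        "posts": None,
        "friends": "following",
    },
}


def sections_to_archive(site):
    """Return (category, section) pairs to archive in sequence for a
    recognized site, or None when the site has no specification."""
    spec = SITE_SPEC.get(site)
    if spec is None:
        return None  # unrecognized site: caller would fall back to scraping
    return [(cat, name) for cat, name in spec.items() if name is not None]


twitter_sections = sections_to_archive("Twitter")
unknown = sections_to_archive("example.org")
```

Keeping the mapping as data rather than code means a site redesign only requires updating the table, which is the maintenance trade-off the talk goes on to discuss.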
![Page 25: Archive What I See Now](https://reader035.vdocuments.us/reader035/viewer/2022062520/56816326550346895dd39f2d/html5/thumbnails/25.jpg)
Sequential Archiving = Lots of Maintenance
• Only (and optionally) applied on recognized sites, with scraping as fallback for establishing hierarchy
• The specification lives online; tools refer to it and so are always up to date
• Standardized spec* prototype is live online
* M. Kelly, An Extensible Framework for Creating Personal Archives of Web Resources Requiring Authentication, Aug 2012
![Page 26: Archive What I See Now](https://reader035.vdocuments.us/reader035/viewer/2022062520/56816326550346895dd39f2d/html5/thumbnails/26.jpg)
Summary
• Firefox WARCreate in Beta
  – Chrome WARCreate Users Can Currently Archive What They See Now
• Sequential Archiving Implemented in Chrome WARCreate, needs porting
• Next Big Hurdle: Working with Archive-It on WARC upload logistics
![Page 27: Archive What I See Now](https://reader035.vdocuments.us/reader035/viewer/2022062520/56816326550346895dd39f2d/html5/thumbnails/27.jpg)
Archive What I See Now
• Download Our Archiving Tools!
  – WARCreate for Chrome: Create WARC files from any web page from your browser
  – Web Archiving Integration Layer (WAIL): One-Click Preservation. Heritrix, Wayback and Others On Your PC!
  – In Beta, Available Soon!
• Share Your Use Cases for Capturing the Unpreserved and the Unpreservable
• Help Us Improve Our Tools, Give Feedback! http://bit.ly/wc-wail