sn@tch cni fall 2014
Sn@tch: An Archiving and Analysis
Service for Global News
Todd Grappone @liber8er
Sharon Farb @farbthink
Martin Klein @mart1nkle1n
Peter Broadwell @peterbroadwell
Digital ephemera
collections
• Collected by researchers
• Donated by activists
• Include images, audio,
video, scanned
documents, social media,
server logs
International Collecting
• 829 digitally recorded Iranian dissident news programs
• 9,166 other videos from the Iranian Green Movement
• 29,441 digital photographs from the Green Movement
• 543 documents from Tahrir Square
News and Perspectives
The UCLA NewsScape:
• >228,000 hours of TV news
• Recorded 2005-present
• 13 countries, 9 languages
• 38 networks
• Searchable by captions, on-screen text, named entities
• How to incorporate social media into this variety of perspectives?
A Brief History of Timeliness
• Twitter archive at the Library of Congress [1]
• Last public update from January 4th 2013
• ~170 billion tweets, > 130 TB compressed (late 2012)
• Single search against 2006-2010 data may take up to 24 hours
• Twitter data access at Massachusetts Institute of Technology,
Laboratory for Social Machines [2]
• Public announcement from October 1st 2014
[1] http://blogs.loc.gov/loc/2013/01/update-on-the-twitter-archive-at-the-library-of-congress/
[2] https://blog.twitter.com/2014/investing-in-mit-s-new-laboratory-for-social-machines
A Brief History of Timeliness
In case you missed it:
• Twitter makes full archive
of tweets available,
indexed
• Great, problem solved?
• How about deleted
tweets?
• Real-time capture of
embedded resources?
https://blog.twitter.com/2014/building-a-complete-tweet-index
A Brief History of Timeliness
• Many initiatives to capture Twitter data
• Live, after an event, both
• Mostly ad-hoc efforts, rarely institutionalized
• Operation often requires programming or sys admin skills
• Deen Freelon’s (American University) incomplete list of tools:
https://docs.google.com/document/d/1UaERzROI986HqcwrBDLaqGG8X_lYwctj6ek6ryqDOiQ/
A Brief History of Timeliness
Social Feed Manager (Dan Chudnov, GWU); as presented at #cni13f
http://social-feed-manager.readthedocs.org/
A Brief History of Timeliness
twarc (Ed Summers, MITH); used for Ferguson data
http://inkdroid.org/journal/2014/08/30/a-ferguson-twitter-archive/
http://files.archivists.org/conference/nola2013/twitter/twarc-saa13.htm
We Can
Remember It for
You Wholesale
I. Real-time capture of tweets plus pro-active archiving of embedded resources
II. Rapid analysis, real-time opportunities
III. Collection-agnostic linking
Remembrance of Tweets/Links Past
• Utilize GWU’s Social Feed Manager
• Filter by keywords, user handles, location, time, etc.
• Store raw tweets
• Extract and archive embedded URIs
• Utilize pro-active archiving solutions: Internet Archive,
archive.today
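The capture steps above can be sketched roughly as follows. This is a minimal illustration, not Social Feed Manager's code: it assumes tweets stored as JSON in the Twitter REST API v1.1 shape (an `entities` object with `urls` and `media` lists carrying `expanded_url`), and it only builds the Internet Archive "Save Page Now" request URL rather than sending it. The function names are hypothetical.

```python
# Assumption: Internet Archive "Save Page Now" accepts GET requests at this prefix.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def embedded_uris(tweet):
    """Return the expanded URIs in a tweet's entities, in order, without duplicates."""
    found = []
    entities = tweet.get("entities", {})
    for key in ("urls", "media"):
        for item in entities.get(key, []):
            # Prefer the expanded form over the t.co short link when present.
            uri = item.get("expanded_url") or item.get("url")
            if uri and uri not in found:
                found.append(uri)
    return found


def save_request_url(uri):
    """Build the URL that would ask the Internet Archive to capture `uri`."""
    return SAVE_ENDPOINT + uri


if __name__ == "__main__":
    sample = {
        "text": "photo from the square http://t.co/abc",
        "entities": {
            "urls": [
                {"url": "http://t.co/abc", "expanded_url": "http://yfrog.com/h02gvclj"}
            ]
        },
    }
    for uri in embedded_uris(sample):
        print(save_request_url(uri))
```

Issuing the request at capture time, rather than years later, is the point of the pro-active approach.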
Remembrance of Tweets/Links Past
• UCLA’s dataset about the Egyptian revolution
• More than 400k tweets
• Approx. 50k unique users
• Tweets originated within 200 miles of Cairo
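A geographic filter like the 200-mile radius above could be approximated with a great-circle distance check. The sketch below is an assumption about how such a filter might work, not a description of how the UCLA dataset was built; the Cairo coordinates and the helper names are illustrative.

```python
import math

CAIRO = (30.0444, 31.2357)  # approximate (lat, lon); an assumption for illustration
EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(a, b):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


def near_cairo(point, radius_miles=200.0):
    """True if `point` lies within `radius_miles` of the Cairo reference point."""
    return haversine_miles(CAIRO, point) <= radius_miles


if __name__ == "__main__":
    alexandria = (31.2001, 29.9187)
    khartoum = (15.5007, 32.5599)
    print(near_cairo(alexandria), near_cairo(khartoum))
```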
Remembrance of Tweets/Links Past
• UCLA’s dataset about the Egyptian revolution
• 25% of tweets contain references to external resources
(web pages, images, videos, etc.)
Remembrance of Tweets/Links Past
• UCLA’s dataset about the Egyptian revolution
• 20% of references are dead, after less than 4 years (!!!)
Remembrance of Tweets/Links Past
http://yfrog.com/h02gvclj
HTTP GET
200 OK
HTTP HEAD
204 No Content
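The slide shows one URI answering 200 OK to GET and 204 No Content to HEAD. One reading is that a single status code is not a reliable liveness signal. A minimal sketch of a checker that compares both methods follows; the category labels and function names are hypothetical, and the rules are a simplification.

```python
import urllib.error
import urllib.request


def fetch_status(uri, method, timeout=10):
    """Return the final HTTP status code for one request, or None on network failure."""
    request = urllib.request.Request(uri, method=method)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions but still carry a status code.
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def classify(get_status, head_status):
    """Label a link from its GET and HEAD status codes (illustrative rules only)."""
    if get_status is None:
        return "unreachable"
    if get_status >= 400:
        return "dead"
    if head_status is not None and head_status != get_status:
        # GET claims success, but HEAD disagrees: flag for manual review.
        return "suspect"
    return "alive"


def check(uri):
    """Probe `uri` with both methods and classify the result."""
    return classify(fetch_status(uri, "GET"), fetch_status(uri, "HEAD"))
```

Under these rules the yfrog example would be labeled "suspect" rather than "alive".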
Remembrance of Tweets/Links Past
• UCLA’s dataset about the Egyptian revolution
• 20% of references are dead AND
• 60% of these are not archived
http://wayback.archive-it.org/all/20110203083908/http://yfrog.com/h02gvclj
This one is!
discovered via #memento
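The archived copy above follows the common Wayback-style layout of a 14-digit capture timestamp followed by the original URI. As a sketch, the snippet below splits such a URI apart and formats the capture time as an `Accept-Datetime` header value, the header the Memento protocol (RFC 7089) uses for datetime negotiation. The regular expression and function names are assumptions; no TimeGate is contacted.

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumed layout: <archive prefix>/<YYYYMMDDhhmmss>/<original URI>
MEMENTO_PATTERN = re.compile(
    r"^(?P<prefix>https?://.+?/)(?P<stamp>\d{14})/(?P<original>.+)$"
)


def parse_memento_uri(uri_m):
    """Split a Wayback-style archived URI into (capture datetime in UTC, original URI)."""
    match = MEMENTO_PATTERN.match(uri_m)
    if match is None:
        raise ValueError("not a Wayback-style archived URI: " + uri_m)
    captured = datetime.strptime(match.group("stamp"), "%Y%m%d%H%M%S")
    return captured.replace(tzinfo=timezone.utc), match.group("original")


def accept_datetime_header(moment):
    """Format a timezone-aware datetime as an Accept-Datetime request header."""
    utc_moment = moment.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(utc_moment, usegmt=True)}


if __name__ == "__main__":
    uri_m = "http://wayback.archive-it.org/all/20110203083908/http://yfrog.com/h02gvclj"
    captured, original = parse_memento_uri(uri_m)
    print(original, accept_datetime_header(captured))
```

Sending that header to a Memento TimeGate for the original URI is how a client would ask for the capture closest to the tweet's own timestamp.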
Remembrance of Tweets/Links Past
URIs from Ed Summers’s Ferguson dataset
https://edsu.github.io/ferguson-urls/
pink == not archived
(Internet Archive)
28%
Remembrance of Tweets/Links Past
http://babylon.library.ucla.edu/mklein/archived.html
Part 2: Rapid, Adaptive
Analysis
https://srogers.cartodb.com/viz/64f6c0f4-745d-11e4-b4e1-0e4fddd5de28/public_map
Part 3: Collection-Agnostic Linking
On TV news: Egypt, Tahrir, Cairo
On Twitter: #jan25, #tahrir, #egypt
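The two vocabularies above only partly overlap: #tahrir and #egypt match caption words directly, while #jan25 does not. One way to link across collections, sketched here under stated assumptions, is to normalize hashtags and expand them through a small alias table before intersecting with caption terms. The alias table and function names are hypothetical, not part of the presented system.

```python
import re

# Hypothetical alias table: hashtags with no surface match in broadcast captions
# are mapped by hand to the terms a news caption would likely use.
ALIASES = {"jan25": {"egypt", "tahrir", "cairo"}}


def hashtag_terms(tweet_text):
    """Lowercased hashtags from a tweet, expanded with any known aliases."""
    tags = {tag.lower() for tag in re.findall(r"#(\w+)", tweet_text)}
    terms = set(tags)
    for tag in tags:
        terms |= ALIASES.get(tag, set())
    return terms


def caption_terms(caption_text):
    """Lowercased word tokens from a caption segment."""
    return {word.lower() for word in re.findall(r"\w+", caption_text)}


def shared_terms(tweet_text, caption_text):
    """Terms that connect a tweet to a caption segment; empty set means no link."""
    return hashtag_terms(tweet_text) & caption_terms(caption_text)


if __name__ == "__main__":
    print(shared_terms("Crowds growing #jan25 #tahrir",
                       "Protesters fill Tahrir Square in Cairo"))
```

A fuller version would likely also require the tweet and the broadcast segment to fall within the same time window.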
Raiders of the Lost Links
Challenges and opportunities:
• Legal frameworks for sharing and preserving tweets and linked
resources
• Collaborations and partnerships to ensure momentum, sustainability
• Expansion to other forms of (social) media
Lazy Digital Archivists: Your Time is Up
Todd Grappone
Sharon Farb
Martin Klein
Peter Broadwell