@peterbroadwell and @mart1nkle1n
Linking Born-Digital News and Social Media Collections via Automated Entity Detection and Authority Matching
DLF Forum, Vancouver, 28 October 2015
Peter Broadwell (@peterbroadwell)
Martin Klein (@mart1nkle1n)
University of California Los Angeles
Research Library
Collections related to news events
• Collected by researchers
• Donated by activists
• Diverse in format:
• Images, audio, video, scanned documents, social media, web server logs
International Digitizing Ephemera
http://digital.library.ucla.edu/dep/
• 829 digitally recorded Iranian dissident news programs
• 9,166 other videos from the Iranian Green Movement
• 29,441 digital photographs from the Green Movement
• 543 documents from Tahrir Square
International Digitizing Ephemera – Tweets
• Tahrir Square Egypt & Libya unrest, 2011
• Tōhoku earthquake and tsunami, Japan, 2011
• AirAsia 8501 crash, December 2014
• Charlie Hebdo shooting, January 2015
Social Feed Manager
http://social-feed-manager.readthedocs.org/
NewsScape
• 264,000 hours of TV news archived digitally
• Recorded 2005-present, ca. 100 shows/day
• 13 countries, 9 languages
• 38 networks
• Searchable by captions, on-screen text, official transcripts
Social Local Global
Linking social media, TV news, and web news
Collection on the AirAsia QZ8501 crash of 12/28/2014, with TV and social media recorded through 1/17/2015
• 7.3 million tweets containing #AirAsia or #QZ8501
• 1.3 million distinct users
• 262 distinct television recordings
• 1,535 on-air mentions of AirAsia or [QZ]8501
• ~3,000 on-screen appearances of AirAsia or [QZ]8501
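Mention counts like those on this slide could be produced by a simple pattern match over caption text or tweets. The following is a minimal sketch, assuming that the notation [QZ]8501 stands for the flight number with an optional QZ prefix. The pattern and the function name count_mentions are illustrative assumptions, not the query the authors ran.

```python
import re

# Assumed reading of the slide's notation: "AirAsia" (optionally written with
# a space), or the flight number 8501 with an optional "QZ" prefix.
MENTION = re.compile(r"\bair\s?asia\b|\b(?:qz\s?)?8501\b", re.IGNORECASE)


def count_mentions(text):
    """Count non-overlapping mentions of the airline or flight number."""
    return len(MENTION.findall(text))
```

Because the word boundaries sit on both sides of the flight number, longer digit strings such as 85010 are not counted, while hashtags such as #QZ8501 are.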
Linking via Automated Entity Detection
• Discover and highlight commonalities and relationships between disjoint collections on related news events
• Link to authorities
• Address problem of disambiguation
• Establish workflow for automatic linking
• Integration with search and discovery interfaces
• Exposure via APIs
CNN, 09/16/2015, 05:22pm
Twitter, 09/16/2015, 06:22pm
Experiment 1/3
• Apply DBpedia Spotlight Named Entity Recognition (NER) software to collections on second GOP presidential primary debate on 09/16/2015
• Twitter: 800,000 tweets
• TV: CNN coverage of debate
• Minute granularity
• Persons, Organizations, Places
Results:
• Linked entities with URIs to DBpedia resources
• Visualization of correlations between entities
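The entity-linking step can be sketched as follows. This is a hedged illustration, not the authors' pipeline: it assumes a DBpedia Spotlight service reachable at the URL in SPOTLIGHT_URL and a JSON reply whose "Resources" entries carry "@URI" and a comma-separated "@types" string. The function names and the per-minute grouping are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from collections import Counter

# Assumed endpoint of a DBpedia Spotlight service; substitute your own deployment.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

# DBpedia ontology classes for the three categories named on the slide.
CATEGORIES = {
    "DBpedia:Person": "Person",
    "DBpedia:Organisation": "Organization",
    "DBpedia:Place": "Place",
}


def annotate(text, confidence=0.5):
    """Send text to the Spotlight service and return the decoded JSON reply."""
    query = urllib.parse.urlencode({"text": text, "confidence": confidence})
    request = urllib.request.Request(
        SPOTLIGHT_URL + "?" + query, headers={"Accept": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def linked_entities(reply):
    """Yield (DBpedia URI, category) for each resource in a wanted category."""
    for resource in reply.get("Resources", []):
        types = resource.get("@types", "").split(",")
        for dbpedia_class, category in CATEGORIES.items():
            if dbpedia_class in types:
                yield resource["@URI"], category
                break


def count_by_minute(timed_replies):
    """Map each minute label to a Counter of entity URIs seen in that minute."""
    counts = {}
    for minute, reply in timed_replies:
        bucket = counts.setdefault(minute, Counter())
        bucket.update(uri for uri, _ in linked_entities(reply))
    return counts
```

Running annotate on each minute of captions and on the tweets posted in the same minute, then passing the replies to count_by_minute, would give two per-minute tables keyed by the same DBpedia URIs, which is what makes the two collections directly comparable.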
Experiment 1/3 - Persons
Experiment 1/3 - Places
Experiment 1/3 - Organizations
Experiment 2/3
• Expanded range of TV news coverage up to 4 days after the debate on 17 local, U.S., and international channels
Results:
• Discovery of related news shows by matching terms and entities from Twitter
• Visualization highlighting degree of relationships
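One simple way to discover related shows is to count how many distinct terms or entities each show shares with the Twitter collection and rank the shows by that count. The sketch below assumes this shared-count measure; the slides do not state which measure the authors used, and the names match_score and rank_shows are hypothetical.

```python
def match_score(tweet_entities, show_entities):
    """Number of distinct entities (or terms) present in both collections."""
    return len(set(tweet_entities) & set(show_entities))


def rank_shows(tweet_entities, shows):
    """Return (show id, score) pairs for shows sharing at least one entity.

    `shows` maps a show identifier to the entities found in its transcript.
    Results are sorted by descending score, then by show id for stability.
    """
    scored = [
        (show_id, match_score(tweet_entities, entities))
        for show_id, entities in shows.items()
    ]
    related = [(show_id, score) for show_id, score in scored if score > 0]
    return sorted(related, key=lambda pair: (-pair[1], pair[0]))
```

The score doubles as the "degree of relationship" that a visualization could encode, for example as edge weight between the Twitter collection and each show.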
Experiment 2/3 – Terms matched
Experiment 2/3 – Persons matched
Experiment 3/3
• Automatic geocoding of extracted place names from Twitter and CNN coverage
Results:
• Using geographical proximity to explore potentially relevant correlations
• Visualization of places/regions and their frequency of reference in each collection
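Geographical proximity can be explored by pairing places from the two collections that lie within some distance of each other. The sketch below assumes the place names have already been geocoded to (latitude, longitude) pairs by a geocoder or gazetteer of your choice, and uses the haversine great-circle distance. The 100 km default threshold and the function names are illustrative assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def nearby_pairs(places_a, places_b, max_km=100.0):
    """Return (name_a, name_b, distance_km) for places within max_km.

    Each argument maps a place name to its (lat, lon) coordinates.
    Pairs are sorted from nearest to farthest.
    """
    pairs = []
    for name_a, coord_a in places_a.items():
        for name_b, coord_b in places_b.items():
            distance = haversine_km(coord_a, coord_b)
            if distance <= max_km:
                pairs.append((name_a, name_b, distance))
    return sorted(pairs, key=lambda pair: pair[2])
```

This lets a place mentioned only on Twitter be associated with a nearby place mentioned only on TV, which exact name matching would miss.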
Experiment 3/3 – Twitter
Experiment 3/3 – NewsScape
Next steps
• Apply techniques to other DL collections
• Explore special domain customization of NER tools and authority matching
• Investigate methods to quantify collection overlap
• Incorporate more linked open data ontologies
• Improve support of other languages