computational journalism projects

Post on 13-May-2015


DESCRIPTION

Presentation to Duke University computer science students, February 2012, by Sarah Cohen, Knight Professor of the Practice

TRANSCRIPT

Reporterslab.org

Presentation for computational journalism students

February 2012

STRUCTURED DATA...

And most reporters’ inability to deal with it

New York Times reporters used Microsoft Word searches and annotations to analyze WikiLeaks documents in 2010 and 2011.

PANDA project trying to help gather data inside newsrooms

Barriers to structured data analysis in the newsroom

• Expensive

• Too hard to collect

• It takes practice

• It takes patience

• Once collected, data has a short shelf life – its value inside the newsroom effectively ends once a story is published.

Web-scraping software: ephemeral or too expensive for a task not viewed as mission-critical.

Solutions

• User-friendly tool for scraping websites for structured data

• Packages of algorithms from fraud and other forensic fields for use with public records datasets online.

• Packages of queries and statistical tests for money, dates, geographical identifiers, names and codes, presented in standard English

• Tools for fuzzy matching of datasets: include scoring, best match likelihood, interactive machine learning for different datasets.
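The fuzzy-matching idea in the last bullet could look something like the Python sketch below. It is a minimal illustration, not a tool described in the presentation. The function names `normalize` and `best_match`, the 0.8 threshold, and the sample names are all assumptions made for the example. The score comes from the standard library's `difflib.SequenceMatcher`, whose `ratio()` is a string-similarity value between 0 and 1; it stands in for a real match likelihood.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, replace punctuation with spaces, and sort the words
    so that 'Smith, John' and 'John Smith' compare as equal."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(sorted(cleaned.split()))


def best_match(name, candidates, threshold=0.8):
    """Score every candidate against `name` and return (candidate, score).

    The candidate is None when the best score falls below `threshold`
    (a hypothetical cut-off; a real project would tune it per dataset).
    """
    if not candidates:
        return None, 0.0
    target = normalize(name)
    scored = [
        (SequenceMatcher(None, target, normalize(c)).ratio(), c)
        for c in candidates
    ]
    score, candidate = max(scored)
    return (candidate if score >= threshold else None), score


if __name__ == "__main__":
    donors = ["John A Smith", "Jon Smythe", "Jane Smith"]
    print(best_match("Smith, John A.", donors))
```

Returning the score along with the match lets a reporter review borderline pairs by hand instead of trusting the cut-off blindly.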

TOO MUCH MATERIAL

With too little information

Too many sources with too little news

• Twitter, Facebook, LinkedIn and other social media

• RSS feeds from other news organizations and blogs

• Press releases from government agencies or beat subjects

Lack of archiving is just as troubling as the lack of structure. Reporters can’t hold the powerful accountable without information from the past.

Solutions

• Archiving users’ feeds locally or in the cloud

• Mash up social media and RSS feeds into an app that reveals more insight into the sources

• Formalize each reporter’s definition of “news” through machine learning

• Alerts for important source material. Example: changing time of a press conference.
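The press-conference example suggests a simple pattern: keep an archived snapshot of a source's items and compare each new fetch against it. The Python sketch below illustrates that pattern under stated assumptions. The function name `detect_changes`, the field names (`title`, `start_time`, `location`), and the sample item are hypothetical; a real feed would have to be parsed into this shape first.

```python
def detect_changes(previous, current,
                   watched_fields=("title", "start_time", "location")):
    """Compare two snapshots of feed items, each a dict keyed by a stable ID.

    Returns a list of alert strings for new, changed, and removed items.
    """
    alerts = []
    for item_id, item in current.items():
        old = previous.get(item_id)
        if old is None:
            alerts.append(f"NEW: {item.get('title', item_id)}")
            continue
        for field in watched_fields:
            if old.get(field) != item.get(field):
                alerts.append(
                    f"CHANGED {field}: {item.get('title', item_id)} "
                    f"({old.get(field)!r} -> {item.get(field)!r})"
                )
    for item_id, old in previous.items():
        if item_id not in current:
            alerts.append(f"REMOVED: {old.get('title', item_id)}")
    return alerts


if __name__ == "__main__":
    # Hypothetical archived snapshot and a later fetch of the same source.
    archived = {"pc-1": {"title": "Mayor press conference",
                         "start_time": "2012-02-10 14:00"}}
    latest = {"pc-1": {"title": "Mayor press conference",
                       "start_time": "2012-02-10 10:30"}}
    for alert in detect_changes(archived, latest):
        print(alert)
```

The same archived snapshots also serve the first bullet: once saved locally or in the cloud, they form the record of the past that reporters need.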

UNUSABLE RECORDS

The buried treasure

Solutions

• Visual extractor of data from scanned forms

• Separate scanned boxes of documents into their pieces for further analysis

• Use speech recognition tools on government audio and video

• OCR video to find the speaker at a hearing

ANTIQUATED METHODS

For unstructured data

Our way

• Hand-enter individual items into spreadsheets

• Transcribe interviews, hearings and other audio and video content for searching

• Read each document

A newer way

• Use web scraping and paid crowdsourcing (e.g., Amazon Mechanical Turk) for data entry

• Use speech recognition for the first pass on searchable audio and video

• Use clustering, information extraction and other methods for overview of documents
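To make the clustering bullet concrete, here is a rough Python sketch of one possible approach: weight words by TF-IDF, compare documents by cosine similarity, and group them in a single greedy pass. This is not the method used by any of the projects named in this presentation. The function names, the 0.3 threshold, and the sample headlines are assumptions, and production tools use more robust algorithms.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using smoothed IDF."""
    token_lists = [tokenize(d) for d in docs]
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            t: c * (math.log((1 + n) / (1 + doc_freq[t])) + 1)
            for t, c in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs, threshold=0.3):
    """Greedy single pass: join the first cluster whose first document
    is similar enough, otherwise start a new cluster.
    Returns lists of document indexes."""
    vectors = tfidf_vectors(docs)
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


if __name__ == "__main__":
    headlines = [
        "City council approves budget for road repairs",
        "Council budget vote funds road repairs in city",
        "Hospital inspection finds nursing staff shortages",
        "State inspection cites hospital over nursing shortages",
    ]
    print(cluster(headlines))
```

With these four sample headlines, the two council stories land in one group and the two hospital stories in another, the kind of overview that lets a reporter decide which pile of documents to read first.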

Reporterslab.org working to tame audio and video

Associated Press project to bring order to unstructured data

Wordseer for historical text

Jigsaw

REPORTERSLAB.ORG

Creating sample data and documents for researchers based on real stories
