
Detecting Patterns in News Media Content

Ilias Flaounas

University of Bristol

January 19, 2010

I. Flaounas (University of Bristol) January 19, 2010 1 / 57


Overview

1 Introduction

2 Automating News Content Analysis
   Visualisation of Mediasphere
   Demo: Found In Translation

3 Findings
   Detection of Biases
   Predicting Popular Articles
   The Structure of the EU Mediasphere

4 Conclusions


Introduction

The global media system has an important role in democracy, commerce and culture.

News outlets form a vast, complex and interconnected information system.

The system operates on a global scale, with information being generated, gathered, processed, and distributed in various ways before it reaches the final users.


Terminology

News Outlet or News-media: a source that reports news, such as a newspaper, a journal, or a TV or radio station.

News-item or article: a news piece reported in a news outlet that refers to a specific event.

Story: a collection of news items that refer to the same event.

Mediasphere: the collective ecology of the world’s media.

Corpus: a collection of news items.

Coding: the manual annotation of news-items.


Traditional approach

The analysis of news-media content is a research domain of social scientists, but traditional studies have many limitations:

few outlets per study (< 10)

small numbers of news-items (a few hundred in the best cases)

short time periods (a few days)

news-items from a single country’s media

manual annotation (‘coding’)

reliance on commercial databases, such as LexisNexis, and their constraints

research is fully hypothesis-driven


Examples of traditional studies

Papers published in recent issues of the Journal of Communication:

“A total of 529 stories from NBC Nightly News and 322 stories aired on Special Report about Iraq, and 64 and 47, respectively, about Afghanistan were analysed by two coders”
S. Aday, “Chasing the Bad News: An Analysis of 2005 Iraq and Afghanistan War Coverage on NBC and Fox News Channel”, J. of Com. 60, 144-164 (2010).

“Our corpus of data consisted of Channel 2’s broadcasts on the eve of MDHH between 7:30 p.m. and midnight in the years 1994-2007 [...]. All 278 items aired on the 14 examined evenings were coded.”
O. Meyers et al., “Prime Time Commemoration: An Analysis of Television Broadcasts on Israel’s Memorial Day for the Holocaust and the Heroism”, J. of Com. 59, 456-480 (2009).


Example of Coding Scheme

These questionnaires have to be completed manually.

The same questionnaire has to be completed by more than one coder for the same news items.

This is a fully hypothesis-driven research model.


But nowadays...

Most media offer their content online in a convenient form.


Research Focus

In our research we undertake a large-scale analysis of the textual content of traditional news media using automated techniques.

‘Large-scale’ since we analyse hundreds of outlets, typically for extended periods of time, involving millions of news items.

‘Traditional news-media’ since we do not focus on modern online-only means of spreading news, such as blogs or Twitter.

‘Textual’ since we use only the textual information of news rather than analysing e.g. images, videos, or speech.

‘Automated’ in the sense that the analysis is performed by applying Artificial Intelligence techniques rather than using human ‘coders’.


Relevant Work & Datasets

Europe Media Monitor (EMM)

‘Lydia’ system

Newsblaster

NewsInEssence

Google News, Yahoo! News

LexisNexis

Public Corpora: Reuters, New York Times

We are highly interested in studying the media system per se.


Automating News Content Analysis

Some automation is possible using AI approaches:
◮ Machine Learning
◮ Data Mining
◮ Natural Language Processing

Some questions about the media system can be answered for the first time.

We apply methods that work efficiently and reliably on large-scale data.

We can attempt a data-driven research model.


Methods Summary

RSS parsing, Web page content scraping

Text Preprocessing: Stemming, stop-words removal, TF-IDF, ...

Tagging: Support Vector Machines.

Clustering: Best Reciprocal Hit

Ranking: SVM rank

Words selection: Lasso, SVMs

Network reconstruction: χ2-test ...

NLP: Statistical Machine Translation, Sentiment Analysis, Readability

Data Visualisation: Multidimensional Scaling, Spring Embedding, ...

Statistics: correlations, significance tests...


Building & Annotating the Corpus

Our corpus in numbers:

> 1300 multilingual news sources

> 3000 news feeds

133 countries

22 languages

> 3 years of continuous monitoring

40K news items per day

> 30M news items in total


NOAM: News Outlets Analysis & Monitoring system

NOAM enables us to query the corpus at a semantic level.


Statistical Machine Translation¹

We apply a phrase-based Statistical Machine Translation (SMT) approach to translate the non-English articles into English.

We use Moses, a complete phrase-based translation toolkit for academic purposes.

We translate all non-English articles of 21 EU languages into English.

For each language pair, an instance of Moses is trained using Europarl data and the JRC-Acquis Multilingual Parallel Corpus.

We make the working assumption that SMT does not significantly alter the geometry of the news corpus in the vector-space representation.

¹ Acknowledgements to Marco Turchi for implementing the SMT module.


Articles Per Day

[Figure: number of articles per day (y-axis, 0.6 to 2.2 ×10^4) against the number of days since June 1st, 2009 (x-axis, 0 to 250).]

We observe:

A seven-day cycle.

Local minima during weekend days.
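One simple way to check for such a weekly cycle is to look at the autocorrelation of the daily counts. The sketch below is only an illustration: the counts are synthetic (not the corpus figures), and the function names `autocorrelation` and `dominant_period` are hypothetical helpers, not part of the system described in these slides.

```python
def autocorrelation(series, lag):
    """Autocorrelation of `series` with a copy of itself shifted by `lag` days."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    covariance = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance


def dominant_period(series, max_lag=14):
    """Lag between 1 and max_lag with the highest autocorrelation."""
    return max(range(1, max_lag + 1), key=lambda lag: autocorrelation(series, lag))


# Synthetic daily article counts (assumption): five busy weekdays followed by
# two quiet weekend days, repeated for twelve weeks.
synthetic_counts = [20000, 20500, 21000, 20800, 20200, 9000, 8500] * 12

period = dominant_period(synthetic_counts)
print("dominant period in days:", period)
```

On real data the peak would be less clean, so it may be worth inspecting the whole autocorrelation curve rather than only its maximum.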


Clustering Articles

The Best Reciprocal Hit method:
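The slide's illustration did not survive extraction. As a rough sketch of the general idea, two articles from different outlets can be paired when each is the other's most similar article. The code below assumes articles are sparse term-weight dictionaries compared by cosine similarity; the function names and the `min_sim` parameter are illustrative choices, not details taken from the talk.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_hit(query, candidates):
    """Index of the candidate most similar to `query`."""
    return max(range(len(candidates)), key=lambda j: cosine(query, candidates[j]))


def best_reciprocal_hits(outlet_a, outlet_b, min_sim=0.0):
    """Pairs (i, j) where outlet_a[i] and outlet_b[j] are each other's best match."""
    if not outlet_a or not outlet_b:
        return []
    pairs = []
    for i, article in enumerate(outlet_a):
        j = best_hit(article, outlet_b)
        # Require a non-trivial similarity and a match in the reverse direction.
        if cosine(article, outlet_b[j]) > min_sim and best_hit(outlet_b[j], outlet_a) == i:
            pairs.append((i, j))
    return pairs


# Toy articles (assumption): term -> weight dictionaries.
outlet_a = [{"tiger": 1, "woods": 1, "crash": 1}, {"dubai": 1, "debt": 1}]
outlet_b = [
    {"dubai": 1, "world": 1, "debt": 1},
    {"tiger": 1, "woods": 1, "ticket": 1},
    {"weather": 1, "snow": 1},
]
print(best_reciprocal_hits(outlet_a, outlet_b))
```

Requiring the match in both directions keeps an article with no real counterpart, such as the weather item here, from being attached to an unrelated story.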


Outlets Per Story

[Figure: log-log plot of the number of stories (y-axis, 10^1 to 10^6) against the number of outlets per story (x-axis, 10^0 to 10^2).]

Few stories are covered by lots of media, and lots of stories are covered by few media.


The Global Mediasphere

[Figure: network of outlets with 543 nodes and 4783 edges, coloured by country.]


Support Vector Machines as Topic Taggers

We train on data from two well-accepted corpora:
◮ Reuters
◮ NY Times

Typical text preprocessing: stemming, stop-word removal, bag-of-words (TF-IDF) representation...

Two-class SVMs

Cosine kernel

Maximize F0.5-Score on unseen data

Train to recognise 14 interesting news topics
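To make the preprocessing and kernel steps concrete, here is a minimal sketch of a TF-IDF bag-of-words representation and a cosine kernel between two documents. It is an illustration under simplifying assumptions: the stop-word list is a tiny made-up one, stemming is skipped, the IDF variant is the plain log(N / df), and the function names are hypothetical. The SVM training itself is not shown.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to"}  # tiny illustrative list


def tokenise(text):
    """Lower-case, split on whitespace and drop stop-words (no stemming here)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (term -> weight dict) per text."""
    docs = [tokenise(t) for t in texts]
    doc_freq = Counter(term for doc in docs for term in set(doc))
    n_docs = len(docs)
    return [
        {term: count * math.log(n_docs / doc_freq[term])
         for term, count in Counter(doc).items()}
        for doc in docs
    ]


def cosine_kernel(u, v):
    """Cosine kernel: dot product of the two vectors after length normalisation."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = tfidf_vectors([
    "the price of oil rises",
    "the price of oil falls",
    "the team wins the match",
])
print(cosine_kernel(vectors[0], vectors[1]))  # related documents: above zero
print(cosine_kernel(vectors[0], vectors[2]))  # no shared terms: zero
```

A matrix of such kernel values over the training documents is the kind of input a kernel SVM works from; one two-class SVM per topic would then be trained on it.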


SVM Taggers

     Topic              Corpus     F0.5-Score   Precision   Recall
 1   SPORTS             Reuters    97.78        98.31       95.75
 2   MARKETS            Reuters    92.02        94.09       84.63
 3   FASHION            Reuters    83.88        94.61       71.27
 4   DISASTERS          Reuters    83.4         87.69       70.34
 5   ART                NY Times   81.67        84.9        71.38
 6   BUSINESS           NY Times   81.16        86.23       65.87
 7   INFLATION-PRICES   Reuters    77.01        81.45       63.38
 8   RELIGION           NY Times   74.95        83.57       53.59
 9   POLITICS           NY Times   73.81        76.65       64.81
10   SCIENCE            Reuters    73.63        83.72       50.62
11   WEATHER            Reuters    71.43        82.91       46.84
12   PETROLEUM          Reuters    70.67        75.14       58.73
13   ELECTIONS          Reuters    70.32        78.99       49.32
14   ENVIRONMENT        NY Times   64.29        73.48       43.7
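For reference, the F0.5 score is the standard F-beta measure with beta = 0.5, which weights precision more heavily than recall. The sketch below shows the textbook formula; the function name `f_beta` is illustrative. Applying it to the rounded precision and recall in the table will not necessarily reproduce the tabulated F0.5 values, for instance if those were averaged over several test sets (an assumption; the slides do not say how they were computed).

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 favours precision, beta > 1 favours recall."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# With beta = 0.5, losing precision costs more than losing the same amount of recall.
high_precision = f_beta(1.0, 0.5)  # perfect precision, half recall
high_recall = f_beta(0.5, 1.0)     # half precision, perfect recall
print(high_precision, high_recall)
```

Favouring precision suits a tagger whose output feeds later statistics: a wrongly tagged article pollutes a topic's sample, whereas a missed article only shrinks it.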


Demo: Found In Translation

We implemented a demo to demonstrate the state of the art in various disciplines of modern Artificial Intelligence.

We compare the EU countries according to what topics their media choose to cover.

Every day we machine-translate the content of 640 EU media.

We annotate it using SVMs.

We compare EU countries’ media content based on their Top-10 media.


http://foundintranslation.enm.bris.ac.uk


Detection of Biases

The goal is to measure typical biases among different topics as they are presented in the media:

Readability

Linguistic Subjectivity

Popularity

Gender Bias

Corpus:

500 English-language media

10 months (Jan. 1st, 2010 – Oct. 31st, 2011)

2.5M articles that appeared in the main feed


Readability

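We measure readability based on the Flesch Reading Ease Test (FRET).

The higher the FRET score, the easier the text is to read.

Scores typically range from 0 to 100.

10K items per topic

$$\mathrm{FRET}(\text{article}) = 206.835 - 1.015 \cdot \mathrm{ASL} - 84.6 \cdot \mathrm{ASW}$$

where ASL is the average sentence length (words per sentence) and ASW is the average number of syllables per word.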
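As an illustration of how this score can be computed, here is a small sketch. The formula is the one on the slide; the sentence splitter and the syllable counter are crude heuristics chosen for brevity (assumptions, not the tools used in the study), so absolute scores will differ from those of a careful implementation.

```python
import re


def count_syllables(word):
    """Rough heuristic: one syllable per group of consecutive vowels, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Flesch Reading Ease: 206.835 - 1.015 * ASL - 84.6 * ASW."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    asl = len(words) / len(sentences)                          # words per sentence
    asw = sum(count_syllables(w) for w in words) / len(words)  # syllables per word
    return 206.835 - 1.015 * asl - 84.6 * asw


easy = flesch_reading_ease("The cat sat. The dog ran.")
hard = flesch_reading_ease(
    "Multidimensional statistical methodologies facilitate comprehensive "
    "characterisation of international communication infrastructures."
)
print(easy, hard)
```

Very short sentences of one-syllable words can score above 100 and dense technical prose can score below 0, which is why the 0 to 100 range is best read as typical rather than strict.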


Linguistic Subjectivity

We measure the percentage of sentimental adjectives over the total number of adjectives.

Adjectives are detected by the Stanford POS tagger.

For each adjective, we check for the presence of a SentiWordNet sentiment score > 0.25.
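The measure above can be sketched in a few lines once the text has been POS-tagged and a sentiment lexicon is available. In the sketch below both inputs are stand-ins: `tagged_tokens` imitates tagger output using Penn Treebank tags (adjective tags start with `JJ`), and `sentiment_scores` is a hypothetical dictionary playing the role of a SentiWordNet lookup. How positive and negative scores are combined into a single value is not stated in the slides and is left to whoever builds that dictionary.

```python
def linguistic_subjectivity(tagged_tokens, sentiment_scores, threshold=0.25):
    """Fraction of adjectives whose sentiment score exceeds `threshold`."""
    adjectives = [word.lower() for word, tag in tagged_tokens if tag.startswith("JJ")]
    if not adjectives:
        return 0.0
    sentimental = [w for w in adjectives if sentiment_scores.get(w, 0.0) > threshold]
    return len(sentimental) / len(adjectives)


# Hand-tagged toy sentence and made-up lexicon scores (assumptions).
tagged_tokens = [
    ("The", "DT"), ("terrible", "JJ"), ("storm", "NN"), ("hit", "VBD"),
    ("the", "DT"), ("northern", "JJ"), ("coast", "NN"),
]
sentiment_scores = {"terrible": 0.625, "northern": 0.0}

print(linguistic_subjectivity(tagged_tokens, sentiment_scores))
```

The function returns a fraction; multiply by 100 to obtain the percentage mentioned on the slide.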


Validation of Linguistic Subjectivity?

This is a challenge due to the lack of a gold standard.

A subset of leading news-outlets, ranked by their linguistic subjectivity.

Rank   Outlet
 1     BBC
 2     Times
 3     NY Times
 4     The Guardian
 5     CBS
 6     Daily Telegraph
 7     Daily Star (T)
 8     Independent
 9     Daily Mail (T)
10     Daily Mirror (T)
11     Newsweek
12     The Sun (T)


Popularity

We measure the conditional probability of an article becoming popular given its topic.

We track 16 English-language outlets that provide a “Most popular” feed.

◮ In total 108,516 articles were popular.

◮ 36,788 articles were popular and appeared in the main feed.

$$P(\mathrm{Pop} \mid \mathrm{Topic}) = \frac{P(\mathrm{Topic} \mid \mathrm{Pop}) \cdot P(\mathrm{Pop})}{P(\mathrm{Topic})}$$
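A small worked example may help to see how the three terms combine. The counts below are invented for illustration (they are not the corpus figures), and the function name is hypothetical.

```python
def popularity_given_topic(n_topic_and_popular, n_popular, n_topic, n_total):
    """P(Pop | Topic) via Bayes' rule, with each term estimated from counts."""
    p_topic_given_pop = n_topic_and_popular / n_popular
    p_pop = n_popular / n_total
    p_topic = n_topic / n_total
    return p_topic_given_pop * p_pop / p_topic


# Invented counts: 1000 articles, 100 of them popular, 50 tagged with the topic,
# and 20 that are both popular and tagged with the topic.
p = popularity_given_topic(
    n_topic_and_popular=20, n_popular=100, n_topic=50, n_total=1000
)
print(p)
```

When all counts come from one collection, the result equals the share of topic articles that are popular (20 / 50 here). The Bayes form is convenient when the terms are estimated from different sources, for example the topic distribution among popular articles from the popular feed and the overall topic distribution from the main feed.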


Scatter plot of Topics

[Figure: six scatter plots of the 14 topics (ART, BUSINESS, DISASTERS, ELECTIONS, ENVIRONMENT, FASHION, MARKETS, PETROLEUM, POLITICS, PRICES, RELIGION, SCIENCE, SPORTS, WEATHER), one for each pair of measures, given as y-axis against x-axis: Readability against Popularity; Linguistic Subjectivity against Popularity; Gender Bias against Popularity; Linguistic Subjectivity against Readability; Gender Bias against Readability; Gender Bias against Linguistic Subjectivity.]


Scatter plot of Outlets

[Figure: Gender Bias (y-axis, 2.5 to 5) against Linguistic Subjectivity (x-axis, 0.02 to 0.045) for the outlets BBC, CBS, Daily Mail, Daily Mirror, Daily Star, Daily Telegraph, Independent, Newsweek, NY Times, The Guardian, The Sun and Times.]


[Figure: Gender Bias (y-axis, 2.5 to 5) against Readability (x-axis, 30 to 60) for the same twelve outlets.]


[Figure: Linguistic Subjectivity (y-axis, 0.02 to 0.045) against Readability (x-axis, 30 to 60) for the same twelve outlets.]


Predicting Popular Articles

Editors want to know what stories their readers would like to read. Can we predict which stories will become popular?

Modelling this question as:

a simple binary classification problem leads to very low performance;

a ranking problem can lead to promising results. This is because popularity is a relative concept, not an absolute one.


Predicting the Popular Articles

Month-by-month predictions, using Ranking SVM.

6 months of data.

Accuracy is the fraction of positive/negative (popular/non-popular) pairs of articles that are ranked in the correct order.
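The pairwise notion of accuracy can be sketched as follows. The scores stand for the output of any ranking function (for example a Ranking SVM); the example values and the function name are made up for illustration, and counting ties as errors is a choice made here rather than something stated in the slides.

```python
def pairwise_accuracy(scores, labels):
    """Fraction of (popular, non-popular) pairs where the popular article scores higher."""
    positives = [s for s, popular in zip(scores, labels) if popular]
    negatives = [s for s, popular in zip(scores, labels) if not popular]
    if not positives or not negatives:
        raise ValueError("need at least one popular and one non-popular article")
    # Ties are counted as incorrectly ordered pairs.
    correct = sum(1 for p in positives for n in negatives if p > n)
    return correct / (len(positives) * len(negatives))


# Made-up ranking scores for four articles; True marks the popular ones.
example_scores = [0.9, 0.2, 0.6, 0.4]
example_labels = [True, False, False, True]
print(pairwise_accuracy(example_scores, example_labels))
```

Because only the relative order within a pair matters, the measure fits the view of popularity as a relative concept: a random scorer would be expected to order about half of the pairs correctly.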


Most Popular Articles

Titles of most popular articles per outlet as ranked using Ranking SVMs for December 2009.

Outlet: Titles of Top-3 Articles

CBS: (1) Sources: Elin Done with Tiger (2) Tiger Woods Slapped with Ticket for Crash (3) Tiger Woods: I let my Family Down

Florida Times-Union: (1) Pizza delivery woman killed on Westside (2) A family’s search for justice, 15 years later (3) Rants & Raves: Napolitano unqualified

NY Times: (1) Poor Children Likelier to Get Antipsychotics (2) Surf’s Up, Way Up, and Competitors Let Out a Big Mahalo (3) Grandma’s Gifts Need Extra Reindeer

Reuters: (1) Dubai says not responsible for Dubai World debt (2) Boeing Dreamliner touches down after first flight (3) Iran’s Ahmadinejad mocks Obama, “TV series” nuke talks

Seattle Post: (1) Hospital: Actress Brittany Murphy dies at age 32 (2) Actor Charlie Sheen arrested in Colorado (3) Charlie Sheen accused of using weapon in Aspen


EU Mediasphere

Top-10 media outlets per country

over the 27 EU countries

in 22 different languages

for a 6-month period

A total of 1.3M news items.

What patterns can we find using modern AI techniques?


The EU Mediasphere

Co-coverage network: we link two outlets if they share more stories than expected by chance (χ² scores).

This network has 203 nodes and 6702 edges.
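One way to turn "more stories than expected by chance" into a number is a χ² test on a 2×2 table of story coverage for a pair of outlets. The sketch below uses the standard 2×2 χ² statistic; representing each outlet by a set of story identifiers, and the function names, are assumptions made for illustration rather than details given in the slides.

```python
def chi_square_2x2(both, only_a, only_b, neither):
    """Pearson chi-square statistic for a 2x2 contingency table."""
    n = both + only_a + only_b + neither
    row_a, row_not_a = both + only_a, only_b + neither
    col_b, col_not_b = both + only_b, only_a + neither
    denominator = row_a * row_not_a * col_b * col_not_b
    if denominator == 0:
        return 0.0
    return n * (both * neither - only_a * only_b) ** 2 / denominator


def co_coverage_score(stories_a, stories_b, all_stories):
    """Chi-square score if the outlets share more stories than expected, else 0."""
    both = len(stories_a & stories_b)
    only_a = len(stories_a - stories_b)
    only_b = len(stories_b - stories_a)
    neither = len(all_stories - stories_a - stories_b)
    expected_shared = len(stories_a) * len(stories_b) / len(all_stories)
    if both <= expected_shared:
        return 0.0
    return chi_square_2x2(both, only_a, only_b, neither)


# Toy example: 100 stories, each outlet covers 20, and they share 10
# (4 shared stories would be expected by chance).
all_stories = set(range(100))
outlet_a_stories = set(range(0, 20))
outlet_b_stories = set(range(10, 30))
score = co_coverage_score(outlet_a_stories, outlet_b_stories, all_stories)
print(score)
```

With one degree of freedom, a score above 3.84 corresponds to p < 0.05 and a score above 10.83 to p < 0.001. Raising the threshold on the score is one way to obtain sparser networks.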


A bit sparser....

This network has 197 nodes, 3386 edges and 3 connected components. Singleton nodes are omitted.


What kind of connected components are formed?

We go as sparse as possible, using modularity maximization as the stopping criterion.

The probability that two non-singleton nodes from the same country end up in the same connected component is 82.9% (p < 0.001).

Nationality is the major underlying criterion for which stories media outlets choose to publish.

We will therefore work at the country level rather than the outlet level.


Which are the strongest connections between countries?

We go as sparse as possible while keeping the network connected.

This network has 27 nodes and 112 edges.
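One plausible reading of "as sparse as possible while keeping the network connected" is to remove the weakest edges one at a time and stop just before the network would fall apart. The sketch below implements that reading with a small union-find connectivity check; the procedure, the function names and the toy weights are assumptions, since the slides do not spell out the exact method.

```python
def is_connected(nodes, edges):
    """True if the (node_a, node_b, weight) edges link all nodes into one component."""
    parent = {node: node for node in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, _weight in edges:
        parent[find(a)] = find(b)
    return len({find(node) for node in nodes}) == 1


def sparsest_connected(nodes, weighted_edges):
    """Drop the weakest edges one by one, stopping before the graph disconnects."""
    kept = sorted(weighted_edges, key=lambda edge: edge[2], reverse=True)
    while kept and is_connected(nodes, kept[:-1]):
        kept.pop()
    return kept


# Toy country network with made-up similarity weights.
toy_nodes = ["A", "B", "C"]
toy_edges = [("A", "B", 5.0), ("B", "C", 3.0), ("A", "C", 1.0)]
print(sparsest_connected(toy_nodes, toy_edges))
```

This is equivalent to raising a single threshold on the edge weights until one more step would isolate part of the network, so every edge that survives is at least as strong as the weakest edge needed for connectivity.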


Can we explain relations of countries?

We found a significant (p < 0.001) correlation between countries’ media-content similarity and their:

Geographical proximity, based on sharing of borders: 33.86%

Economic proximity, based on trade volume: 31.03%

Cultural proximity, based on song contest voting patterns: 32.05%

UK Metro, Dec. 8, 2010: “Countries that always vote for each other in the Eurovision song contest, have a shared interest in news content, as well as terrible music, a study has shown [...]”


How ‘close’ are countries, based on common media interests?

We use χ²-scores as similarities and project countries onto a 2D plane using Multidimensional Scaling.

We colour the Eurozone members in blue.

These countries are closer to the centre, that is, the average EU-media content.


Ranking of countries

Based on their deviation from average EU media content (in 26D space).

Rank   Country          Euro   Accession Year
 1     France           Y      1957
 2     Austria          Y      1995
 3     Germany          Y      1957
 4     Greece           Y      1981
 5     Ireland          Y      1973
 6     Cyprus           Y      2004
 7     Slovenia         Y      2004
 8     Spain            Y      1986
 9     Slovakia         Y      2004
10     Italy            Y      1957
11     Belgium          Y      1957
12     Luxembourg       Y      1957
13     Bulgaria         N      2007
14     Netherlands      Y      1957
15     U. Kingdom       N      1973
16     Finland          Y      1995
17     Sweden           N      1995
18     Poland           N      2004
19     Estonia          N      2004
20     Denmark          N      1973
21     Portugal         Y      1986
22     Malta            Y      2004
23     Czech Republic   N      2004
24     Romania          N      2007
25     Latvia           N      2004
26     Hungary          N      2004
27     Lithuania        N      2004


Any other important factors?

Correlations of countries’ deviation from average EU media content.

Factor               Correlation (%)   p-value
In Eurozone           70.65            <0.001
Accession Year       -49.32             0.009
GDP 2008              44.75             0.020
Population            23.05             0.247
Area                  15.63             0.435
Population Density     7.45             0.712

The first three factors are significant (p < 0.05), while the rest are not.
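The slides do not say which correlation coefficient or significance test produced these figures. As one possible illustration, the sketch below computes a Pearson correlation and estimates a two-sided p-value with a permutation test; both choices, the function names and the toy data are assumptions.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient (inputs must not be constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Share of random shufflings of ys that correlate at least as strongly with xs."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


# Toy data (assumption): a factor that tracks the ranking perfectly.
toy_factor = list(range(10))
toy_deviation = [2 * x + 1 for x in toy_factor]
print(pearson(toy_factor, toy_deviation))
print(permutation_p_value(toy_factor, toy_deviation))
```

A yes/no factor such as Eurozone membership can be encoded as 1/0 before calling `pearson`. With only 27 countries, a permutation test avoids relying on distributional assumptions about the data.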


Discussion

EU media editors independently made a multitude of small editorial decisions, which shaped the contents of the EU mediasphere in a way that reflects its deep geographic, economic and cultural relations.

Detecting these subtle signals in a statistically rigorous way would be out of reach of the traditional methods of social scientists.

This analysis demonstrates the power of the available methods for significant automation of media content analysis.


Conclusions

Several tasks can be automated using modern AI techniques.
◮ However, the high-precision and sophisticated annotation that human ‘coders’ can achieve is not yet reachable.

Research can be conducted across multiple languages and countries.

A study can run over a long period of time and involve millions of articles.

We can tackle questions that could not be answered previously.
◮ e.g. Which country’s media have the most negative coverage of environmental news that also mentions Barack Obama?

In the social sciences, the analysis of news media is done largely by hand in a hypothesis-driven fashion. Is it time for the social sciences to also adopt a data-driven research model?


Page 87: Detecting Patterns in News Media Content

Future Work

Ideas under development:

Use of features such as images / audio / video.

How does statistical machine translation (SMT) affect supervised/unsupervised learning?

Compare the US and the EU mediaspheres

...


Page 88: Detecting Patterns in News Media Content

Work from other members of the group

Suffix trees - Detection of memes

Named-entity detection & disambiguation

Twitter - Event detection

Summarisation of news

Online learning algorithms for news annotation


Page 89: Detecting Patterns in News Media Content

References 1

I. Flaounas, M. Turchi, O. Ali, N. Fyson, T. De Bie, N. Mosdell, J. Lewis and N. Cristianini: “The Structure of the EU Mediasphere”, PLoS ONE, Vol. 5(12), pp. e14243, 2010.

I. Flaounas, M. Turchi, T. De Bie and N. Cristianini: “Inference and Validation of Networks”, ECML/PKDD, Springer, LNCS, Vol. 5782(1), pp. 344–358, Bled, Slovenia, 2009.

O. Ali, I. Flaounas, T. De Bie, N. Mosdell, J. Lewis and N. Cristianini: “Automating News Content Analysis: An Application to Gender Bias and Readability”, JMLR W & CP: Workshop on Applications of Pattern Analysis (WAPA), Vol. 11, pp. 36–43, Windsor, UK, 2010.

M. Turchi, I. Flaounas, O. Ali, T. De Bie, T. Snowsill and N. Cristianini: “Found in Translation”, ECML/PKDD, Springer, LNCS, Vol. 5782(2), pp. 746–749, Bled, Slovenia, 2009.


Page 90: Detecting Patterns in News Media Content

References 2

I. Flaounas, N. Fyson and N. Cristianini: “Predicting Relations in News-Media Content among EU Countries”, 2nd International Workshop on Cognitive Information Processing (CIP), IEEE, pp. 269–274, Elba, Italy, 2010.

E. Hensinger, I. Flaounas and N. Cristianini: “Learning the Preferences of News Readers with SVM and Lasso Ranking”, Artificial Intelligence Applications and Innovations, Springer, pp. 179–186, Larnaca, Cyprus, 2010.

T. Snowsill, I. Flaounas, T. De Bie and N. Cristianini: “Detecting events in a million New York Times articles”, ECML/PKDD, Springer, LNCS, Vol. 6323(3), pp. 615–618, Barcelona, Spain, 2010.

I. Flaounas, M. Turchi and N. Cristianini: “Detecting Macro-Patterns in the European Mediasphere”, IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, pp. 527–530, Milano, Italy, 2009.


Page 91: Detecting Patterns in News Media Content

More info at: http://mediapatterns.enm.bris.ac.uk

Thank you!
