Detecting Patterns in News Media Content

Ilias Flaounas

University of Bristol

January 19, 2010

I. Flaounas (University of Bristol) January 19, 2010 1 / 57

Overview

1 Introduction

2 Automating News Content Analysis
   Visualisation of Mediasphere
   Demo: Found In Translation

3 Findings
   Detection of Biases
   Predicting Popular Articles
   The Structure of the EU Mediasphere

4 Conclusions

Introduction

The global media system has an important role in democracy, commerce and culture.

News outlets form a vast, complex and interconnected information system.

The system operates on a global scale, with information being generated and gathered, processed, and distributed in various ways before it reaches the final users.

Terminology

News Outlet or News-media: a source that reports news, such as a newspaper, a journal, a TV or radio station...

News-item or article: a news piece reported in a news outlet that refers to a specific event.

Story: a collection of news items that refer to the same event.

Mediasphere: the collective ecology of the world’s media.

Corpus: a collection of news items.

Coding: the manual annotation of news-items.

Traditional approach

Analysis of news-media content is traditionally a research domain of social scientists, but their studies have many limitations:

few outlets per study (< 10)

small numbers of news-items (a few hundred at best)

short time periods (a few days)

news-items from a single country’s media

manual annotation – ‘coding’

reliance on commercial databases such as LexisNexis and their constraints

research is fully hypothesis-driven

Examples of traditional studies

Papers published in recent issues of the Journal of Communication:

“A total of 529 stories from NBC Nightly News and 322 stories aired on Special Report about Iraq, and 64 and 47, respectively, about Afghanistan were analysed by two coders”
S. Aday, “Chasing the Bad News: An Analysis of 2005 Iraq and Afghanistan War Coverage on NBC and Fox News Channel”, J. of Com. 60, 144-164 (2010).

“Our corpus of data consisted of Channel 2’s broadcasts on the eve of MDHH between 7:30 p.m. and midnight in the years 1994-2007 [...]. All 278 items aired on the 14 examined evenings were coded.”
O. Meyers et al. “Prime Time Commemoration: An Analysis of Television Broadcasts on Israel’s Memorial Day for the Holocaust and the Heroism”, J. of Com. 59, 456-480 (2009).

Example of Coding Scheme

These questionnaires have to be completed manually.

The same questionnaire has to be completed by more than one coder for the same news items.

This is a fully hypothesis-driven research model.

But nowadays...

Most media offer their content online in a convenient form.

Research Focus

In our research we undertake a large-scale traditional news-media textual content analysis using automated techniques.

‘Large-scale’ since we analyse hundreds of outlets, typically for extended periods of time, involving millions of news items.

‘Traditional news-media’ since we do not focus on modern online-only means of spreading news, such as blogs or Twitter.

‘Textual’ since we use only the textual information of news rather than analysing e.g. images, videos, or speech.

‘Automated’ in the sense that the analysis is performed by applying Artificial Intelligence techniques rather than using human ‘coders’.

Relevant Work & Datasets

Europe Media Monitor (EMM)

‘Lydia’ system

Newsblaster

NewsInEssence

Google News, Yahoo! News

LexisNexis

Public Corpora: Reuters, New York Times

We are highly interested in studying the media system per se.

Automating News Content Analysis

Some automation is possible using AI approaches:
◮ Machine Learning
◮ Data Mining
◮ Natural Language Processing

Some questions about the media system can be answered for the first time.

We apply methods that work efficiently and reliably on large-scale data.

We can attempt a data-driven research model.

Methods Summary

RSS parsing, Web page content scraping

Text Preprocessing: Stemming, stop-words removal, TF-IDF, ...

Tagging: Support Vector Machines.

Clustering: Best Reciprocal Hit

Ranking: SVM rank

Word selection: Lasso, SVMs

Network reconstruction: χ² test, ...

NLP: Statistical Machine Translation, Sentiment Analysis, Readability

Data Visualisation: Multidimensional Scaling, Spring Embedding, ...

Statistics: correlations, significance tests...

Building & Annotating the Corpus

Our corpus in numbers:

> 1300 multilingual news sources

> 3000 news feeds

133 countries

22 languages

> 3 years of continuous monitoring

40K news items per day

> 30M news items in total

NOAM: News Outlets Analysis & Monitoring system

NOAM enables us to query the corpus at a semantic level.

Statistical Machine Translation

We apply a phrase-based Statistical Machine Translation (SMT) approach to translate the non-English articles into English.

We use Moses, a complete phrase-based translation toolkit for academic purposes.

We translate all non-English articles of 21 EU languages into English.

For each language pair, an instance of Moses is trained using Europarl data and the JRC-Acquis Multilingual Parallel Corpus.

We make the working assumption that SMT does not significantly alter the geometry of the news corpus in the vector-space representation.

Acknowledgements to Marco Turchi for implementing the SMT module.

Articles Per Day

[Figure: number of articles per day; x-axis: days starting from June 1st, 2009]

We observe:

A seven-day cycle

Local minima during weekend days.

Clustering Articles

The Best Reciprocal Hit method:
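The slide's illustration of the method is not reproduced here. As a minimal sketch, assuming the usual definition of a best reciprocal hit (two articles from different outlets are linked when each is the other's most similar article in the opposite outlet), the pairing step could look as follows. The function names, the word-overlap (Jaccard) similarity and the sample headlines are illustrative assumptions, not the system's actual implementation.

```python
def jaccard(a, b):
    """Word-overlap similarity between two texts (stand-in for a TF-IDF cosine)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def best_reciprocal_hits(outlet_a, outlet_b, sim=jaccard):
    """Return index pairs (i, j) where article i of outlet_a and article j of
    outlet_b are each other's most similar article."""
    if not outlet_a or not outlet_b:
        return []
    # Best hit in B for every article of A, and vice versa.
    best_in_b = [max(range(len(outlet_b)), key=lambda j: sim(a, outlet_b[j]))
                 for a in outlet_a]
    best_in_a = [max(range(len(outlet_a)), key=lambda i: sim(outlet_a[i], b))
                 for b in outlet_b]
    # Keep only the reciprocal pairs.
    return [(i, j) for i, j in enumerate(best_in_b) if best_in_a[j] == i]


# Invented headlines for two hypothetical outlets.
outlet_a = ["volcano ash cloud grounds flights",
            "election results announced today"]
outlet_b = ["flights grounded by volcano ash cloud",
            "football final tonight",
            "final election results announced"]
pairs = best_reciprocal_hits(outlet_a, outlet_b)
```

In practice a minimum-similarity threshold would probably also be needed, so that articles with no real counterpart are not paired; that refinement is omitted here.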

Outlets Per Story

[Figure: number of outlets per story]

Few stories are covered by lots of media, and lots of stories are covered by few media.

The Global Mediasphere

543 nodes, 4783 edges, coloured by country

Support Vector Machines as Topic Taggers

We train on data from two well-accepted corpora:
◮ Reuters
◮ NY Times

Typical text preprocessing: Stemming, stop-words removal, bag-of-words (TF-IDF) representation...

Two-class SVMs

Cosine kernel

Maximize F0.5-Score on unseen data

Train to recognise 14 interesting news topics
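To make the representation concrete, here is a minimal sketch of a TF-IDF bag-of-words and the cosine kernel computed on it. It uses only the Python standard library; stemming and stop-word removal are omitted, the three toy documents are invented, and the exact TF-IDF weighting used in the study is not stated on the slide, so the `tf * log(N / df)` variant below is an assumption.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenised document to a sparse {term: tf * idf} dictionary."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]


def cosine_kernel(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    "team wins cup final".split(),
    "team loses cup final".split(),
    "markets fall on oil prices".split(),
]
vecs = tfidf_vectors(docs)
k_sports = cosine_kernel(vecs[0], vecs[1])  # two sports-like documents
k_mixed = cosine_kernel(vecs[0], vecs[2])   # no shared terms
```

A matrix of such kernel values between training documents is the kind of input a two-class SVM with a precomputed kernel would consume.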

SVM Taggers

#   Topic             Corpus    F0.5-Score  Precision  Recall
1   SPORTS            Reuters   97.78       98.31      95.75
2   MARKETS           Reuters   92.02       94.09      84.63
3   FASHION           Reuters   83.88       94.61      71.27
4   DISASTERS         Reuters   83.4        87.69      70.34
5   ART               NY Times  81.67       84.9       71.38
6   BUSINESS          NY Times  81.16       86.23      65.87
7   INFLATION-PRICES  Reuters   77.01       81.45      63.38
8   RELIGION          NY Times  74.95       83.57      53.59
9   POLITICS          NY Times  73.81       76.65      64.81
10  SCIENCE           Reuters   73.63       83.72      50.62
11  WEATHER           Reuters   71.43       82.91      46.84
12  PETROLEUM         Reuters   70.67       75.14      58.73
13  ELECTIONS         Reuters   70.32       78.99      49.32
14  ENVIRONMENT       NY Times  64.29       73.48      43.7
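For reference, the F0.5-score combines precision and recall with extra weight on precision. The sketch below uses the standard F-beta definition; the function name `f_beta` is ours, and the inputs are the SPORTS row of the table.

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall (in %) of the SPORTS tagger, taken from the table.
sports_f05 = f_beta(98.31, 95.75)
```

The result is close to the tabulated 97.78; small differences may come from rounding or from how the scores were averaged.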

Demo: Found In Translation

We implemented a demo to demonstrate the state of the art in various disciplines of modern Artificial Intelligence.

We compare the EU countries according to what topics their media choose to cover.

Every day we machine-translate the content of 640 EU media outlets

Annotate the articles using SVMs

Compare EU countries' media content based on their Top-10 media outlets

http://foundintranslation.enm.bris.ac.uk

Detection of Biases

The goal is to measure typical biases among different topics as they are presented in the media:

Readability

Linguistic Subjectivity

Popularity

Gender Bias

Corpus:

500 English-language media

10 months, (Jan. 1st, 2010 – Oct 31st, 2011)

2.5M articles that appeared in the main feed

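Readability

We measure readability based on the Flesch Reading Ease Test (FRET).

The higher the FRET score, the easier the text is to read.

Scores range from 0–100.

10K items per topic

FRET(article) = 206.835 − 1.015 · ASL − 84.6 · ASW

where ASL is the average sentence length (words per sentence) and ASW is the average number of syllables per word.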
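A minimal sketch of the Flesch Reading Ease computation, using only the standard library. The syllable counter is a crude vowel-group heuristic and the sentence splitter is a simple regular expression; both are assumptions of this sketch, and the slides do not say how syllables and sentences were actually counted.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: number of vowel groups, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Flesch Reading Ease: 206.835 - 1.015 * ASL - 84.6 * ASW."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    asl = len(words) / len(sentences)                          # words per sentence
    asw = sum(count_syllables(w) for w in words) / len(words)  # syllables per word
    return 206.835 - 1.015 * asl - 84.6 * asw


easy = flesch_reading_ease("The cat sat.")
hard = flesch_reading_ease("Intergovernmental negotiations deteriorated considerably.")
```

Short sentences of one-syllable words score high, while long polysyllabic words push the score down.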

Linguistic Subjectivity

We measure the percentage of sentimental adjectives over the total number of adjectives.

Adjectives are detected by the Stanford POS tagger.

For each adjective, we check for the presence of a SentiWordNet sentiment score > 0.25.
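The ratio itself is simple to compute once the text is tagged. The sketch below assumes the tagger output is available as (word, tag) pairs with Penn Treebank tags, and uses a plain dictionary as a hypothetical stand-in for a SentiWordNet lookup; the scores in it are made up for illustration.

```python
def linguistic_subjectivity(tagged_tokens, sentiment_scores, threshold=0.25):
    """Fraction of adjectives whose sentiment score exceeds the threshold.

    tagged_tokens: (word, POS tag) pairs; Penn Treebank adjective tags
        start with 'JJ' (JJ, JJR, JJS).
    sentiment_scores: hypothetical {word: score} lookup standing in for
        SentiWordNet.
    """
    adjectives = [w.lower() for w, tag in tagged_tokens if tag.startswith("JJ")]
    if not adjectives:
        return 0.0
    sentimental = [w for w in adjectives
                   if sentiment_scores.get(w, 0.0) > threshold]
    return len(sentimental) / len(adjectives)


tagged = [("The", "DT"), ("shocking", "JJ"), ("report", "NN"),
          ("criticised", "VBD"), ("the", "DT"), ("annual", "JJ"),
          ("budget", "NN")]
scores = {"shocking": 0.6, "annual": 0.0}  # made-up values for illustration
ratio = linguistic_subjectivity(tagged, scores)
```

Here one of the two adjectives passes the threshold, so half of the adjectives count as sentimental.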

Validation of Linguistic Subjectivity?

This is a challenge due to the lack of a gold standard.

A subset of leading news outlets, ranked by their linguistic subjectivity:

Rank  Outlet
1     BBC
2     Times
3     NY Times
4     The Guardian
5     CBS
6     Daily Telegraph
7     Daily Star (T)
8     Independent
9     Daily Mail (T)
10    Daily Mirror (T)
11    Newsweek
12    The Sun (T)

Popularity

We measure the conditional probability of an article becoming popular given its topic.

We track 16 English-language outlets that provide a “Most popular” feed.

◮ In total 108,516 articles were popular.

◮ 36,788 articles were popular and appeared in the main feed.

P(Pop | Topic) = P(Topic | Pop) · P(Pop) / P(Topic)
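As a worked illustration of the formula, the probabilities can be estimated from article counts. The function name and the counts below are hypothetical and are not taken from the study.

```python
def p_popular_given_topic(n_total, n_popular, n_topic, n_topic_and_popular):
    """Estimate P(Pop | Topic) from article counts via Bayes' rule."""
    p_pop = n_popular / n_total
    p_topic = n_topic / n_total
    p_topic_given_pop = n_topic_and_popular / n_popular
    return p_topic_given_pop * p_pop / p_topic


# Hypothetical counts: 1000 articles, 50 popular, 200 on the topic,
# 20 both popular and on the topic.
p = p_popular_given_topic(n_total=1000, n_popular=50, n_topic=200,
                          n_topic_and_popular=20)
```

With these counts the estimate reduces to 20 / 200, the share of on-topic articles that became popular.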

Scatter plot of Topics

[Figures: pairwise scatter plots of the topics (ART, BUSINESS, DISASTERS, ELECTIONS, ENVIRONMENT, FASHION, MARKETS, PETROLEUM, POLITICS, PRICES, RELIGION, SCIENCE, SPORTS, WEATHER); recoverable axis labels: Popularity, Readability]

Scatter plot of Outlets

[Figures: scatter plots of the outlets (CBS, Daily Mail, Daily Mirror, Daily Star, Daily Telegraph, Independent, Newsweek, NY Times, The Guardian, The Sun); recoverable axis label: Readability]

Predicting Popular Articles

Editors want to know what stories their readers would like to read. Can we predict which stories will become popular?

Modelling this question as

a simple binary classification problem leads to very low performance.

a ranking problem can lead to promising results. This is because popularity is a relative concept, not an absolute one.

Predicting the Popular Articles

Month-by-month predictions, using Ranking SVM.

6 months of data.

Accuracy is the fraction of positive/negative (popular/non-popular) article pairs that are ordered correctly.
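A minimal sketch of this pairwise accuracy, assuming the ranker assigns a real-valued score to every article. The function name and the scores are hypothetical; how ties were handled in the study is not stated, so here a tie counts as an error.

```python
def pairwise_accuracy(popular_scores, other_scores):
    """Fraction of (popular, non-popular) pairs in which the popular article
    receives the strictly higher score."""
    pairs = [(p, o) for p in popular_scores for o in other_scores]
    if not pairs:
        raise ValueError("need at least one article of each kind")
    return sum(p > o for p, o in pairs) / len(pairs)


# Hypothetical ranking scores for one outlet and one month.
acc = pairwise_accuracy(popular_scores=[0.9, 0.4], other_scores=[0.5, 0.2, 0.1])
```

Five of the six pairs are ordered correctly in this toy case; only the popular article scored 0.4 falls below a non-popular one scored 0.5.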

Most Popular Articles

Titles of the most popular articles per outlet, as ranked using Ranking SVMs for December 2009.

Outlet and titles of its Top-3 articles:

CBS
  1. Sources: Elin Done with Tiger
  2. Tiger Woods Slapped with Ticket for Crash
  3. Tiger Woods: I let my Family Down

Florida Times-Union
  1. Pizza delivery woman killed on Westside
  2. A family’s search for justice, 15 years later
  3. Rants & Raves: Napolitano unqualified

NY Times
  1. Poor Children Likelier to Get Antipsychotics
  2. Surf’s Up, Way Up, and Competitors Let Out a Big Mahalo
  3. Grandma’s Gifts Need Extra Reindeer

Reuters
  1. Dubai says not responsible for Dubai World debt
  2. Boeing Dreamliner touches down after first flight
  3. Iran’s Ahmadinejad mocks Obama, "TV series" nuke talks

Seattle Post
  1. Hospital: Actress Brittany Murphy dies at age 32
  2. Actor Charlie Sheen arrested in Colorado
  3. Charlie Sheen accused of using weapon in Aspen

EU Mediasphere

Top-10 media outlets per country

over the 27 EU countries

in 22 different languages

for a six-month period

A total of 1.3M news items.

What patterns can we find using modern AI techniques?

The EU Mediasphere

Co-coverage network: We link two outlets if they share more stories than expected by chance (χ² scores).

This network has 203 nodes and 6702 edges.
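One way to read the linking rule is as a χ² test on a 2×2 contingency table per outlet pair (does outlet A cover the story, does outlet B cover it). The sketch below follows that reading; the function name, the toy counts and the cut-off (the 1-degree-of-freedom critical value at p = 0.001) are assumptions, since the slide only says "more than expected by chance".

```python
def chi2_cocoverage(n_both, n_only_a, n_only_b, n_neither):
    """Pearson chi-square statistic of the 2x2 table 'A covers story' vs
    'B covers story', plus a flag telling whether co-coverage exceeds
    its expected value under independence."""
    n = n_both + n_only_a + n_only_b + n_neither
    row_a, row_not_a = n_both + n_only_a, n_only_b + n_neither
    col_b, col_not_b = n_both + n_only_b, n_only_a + n_neither
    denom = row_a * row_not_a * col_b * col_not_b
    if denom == 0:
        return 0.0, False
    chi2 = n * (n_both * n_neither - n_only_a * n_only_b) ** 2 / denom
    expected_both = row_a * col_b / n
    return chi2, n_both > expected_both


CHI2_CRIT = 10.828  # critical value for 1 degree of freedom at p = 0.001

# Toy counts over 1000 stories (invented for illustration).
chi2, above_chance = chi2_cocoverage(n_both=30, n_only_a=20,
                                     n_only_b=10, n_neither=940)
linked = above_chance and chi2 > CHI2_CRIT
```

Raising the cut-off removes the weaker edges, which is how a sparser version of the network can be obtained.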

A bit sparser...

This network has 197 nodes, 3386 edges and 3 connected components. Singleton nodes are omitted.

What kind of connected components are formed?

We go as sparse as possible, using modularity maximization as the stopping criterion.

The probability that two non-singleton nodes from the same country end up in the same connected component is 82.9% (p < 0.001).

Nationality is the major underlying criterion of what stories media outlets choose to publish.
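The same-country statistic can be computed directly from a component assignment. This is a minimal sketch with invented outlet names, countries and component ids; how the reported p-value was obtained is not described on the slide and is not reproduced here.

```python
from itertools import combinations


def same_component_rate(country_of, component_of):
    """Among pairs of outlets from the same country, the fraction that share
    a connected component. Both arguments map outlet name -> label."""
    same_country = [(a, b) for a, b in combinations(sorted(country_of), 2)
                    if country_of[a] == country_of[b]]
    if not same_country:
        raise ValueError("no same-country pairs")
    together = sum(component_of[a] == component_of[b] for a, b in same_country)
    return together / len(same_country)


# Toy data: outlet names, countries and component ids are invented.
country_of = {"o1": "FR", "o2": "FR", "o3": "FR", "o4": "DE", "o5": "DE"}
component_of = {"o1": 0, "o2": 0, "o3": 1, "o4": 1, "o5": 1}
rate = same_component_rate(country_of, component_of)
```

In the toy data two of the four same-country pairs share a component.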

From here on we work at the country level rather than the outlet level.

Which are the strongest connections between countries?

We go as sparse as possible while keeping the network connected.

This network has 27 nodes and 112 edges.

Can we explain relations of countries?

We found a significant (p < 0.001) correlation of countries’ media-content similarity with their:

Geographical proximity (based on sharing of borders): 33.86%

Economic proximity (based on trade volume): 31.03%

Cultural proximity (based on song contest voting patterns): 32.05%

UK Metro, Dec. 8, 2010: Countries that always vote for each other in the Eurovision song contest, have a shared interest in news content, as well as terrible music, a study has shown [...]

How ‘close’ are countries, based on common media interests?

We use χ² scores as similarities and project the countries onto a 2D plane using Multidimensional Scaling.

We colour the Eurozone members in blue.

These countries are closer to the centre, that is, the average EU-media content.

Ranking of countries

Based on their deviation from average EU media content (in 26D space).

Rank  Country         Euro  Accession Year
1     France          Y     1957
2     Austria         Y     1995
3     Germany         Y     1957
4     Greece          Y     1981
5     Ireland         Y     1973
6     Cyprus          Y     2004
7     Slovenia        Y     2004
8     Spain           Y     1986
9     Slovakia        Y     2004
10    Italy           Y     1957
11    Belgium         Y     1957
12    Luxembourg      Y     1957
13    Bulgaria        N     2007
14    Netherlands     Y     1957
15    U. Kingdom      N     1973
16    Finland         Y     1995
17    Sweden          N     1995
18    Poland          N     2004
19    Estonia         N     2004
20    Denmark         N     1973
21    Portugal        Y     1986
22    Malta           Y     2004
23    Czech Republic  N     2004
24    Romania         N     2007
25    Latvia          N     2004
26    Hungary         N     2004
27    Lithuania       N     2004

Any other important factors?

Correlations of countries' deviation from average EU media content.

Factor              Correlation (%)  p-value
In Eurozone         70.65            <0.001
Accession Year      -49.32           0.009
GDP 2008            44.75            0.020
Population          23.05            0.247
Area                15.63            0.435
Population Density  7.45             0.712

The first three factors are significant (p < 0.05), while the rest are not.

Discussion

EU media editors independently made a multitude of small editorial decisions which shaped the contents of the EU mediasphere in a way that reflects its deep geographic, economic and cultural relations.

Detecting these subtle signals in a statistically rigorous way would be out of the reach of the traditional methods of social scientists.

This analysis demonstrates the power of the available methods for significant automation of media content analysis.

Conclusions

Several tasks can be automated using modern AI techniques.
◮ However, the high-precision, sophisticated annotation that human ‘coders’ can achieve is not yet reachable.

Research can be conducted across multiple languages/countries.

The study can run for a long period of time, involving millions of articles.

We can tackle questions that could not be answered previously.
◮ e.g. Which country’s media have the most negative coverage of environmental news that also mention Barack Obama?

In the social sciences, the analysis of news media is done largely by hand in a hypothesis-driven fashion. Is it time for social sciences to also adopt a data-driven research model?

Future Work

Ideas under development:

Use of features such as images / audio / video.

How does SMT affect the supervised/unsupervised learning?

Compare the US and the EU mediaspheres

Work from other members of the group

Suffix Tree - Detection of memes

Named Entities detection & disambiguation

Twitter - Events detection

Summarisation of news

Online learning algorithms for news annotation

References 1

I. Flaounas, M. Turchi, O. Ali, N. Fyson, T. De Bie, N. Mosdell, J. Lewis and N. Cristianini: “The Structure of the EU Mediasphere”, PLoS ONE, Vol. 5(12), pp. e14243, 2010.

I. Flaounas, M. Turchi, T. De Bie and N. Cristianini: “Inference and Validation of Networks”, ECML/PKDD, Springer, LNCS, Vol. 5782(1), pp. 344–358, Bled, Slovenia, 2009.

O. Ali, I. Flaounas, T. De Bie, N. Mosdell, J. Lewis and N. Cristianini: “Automating News Content Analysis: An Application to Gender Bias and Readability”, JMLR W & CP: Workshop on Applications of Pattern Analysis (WAPA), Vol. 11, pp. 36–43, Windsor, UK, 2010.

M. Turchi, I. Flaounas, O. Ali, T. De Bie, T. Snowsill and N. Cristianini: “Found in Translation”, ECML/PKDD, Springer, LNCS, Vol. 5782(2), pp. 746–749, Bled, Slovenia, 2009.

References 2

I. Flaounas, N. Fyson and N. Cristianini: “Predicting Relations in News-Media Content among EU Countries”, 2nd International Workshop on Cognitive Information Processing (CIP), IEEE, pp. 269–274, Elba, Italy, 2010.

E. Hensinger, I. Flaounas and N. Cristianini: “Learning the Preferences of News Readers with SVM and Lasso Ranking”, Artificial Intelligence Applications and Innovations, Springer, pp. 179–186, Larnaca, Cyprus, 2010.

T. Snowsill, I. Flaounas, T. De Bie and N. Cristianini: “Detecting events in a million New York Times articles”, ECML/PKDD, Springer, LNCS, Vol. 6323(3), pp. 615–618, Barcelona, Spain, 2010.

I. Flaounas, M. Turchi and N. Cristianini: “Detecting Macro-Patterns in the European Mediasphere”, IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, pp. 527–530, Milano, Italy, 2009.

More info at: http://mediapatterns.enm.bris.ac.uk

Thank you!
