Detecting Patterns in News Media Content
Ilias Flaounas
University of Bristol
January 19, 2010
I. Flaounas (University of Bristol) January 19, 2010 1 / 57
Overview
1 Introduction
2 Automating News Content Analysis
   Visualisation of Mediasphere
   Demo: Found In Translation
3 Findings
   Detection of Biases
   Predicting Popular Articles
   The Structure of the EU Mediasphere
4 Conclusions
Introduction
The global media system has an important role in democracy, commerce and culture.
News outlets form a vast, complex and interconnected information system.
The system operates on a global scale, with information being generated and gathered, processed, and distributed in various ways before it reaches the final users.
Terminology
News Outlet or News-media: a source that reports news, such as a newspaper, a journal, or a TV or radio station.
News-item or article: a news piece reported in a news outlet that refers to a specific event.
Story: a collection of news items that refer to the same event.
Mediasphere: the collective ecology of the world’s media.
Corpus: a collection of news items.
Coding: the manual annotation of news-items.
Traditional approach
The analysis of news-media content is traditionally a research domain of social scientists, but traditional studies have many limitations:
few outlets per study (< 10)
small numbers of news-items (a few hundred in the best cases)
short time periods (a few days)
news-items from a single country’s media
manual annotation – ‘coding’
reliance on commercial databases such as LexisNexis and their constraints
research that is fully hypothesis driven
Examples of traditional studies
Papers published in recent issues of the Journal of Communication:
“A total of 529 stories from NBC Nightly News and 322 stories aired on Special Report about Iraq, and 64 and 47, respectively, about Afghanistan were analysed by two coders.”
S. Aday, “Chasing the Bad News: An Analysis of 2005 Iraq and Afghanistan War Coverage on NBC and Fox News Channel”, J. of Com. 60, 144-164 (2010).
“Our corpus of data consisted of Channel 2’s broadcasts on the eve of MDHH between 7:30 p.m. and midnight in the years 1994-2007 [...]. All 278 items aired on the 14 examined evenings were coded.”
O. Meyers et al., “Prime Time Commemoration: An Analysis of Television Broadcasts on Israel’s Memorial Day for the Holocaust and the Heroism”, J. of Com. 59, 456-480 (2009).
Example of Coding Scheme
These questionnaires have to be completed manually.
The same questionnaire has to be completed by more than one coder for the same news items.
This is a fully hypothesis-driven research model.
But nowadays...
Most media offer their content online in a convenient form.
Research Focus
In our research we undertake a large-scale, traditional news-media, textual content analysis using automated techniques.
‘Large-scale’ since we analyse hundreds of outlets, typically for extended periods of time, involving millions of news items.
‘Traditional news-media’ since we do not focus on modern online-only means of spreading news, such as blogs or Twitter.
‘Textual’ since we use only the textual information of news rather than analysing e.g. images, videos, or speech.
‘Automated’ in the sense that the analysis is performed by applying Artificial Intelligence techniques rather than using human ‘coders’.
Relevant Work & Datasets
Europe Media Monitor (EMM)
‘Lydia’ system
Newsblaster
NewsInEssence
Google News, Yahoo! News
LexisNexis
Public Corpora: Reuters, New York Times
We are highly interested in studying the media system per se.
Automating News Content Analysis
Some automation is possible using AI approaches:
◮ Machine Learning
◮ Data Mining
◮ Natural Language Processing
Some questions about the media system can be answered for the first time.
We apply methods that work efficiently and reliably on large-scale data.
We can attempt a data-driven research model.
Methods Summary
RSS parsing, Web page content scraping
Text Preprocessing: Stemming, stop-words removal, TF-IDF, ...
Tagging: Support Vector Machines.
Clustering: Best Reciprocal Hit
Ranking: SVM rank
Words selection: Lasso, SVMs
Network reconstruction: χ²-test ...
NLP: Statistical Machine Translation, Sentiment Analysis, Readability
Data Visualisation: Multidimensional Scaling, Spring Embedding, ...
Statistics: correlations, significance tests...
Building & Annotating the Corpus
Our corpus in numbers:
> 1300 multilingual news sources
> 3000 news feeds
133 countries
22 languages
> 3 years of continuous monitoring
40K news items per day
> 30M news items in total
NOAM: News Outlets Analysis & Monitoring system
NOAM enables us to query the corpus at a semantic level.
Statistical Machine Translation¹
We applied a phrase-based Statistical Machine Translation (SMT) approach for translating the non-English articles to English.
We use Moses, a complete phrase-based translation toolkit for academic purposes.
We translate all non-English articles from 21 EU languages into English.
For each language pair, an instance of Moses is trained using Europarl data and the JRC-Acquis Multilingual Parallel Corpus.
We make the working assumption that SMT does not significantly alter the geometry of the news corpus in the vector-space representation.
¹ Acknowledgements to Marco Turchi for implementing the SMT module.
Articles Per Day
[Figure: number of articles per day (×10^4, roughly 0.6 to 2.2) against the number of days since June 1st, 2009 (0 to 250).]
We observe:
A seven-day cycle.
Local minima during weekend days.
Clustering Articles
The Best Reciprocal Hit method:
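The slide's diagram is not preserved in this transcript, so the following is only a minimal sketch of the idea, assuming the usual definition of a best reciprocal hit: two articles from different outlets are paired when each is the other's most similar article. The whitespace tokenisation, the raw term counts and the function names are illustrative assumptions, not the talk's implementation.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(weight * b[term] for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_reciprocal_hits(outlet_a, outlet_b):
    """Return index pairs (i, j) such that article i of outlet A and
    article j of outlet B are each other's most similar article."""
    if not outlet_a or not outlet_b:
        return []
    # Hypothetical preprocessing: lower-cased whitespace tokens, raw counts.
    vecs_a = [Counter(text.lower().split()) for text in outlet_a]
    vecs_b = [Counter(text.lower().split()) for text in outlet_b]
    sim = [[cosine(a, b) for b in vecs_b] for a in vecs_a]
    best_for_a = [max(range(len(vecs_b)), key=lambda j: sim[i][j])
                  for i in range(len(vecs_a))]
    best_for_b = [max(range(len(vecs_a)), key=lambda i: sim[i][j])
                  for j in range(len(vecs_b))]
    # Keep a pair only if the preference is mutual and the articles overlap at all.
    return [(i, j) for i, j in enumerate(best_for_a)
            if best_for_b[j] == i and sim[i][j] > 0]
```

Requiring the match to be mutual avoids attaching many weakly related articles to one dominant article, which a one-sided nearest-neighbour rule would allow.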
Outlets Per Story
[Figure: log-log plot of the number of stories against the number of outlets per story.]
Few stories are covered by lots of media, and lots of stories are covered by few media.
The Global Mediasphere
543 nodes, 4783 edges, coloured by country.
Support Vector Machines as Topic Taggers
We train on data from two well-accepted corpora:
◮ Reuters
◮ NY Times
Typical text preprocessing: Stemming, stop-words removal, bag-of-words (TF-IDF) representation...
Two-class SVMs
Cosine kernel
Maximize F0.5-Score on unseen data
Train to recognise 14 interesting news topics
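As a rough illustration of the representation and kernel named above, the sketch below builds TF-IDF bag-of-words vectors and evaluates a cosine kernel between them. It is an assumption-laden toy: stemming is omitted, the stop-word list is a tiny illustrative one, and the SVM training itself is not shown. The kernel values would be what a two-class SVM consumes.

```python
import math
from collections import Counter

# Illustrative stop-word list; the list used in the talk is not given.
STOP_WORDS = {"the", "a", "of", "in", "and", "to"}


def tokenize(text):
    """Lower-case, split on whitespace and drop stop words (no stemming here)."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def tfidf_vectors(documents):
    """Map each document to a sparse {term: tf * idf} dictionary."""
    tokenised = [tokenize(d) for d in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / doc_freq[t])
                        for t, c in counts.items()})
    return vectors


def cosine_kernel(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 for empty vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0
```

The cosine kernel normalises away document length, so a long and a short article on the same topic can still score as similar.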
SVM Taggers
     Topic             Corpus     F0.5-Score  Precision  Recall
 1   SPORTS            Reuters    97.78       98.31      95.75
 2   MARKETS           Reuters    92.02       94.09      84.63
 3   FASHION           Reuters    83.88       94.61      71.27
 4   DISASTERS         Reuters    83.4        87.69      70.34
 5   ART               NY Times   81.67       84.9       71.38
 6   BUSINESS          NY Times   81.16       86.23      65.87
 7   INFLATION-PRICES  Reuters    77.01       81.45      63.38
 8   RELIGION          NY Times   74.95       83.57      53.59
 9   POLITICS          NY Times   73.81       76.65      64.81
10   SCIENCE           Reuters    73.63       83.72      50.62
11   WEATHER           Reuters    71.43       82.91      46.84
12   PETROLEUM         Reuters    70.67       75.14      58.73
13   ELECTIONS         Reuters    70.32       78.99      49.32
14   ENVIRONMENT       NY Times   64.29       73.48      43.7
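For readers unfamiliar with the F0.5-score, here is a small sketch of the standard F-beta definition, in which beta = 0.5 weights precision more heavily than recall. It illustrates the definition only; it is not expected to reproduce every row of the table from the rounded precision and recall columns, since the slides do not say how the reported scores were aggregated.

```python
def f_beta(precision, recall, beta=0.5):
    """Standard F-beta score; beta < 1 favours precision over recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With precision 0.9 and recall 0.6, `f_beta(0.9, 0.6)` is about 0.818, whereas the balanced F1 (`beta=1`) is 0.72, which shows how the 0.5 setting rewards a tagger that is cautious but accurate.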
Demo: Found In Translation
We implemented a demo to demonstrate the state of the art in various disciplines of modern Artificial Intelligence.
We compare the EU countries according to the topics their media choose to cover.
Every day we machine-translate the content of 640 EU media outlets.
We annotate the articles using SVMs.
We compare the media content of EU countries based on each country's Top-10 media outlets.
http://foundintranslation.enm.bris.ac.uk
Detection of Biases
The goal is to measure typical biases among different topics as they are presented in the media:
Readability
Linguistic Subjectivity
Popularity
Gender Bias
Corpus:
500 English-language media
10 months (Jan. 1st, 2010 – Oct. 31st, 2010)
2.5M articles that appeared in the main feed
Readability
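We measure readability based on the Flesch Reading Ease Test (FRET).
The higher the FRET score, the easier the text is to read.
Scores typically range from 0 to 100.
10K items per topic
FRET(article) = 206.835 − (1.015 · ASL) − (84.6 · ASW), where ASL is the average sentence length in words and ASW is the average number of syllables per word.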
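The formula can be sketched in a few lines. The syllable counter below is a deliberately crude assumption (it counts groups of vowels), and the sentence splitter is equally naive, so the numbers are indicative only and are not those of the talk's implementation.

```python
import re


def count_syllables(word):
    """Very rough syllable estimate: number of vowel groups, at least one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text):
    """206.835 - 1.015 * ASL - 84.6 * ASW, with naive sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    asl = len(words) / len(sentences)                            # average sentence length
    asw = sum(count_syllables(w) for w in words) / len(words)    # average syllables per word
    return 206.835 - 1.015 * asl - 84.6 * asw
```

For example, "The cat sat on the mat." has six one-syllable words in one sentence, giving 206.835 − 6.09 − 84.6 = 116.145. Very simple text can therefore score above 100, even though typical news text falls well inside the 0 to 100 range.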
Linguistic Subjectivity
We measure the percentage of sentimental adjectives over the total number of adjectives.
Adjectives are detected with the Stanford POS tagger.
For each adjective, we check for the presence of a SentiWordNet sentiment score > 0.25.
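A minimal sketch of this ratio follows, assuming the text has already been POS-tagged with Penn Treebank tags (adjective tags start with `JJ`) and that sentiment scores are available as a plain dictionary. The dictionary is a hypothetical stand-in for a SentiWordNet lookup, and the example scores are invented for illustration.

```python
def linguistic_subjectivity(tagged_tokens, sentiment_scores, threshold=0.25):
    """Fraction of adjectives whose sentiment score exceeds the threshold.

    tagged_tokens: iterable of (word, pos_tag) pairs from any POS tagger.
    sentiment_scores: hypothetical {adjective: score} lookup table.
    """
    adjectives = [word.lower() for word, tag in tagged_tokens
                  if tag.startswith("JJ")]
    if not adjectives:
        return 0.0
    sentimental = [w for w in adjectives
                   if sentiment_scores.get(w, 0.0) > threshold]
    return len(sentimental) / len(adjectives)
```

In the sentence "The terrible storm hit the northern coast", with made-up scores of 0.625 for "terrible" and 0.0 for "northern", one of the two adjectives is sentimental, so the measure is 0.5.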
Validation of Linguistic Subjectivity?
This is a challenge due to the lack of a gold standard.
A subset of leading news outlets, ranked by their linguistic subjectivity:
Rank  Outlet
 1    BBC
 2    Times
 3    NY Times
 4    The Guardian
 5    CBS
 6    Daily Telegraph
 7    Daily Star (T)
 8    Independent
 9    Daily Mail (T)
10    Daily Mirror (T)
11    Newsweek
12    The Sun (T)
Popularity
We measure the conditional probability of an article becoming popular given its topic.
We track 16 English-language outlets that provide a “Most popular” feed.
◮ In total, 108,516 articles were popular.
◮ 36,788 articles were popular and appeared in the main feed.
P(Pop | Topic) = P(Topic | Pop) · P(Pop) / P(Topic)
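As a worked illustration of the Bayes-rule expression, the sketch below estimates each probability from raw counts. The function name and the counts in the example are invented for illustration and are not figures from the talk.

```python
def popularity_given_topic(n_articles, n_popular, n_topic, n_topic_and_popular):
    """Estimate P(Pop | Topic) via Bayes' rule from simple counts."""
    p_pop = n_popular / n_articles                      # P(Pop)
    p_topic = n_topic / n_articles                      # P(Topic)
    p_topic_given_pop = n_topic_and_popular / n_popular  # P(Topic | Pop)
    return p_topic_given_pop * p_pop / p_topic
```

With 1,000 articles, of which 100 are popular, 200 are on the topic and 40 are both, the estimate is 0.4 · 0.1 / 0.2 = 0.2, which matches the direct ratio 40 / 200 as it should.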
Scatter plot of Topics
[Figure: pairwise scatter plots of the 14 topics (ART, BUSINESS, DISASTERS, ELECTIONS, ENVIRONMENT, FASHION, MARKETS, PETROLEUM, POLITICS, PRICES, RELIGION, SCIENCE, SPORTS, WEATHER). Panels: Readability vs. Popularity; Linguistic Subjectivity vs. Popularity; Gender Bias vs. Popularity; Linguistic Subjectivity vs. Readability; Gender Bias vs. Readability; Gender Bias vs. Linguistic Subjectivity.]
Scatter plot of Outlets
[Figure: pairwise scatter plots of 12 outlets (BBC, CBS, Daily Mail, Daily Mirror, Daily Star, Daily Telegraph, Independent, Newsweek, NY Times, The Guardian, The Sun, Times). Panels: Gender Bias vs. Linguistic Subjectivity; Gender Bias vs. Readability; Linguistic Subjectivity vs. Readability.]
Predicting Popular Articles
Editors want to know which stories their readers would like to read. Can we predict which stories will become popular?
Modelling this question as
a simple binary classification problem leads to very low performance.
a ranking problem can lead to promising results. This is because popularity is a relative concept and not an absolute one.
Predicting the Popular Articles
Month-by-month predictions, using Ranking SVM.
6 months of data.
Accuracy is the fraction of positive/negative (popular/non-popular) pairs of articles that are ranked in the correct order.
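To make the pairwise view concrete, here is a small sketch of a linear ranker trained on the pairwise hinge loss, together with the pairwise accuracy described above. It is a simplified stand-in trained by plain subgradient descent, not the Ranking SVM solver used in the talk. The feature vectors, learning rate and regularisation constant are illustrative assumptions.

```python
def train_pairwise_ranker(popular, non_popular, epochs=50, lr=0.1, reg=0.01):
    """Learn weights w so that w.x is higher for popular than non-popular articles."""
    dim = len(popular[0])
    w = [0.0] * dim
    pairs = [(p, n) for p in popular for n in non_popular]
    for _ in range(epochs):
        for p, n in pairs:
            diff = [pi - ni for pi, ni in zip(p, n)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            for k in range(dim):
                # Subgradient of reg/2 * |w|^2 + max(0, 1 - w.diff)
                grad = reg * w[k] - (diff[k] if margin < 1 else 0.0)
                w[k] -= lr * grad
    return w


def pairwise_accuracy(w, popular, non_popular):
    """Fraction of (popular, non-popular) pairs that w orders correctly."""
    def score(x):
        return sum(wi * xi for wi, xi in zip(w, x))
    pairs = [(p, n) for p in popular for n in non_popular]
    correct = sum(score(p) > score(n) for p, n in pairs)
    return correct / len(pairs)
```

Because only the relative order within a pair matters, the ranker never has to decide an absolute popularity threshold, which is the motivation for the ranking formulation.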
Most Popular Articles
Titles of the most popular articles per outlet, as ranked using Ranking SVMs for December 2009.
Outlet               Titles of Top-3 Articles
CBS                  Sources: Elin Done with Tiger / Tiger Woods Slapped with Ticket for Crash / Tiger Woods: I let my Family Down
Florida Times-Union  Pizza delivery woman killed on Westside / A family’s search for justice, 15 years later / Rants & Raves: Napolitano unqualified
NY Times             Poor Children Likelier to Get Antipsychotics / Surf’s Up, Way Up, and Competitors Let Out a Big Mahalo / Grandma’s Gifts Need Extra Reindeer
Reuters              Dubai says not responsible for Dubai World debt / Boeing Dreamliner touches down after first flight / Iran’s Ahmadinejad mocks Obama, “TV series” nuke talks
Seattle Post         Hospital: Actress Brittany Murphy dies at age 32 / Actor Charlie Sheen arrested in Colorado / Charlie Sheen accused of using weapon in Aspen
EU Mediasphere
Top-10 media outlets per country
over the 27 EU countries
in 22 different languages
for a 6-month period
A total of 1.3M news items.
What patterns can we find using modern AI techniques?
The EU Mediasphere
Co-coverage network: we link two outlets if they share more stories than expected by chance (χ²-scores).
This network has 203 nodes and 6702 edges.
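The slides do not spell out how the χ²-score is computed, so the following is only one plausible reading: a Pearson χ² statistic on the 2×2 table of stories covered or not covered by each of two outlets. The function names and the default threshold (3.841, the 5% critical value for one degree of freedom) are assumptions.

```python
def chi2_cocoverage(stories_a, stories_b, all_stories):
    """Pearson chi-square statistic for the 2x2 co-coverage table of two outlets.

    Returns (chi2, over) where `over` is True if the outlets share more
    stories than independence would predict. Assumes both story sets are
    subsets of all_stories.
    """
    n = len(all_stories)
    both = len(stories_a & stories_b)
    only_a = len(stories_a - stories_b)
    only_b = len(stories_b - stories_a)
    neither = n - both - only_a - only_b
    row_a, row_not_a = both + only_a, only_b + neither
    col_b, col_not_b = both + only_b, only_a + neither
    denom = row_a * row_not_a * col_b * col_not_b
    if denom == 0:
        return 0.0, False
    chi2 = n * (both * neither - only_a * only_b) ** 2 / denom
    expected_both = row_a * col_b / n
    return chi2, both > expected_both


def should_link(stories_a, stories_b, all_stories, threshold=3.841):
    """Link two outlets when they over-share stories and chi2 exceeds the threshold."""
    chi2, over = chi2_cocoverage(stories_a, stories_b, all_stories)
    return over and chi2 > threshold
```

Raising the threshold keeps only the strongest links, which is one way to obtain the sparser versions of the network.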
A bit sparser....
This network has 197 nodes, 3386 edges and 3 connected components.Singleton nodes are omitted.
What kind of connected components are formed?
We make the network as sparse as possible, using modularity maximisation as the stopping criterion.
The probability that two non-singleton nodes from the same country end up in the same connected component is 82.9% (p < 0.001).
Nationality is the major underlying criterion for which stories media outlets choose to publish.
We will therefore work at the country level rather than the outlet level.
Which are the strongest connections between countries?
We go as sparse as possible while keeping the network connected.
This network has 27 nodes and 112 edges.
Can we explain relations of countries?
We found a significant (p < 0.001) correlation between countries’ media-content similarity and their:
Geographical proximity, based on sharing of borders: 33.86%
Economic proximity, based on trade volume: 31.03%
Cultural proximity, based on song contest voting patterns: 32.05%
UK Metro, Dec. 8, 2010: “Countries that always vote for each other in the Eurovision song contest have a shared interest in news content, as well as terrible music, a study has shown [...]”
How ‘close’ are countries, based on common media interests?
We use χ²-scores as similarities and project the countries onto a 2D plane using Multidimensional Scaling.
We colour the Eurozone members in blue.
These countries are closer to the centre, that is, the average EU media content.
Ranking of countries
Based on their deviation from average EU media content (in 26D space).
Rank  Country         Eurozone  Accession year
 1    France          Y         1957
 2    Austria         Y         1995
 3    Germany         Y         1957
 4    Greece          Y         1981
 5    Ireland         Y         1973
 6    Cyprus          Y         2004
 7    Slovenia        Y         2004
 8    Spain           Y         1986
 9    Slovakia        Y         2004
10    Italy           Y         1957
11    Belgium         Y         1957
12    Luxembourg      Y         1957
13    Bulgaria        N         2007
14    Netherlands     Y         1957
15    U. Kingdom      N         1973
16    Finland         Y         1995
17    Sweden          N         1995
18    Poland          N         2004
19    Estonia         N         2004
20    Denmark         N         1973
21    Portugal        Y         1986
22    Malta           Y         2004
23    Czech Republic  N         2004
24    Romania         N         2007
25    Latvia          N         2004
26    Hungary         N         2004
27    Lithuania       N         2004
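One simple way to read "deviation from average EU media content" is the Euclidean distance of each country's coordinates from the centroid of all countries. The sketch below ranks countries that way. The choice of Euclidean distance and the toy coordinates in the example are assumptions, since the slides state only that the ranking is computed in a 26-dimensional space.

```python
import math


def rank_by_deviation(country_vectors):
    """Order countries from closest to farthest from the centroid of all countries."""
    if not country_vectors:
        return []
    names = list(country_vectors)
    dim = len(country_vectors[names[0]])
    centroid = [sum(country_vectors[n][k] for n in names) / len(names)
                for k in range(dim)]
    deviation = {n: math.dist(country_vectors[n], centroid) for n in names}
    return sorted(names, key=deviation.get)
```

For three hypothetical countries at positions 0, 1 and 4 on a line, the centroid is at 5/3, so the country at 1 ranks first and the one at 4 ranks last.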
Any other important factors?
Correlations of countries’ deviation from average EU media content.
Factor              Correlation (%)  p-value
In Eurozone          70.65           <0.001
Accession Year      -49.32            0.009
GDP 2008             44.75            0.020
Population           23.05            0.247
Area                 15.63            0.435
Population Density    7.45            0.712
The first three factors are significant (p < 0.05), while the rest are not.
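The slides do not state which correlation coefficient or significance test was used. As a hedged illustration, the sketch below computes a Pearson correlation and a two-sided permutation p-value; both choices, and all names and parameters, are assumptions made for the example.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


def permutation_p_value(x, y, n_permutations=10000, seed=0):
    """Two-sided p-value: how often a shuffled y correlates at least as strongly."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)
```

A binary factor such as Eurozone membership can be encoded as 0/1 and passed in like any other variable. With only 27 countries, a permutation test avoids relying on distributional assumptions.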
Discussion
EU media editors independently made a multitude of small editorial decisions, which shaped the content of the EU mediasphere in a way that reflects its deep geographic, economic and cultural relations.
Detecting these subtle signals in a statistically rigorous way would be out of the reach of the traditional methods of social scientists.
This analysis demonstrates the power of the available methods for significant automation of media content analysis.
Conclusions
Several tasks can be automated using modern AI techniques.
◮ However, the high-precision, sophisticated annotation that human ‘coders’ can achieve is not yet reachable.
Research can be conducted across multiple languages and countries.
A study can run for a long period of time, involving millions of articles.
We can tackle questions that could not be answered previously.
◮ e.g. Which country’s media have the most negative coverage of environmental news that also mentions Barack Obama?
In the social sciences, the analysis of news media is done largely by hand in a hypothesis-driven fashion. Is it time for the social sciences to also adopt a data-driven research model?
Future Work
Ideas under development:
Use of features such as images / audio / video.
How does SMT affect supervised and unsupervised learning?
Compare the US and the EU mediaspheres
...
Work from other members of the group
Suffix Tree - Detection of memes
Named Entities detection & disambiguation
Twitter - Events detection
Summarisation of news
Online learning algorithms for news annotation
References 1
I. Flaounas, M. Turchi, O. Ali, N. Fyson, T. De Bie, N. Mosdell, J. Lewis and N. Cristianini: “The Structure of the EU Mediasphere”, PLoS ONE, Vol. 5(12), e14243, 2010.
I. Flaounas, M. Turchi, T. De Bie and N. Cristianini: “Inference and Validation of Networks”, ECML/PKDD, Springer, LNCS, Vol. 5782(1), pp. 344–358, Bled, Slovenia, 2009.
O. Ali, I. Flaounas, T. De Bie, N. Mosdell, J. Lewis and N. Cristianini: “Automating News Content Analysis: An Application to Gender Bias and Readability”, JMLR W & CP: Workshop on Applications of Pattern Analysis (WAPA), Vol. 11, pp. 36–43, Windsor, UK, 2010.
M. Turchi, I. Flaounas, O. Ali, T. De Bie, T. Snowsill and N. Cristianini: “Found in Translation”, ECML/PKDD, Springer, LNCS, Vol. 5782(2), pp. 746–749, Bled, Slovenia, 2009.
References 2
I. Flaounas, N. Fyson and N. Cristianini: “Predicting Relations in News-Media Content among EU Countries”, 2nd International Workshop on Cognitive Information Processing (CIP), IEEE, pp. 269–274, Elba, Italy, 2010.
E. Hensinger, I. Flaounas and N. Cristianini: “Learning the Preferences of News Readers with SVM and Lasso Ranking”, Artificial Intelligence Applications and Innovations, Springer, pp. 179–186, Larnaca, Cyprus, 2010.
T. Snowsill, I. Flaounas, T. De Bie and N. Cristianini: “Detecting events in a million New York Times articles”, ECML/PKDD, Springer, LNCS, Vol. 6323(3), pp. 615–618, Barcelona, Spain, 2010.
I. Flaounas, M. Turchi and N. Cristianini: “Detecting Macro-Patterns in the European Mediasphere”, IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, pp. 527–530, Milano, Italy, 2009.
More info at: http://mediapatterns.enm.bris.ac.uk
Thank you!