Predicting Document Creation Times in News Citation Networks
Andreas Spitz1, Jannik Strötgen2, and Michael Gertz1
April 23, 2018 — TempWeb 2018, Lyon
1 Database Systems Research Group, Heidelberg University, Germany
2 Bosch Center for Artificial Intelligence, Germany
Hm, when did this happen again?
News Citation Networks
News Citation Network Extraction
News Citation Network Overview
News articles from RSS feeds:
- Politics and business feeds
- 34 English news outlets (USA, UK, AUS, CAN, GER, CHN, QAT)
- 2 years (Nov 2015 - Oct 2017)
- 244.6 thousand articles
- 367.2 thousand edges
Used data:
- Hyperlinks in the article body
- Publication dates
- Temporal expressions
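As a rough illustration of how hyperlinks in article bodies can be turned into a citation network, the sketch below builds directed edges from a citing article to a cited article. The toy URLs, the `articles` layout, and the choice to keep only links whose target is also in the collection are assumptions for illustration, not the authors' extraction pipeline.

```python
from datetime import date

# Hypothetical toy data: URL -> (publication date, hyperlinks found in the article body)
articles = {
    "a.example/1": (date(2016, 3, 1), []),
    "b.example/2": (date(2016, 3, 4), ["a.example/1", "elsewhere.example/x"]),
    "c.example/3": (date(2016, 3, 9), ["a.example/1", "b.example/2"]),
}

def build_citation_network(articles):
    """Return directed edges (citing, cited) between articles in the collection."""
    edges = set()
    for url, (_, links) in articles.items():
        for target in links:
            # Assumption: links to pages outside the collection and self-links are dropped.
            if target in articles and target != url:
                edges.add((url, target))
    return edges

edges = build_citation_network(articles)
```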
News Outlet Statistics (sample)
| short | news outlet | days | 〈articles〉 | 〈temp exp〉 | other in | other out |
|---|---|---|---|---|---|---|
| AT | The Atlantic | 334 | 7.2 | 10.5 | 16.7 | 50.6 |
| BBC | British Bc. Corp. | 730 | 8.1 | 6.5 | 19.1 | 8.0 |
| DW | Deutsche Welle | 334 | 1.2 | 6.1 | 48.1 | 5.9 |
| FOX | Fox News | 548 | 2.7 | 9.8 | 0.0 | 0.0 |
| NPR | National Public Radio | 334 | 0.4 | 8.4 | 63.6 | 58.5 |
| NY | The New Yorker | 548 | 3.0 | 13.2 | 33.5 | 30.6 |
| NYT | New York Times | 669 | 23.8 | 10.7 | 26.8 | 4.7 |
| SMH | Sydney Morn. Herald | 548 | 2.3 | 7.0 | 3.0 | 51.9 |
| WP | Washington Post | 548 | 62.7 | 9.4 | 13.7 | 5.1 |
Evolution of Network Metrics
[Figure: four panels (clustering coefficient, average path length, average degree, undirected diameter). x-axis: days (2016-01 to 2017-07); y-axis: measure value; series: aggregated, politics, and business networks.]
Exploring Citation Chains
Article Publication Time Prediction
Task Definition: Publication Time Prediction
Available News Citation Network Data
Predict article publication times from:
- Citation network topology
- Publication dates of adjacent articles
- Temporal expressions in adjacent articles
- Not the metadata of the article itself
- Not the article content
Feature Extraction
Network Topology Features
Node degree-based features:
- Incoming degree
- Outgoing degree
- Undirected degree
Density-based features:
- Undirected local clustering coefficient
Centrality-based features:
- Betweenness centrality
- Incoming closeness centrality
- Outgoing closeness centrality
- PageRank centrality
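The degree-based and density-based features listed above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: edges are (citing, cited) pairs, the undirected degree counts distinct neighbours, and the function name and feature keys are hypothetical. The centrality features are left out to keep the sketch short.

```python
from collections import defaultdict

def topology_features(edges):
    """Compute degree features and the undirected local clustering coefficient per node."""
    out_nb, in_nb, und_nb = defaultdict(set), defaultdict(set), defaultdict(set)
    for src, dst in edges:
        out_nb[src].add(dst)
        in_nb[dst].add(src)
        und_nb[src].add(dst)
        und_nb[dst].add(src)

    features = {}
    for node, nbrs in und_nb.items():
        k = len(nbrs)
        # Count links among the node's neighbours, each undirected pair once.
        links = sum(1 for u in nbrs for v in nbrs if u < v and v in und_nb[u])
        clustering = 2 * links / (k * (k - 1)) if k > 1 else 0.0
        features[node] = {
            "deg_in": len(in_nb[node]),
            "deg_out": len(out_nb[node]),
            "deg_all": k,
            "clustering": clustering,
        }
    return features

# Toy network: b, c and d cite a; c also cites b.
edges = [("b", "a"), ("c", "a"), ("c", "b"), ("d", "a")]
feats = topology_features(edges)
```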
Temporal Network Features
Temporal Expression Features
Correlation of temporal expressions:
- good with publication dates of referencing articles (incoming edges)
- bad with publication dates of referenced articles (outgoing edges)
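The feature-importance plots later in this deck label features with min, max, µ, σ, and span over sets of dates from adjacent articles. A minimal sketch of such an aggregation is given below. It assumes that dates are converted to day ordinals and that σ is the population standard deviation; both choices, as well as the function name, are assumptions of this sketch rather than details given on the slides.

```python
from datetime import date
from statistics import mean, pstdev

def aggregate(dates):
    """Summarise a list of dates as min, max, mean, std and span (in days)."""
    if not dates:
        return None  # the feature is missing when no dates are available
    days = [d.toordinal() for d in dates]
    return {
        "min": min(days),
        "max": max(days),
        "mean": mean(days),
        "std": pstdev(days),
        "span": max(days) - min(days),
    }

# Hypothetical temporal expressions found in articles that reference the target.
expressions = [date(2016, 5, 1), date(2016, 5, 3), date(2016, 5, 11)]
stats = aggregate(expressions)
```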
Missing Features and Imputation
Missing features
- 30.8% of feature values are missing
- 89.6% of articles are missing at least one feature
Imputation of missing values
- Column mean of the feature
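Column-mean imputation can be sketched as follows. Missing values are represented as `None` here, and the fallback of 0.0 for a column without any observed value is an assumption of this sketch, not something stated on the slide.

```python
def impute_column_means(rows):
    """Replace None entries with the mean of the observed values in the same column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Assumption: fall back to 0.0 if a column has no observed values at all.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]

# Toy feature matrix: one row per article, one column per feature.
X = [[1.0, None], [3.0, 4.0], [None, 8.0]]
X_imputed = impute_column_means(X)
```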
Evaluation
Regression Methods
Used regression methods:
- BASE: Baseline (average publication date of adjacent articles)
- LR: Linear regression
- BAY: Bayesian ridge regression (Laplace model)
- RF: Random forest
- GB: Gradient boosting (Laplace distribution, decision trees)
- SVM: Support vector machine (radial kernel)
- NN: Neural network (feedforward, one hidden layer)
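The BASE method is described as the average publication date of adjacent articles. A minimal sketch of that idea follows; treating both incoming and outgoing neighbours as adjacent, rounding to whole days, and the function name are assumptions for illustration.

```python
from datetime import date

def baseline_prediction(node, edges, pub_dates):
    """Predict a publication date as the mean date of all adjacent articles."""
    neighbours = {v for u, v in edges if u == node} | {u for u, v in edges if v == node}
    known = [pub_dates[n].toordinal() for n in neighbours if n in pub_dates]
    if not known:
        return None  # no adjacent article with a known date
    return date.fromordinal(round(sum(known) / len(known)))

# Toy example: x cites a, and b cites x.
edges = [("x", "a"), ("b", "x")]
pub_dates = {"a": date(2016, 3, 1), "b": date(2016, 3, 11)}
predicted = baseline_prediction("x", edges, pub_dates)
```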
Evaluation Results: Mean Absolute Error (days)
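| | BASE | LR | BAY | NN | RF | GB | SVM |
|---|---|---|---|---|---|---|---|
| all | 66.72 | 60.46 | 59.61 | 26.88 | 24.98 | 22.66 | 26.19 |
| in | 88.88 | 66.48 | 87.55 | 34.03 | 32.25 | 27.49 | 32.29 |
| out | 87.32 | 59.54 | 40.24 | 32.52 | 30.10 | 26.68 | 30.77 |
| in+out | 18.68 | 55.45 | 54.95 | 12.62 | 11.23 | 12.76 | 14.31 |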
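For reference, the mean absolute error in days can be computed as in the sketch below. Publication times are represented as day numbers, and the values are made up for illustration; they are not taken from the evaluation.

```python
def mean_absolute_error(true_days, predicted_days):
    """Average of |true - predicted| over all articles, in days."""
    errors = [abs(t - p) for t, p in zip(true_days, predicted_days)]
    return sum(errors) / len(errors)

# Hypothetical true and predicted publication days for three articles.
mae = mean_absolute_error([10, 20, 30], [12, 15, 30])
```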
Distribution of Absolute Errors
[Figure: distribution of absolute errors (days, axis from 0 to 250) per regression method (BASE, LR, BAY, NN, RF, GB, SVM), in four panels: all, in, out, in+out.]
Recall by Varying Absolute Error
[Figure: recall (percentage of predictions < absolute error, axis from 0 to 100) against absolute error (days, axis from 0 to 60) per method (BASE, LR, BAY, NN, RF, GB, SVM), in four panels: all, in, out, in+out.]
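Recall at a given error threshold, read from the axis label as the percentage of predictions whose absolute error is below that threshold, can be sketched as follows. The function name and the error values are hypothetical.

```python
def recall_at(abs_errors, threshold_days):
    """Percentage of predictions with absolute error below the threshold."""
    hits = sum(1 for e in abs_errors if e < threshold_days)
    return 100.0 * hits / len(abs_errors)

# Hypothetical absolute errors in days for four predictions.
abs_errors = [1, 5, 12, 30]
recall_10 = recall_at(abs_errors, 10)
recall_60 = recall_at(abs_errors, 60)
```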
Feature Importance: Random Forest
[Figure: relative importance (log scale, 10^-3 to 10^0) per feature for the random forest, marked by feature type (network topology, temporal expression, temporal network). Features in plotted order: max(T_out), min(T_in), µ(T_out), µ(T_in), min(T_out), max(T_in), max(X_in), µ(X_in), c_pr, σ(T_out), σ(X_in), c_cl,out, span(T_out), σ(T_in), span(T_in), min(X_in), span(X_in), c_cl,in, min(Dist), deg_out, µ(Dist), deg_in, deg_all, max(Dist), c_btw, cc, σ(Dist).]
Feature Importance: Gradient Boosting
[Figure: relative importance (log scale, 10^-5 to 10^0) per feature for gradient boosting, marked by feature type (network topology, temporal expression, temporal network). Features in plotted order: max(T_out), min(T_in), deg_out, µ(T_out), min(Dist), deg_in, c_pr, σ(T_out), σ(T_in), µ(T_in), deg_all, span(T_in), max(T_in), µ(X_in), min(T_out), µ(Dist), c_btw, max(X_in), max(Dist), span(X_in), σ(X_in), span(T_out), min(X_in), c_cl,out, σ(Dist), c_cl,in, cc.]
Summary & Resources
Summary
News citation networks:
- Focus on anchored links inside the article body
- Constructed like a citation network between articles
Publication date prediction:
- Can be framed as a regression problem
- Average prediction error of 3 weeks
- Temporal network features are most discriminative
Resources
Data and implementation are available online:
- [data] News citation network (including URLs)
- [data] Temporal annotations
- [code] Publication date prediction
https://dbs.ifi.uni-heidelberg.de/resources/data/