Big Data at BBVA Research
Big Data Workshop on economics and finance. Bank of Spain
Alvaro Ortiz, Tomasa Rodrigo and Jorge Sicilia
February 2018
Index
01 Opportunities in the digital era. Big Data at BBVA Research
02 Big Data & Big Models: Applying Big Data at BBVA Research
• BBVA Data: Monitoring consumption using BBVA transactional data: the Spanish Retail Sales Index
• News Data (GDELT): Tracking Chinese Vulnerability Sentiment
• Policy Documents: Analyzing Central Banks’ policy using text mining and sentiment analysis: the case of Turkey
03 Annex
01 Opportunities in the digital era. Big Data at BBVA Research
Big Data & Big Models at BBVA Research Big Data at BBVA Research
Traditional data could not answer some relevant questions…
• Social awareness and the Arab Spring
• Political events and social reaction
• Natural disasters and epidemics
… preventing us from measuring their economic impact…
… in a world of increasing risks and uncertainty
The use of Big Data and data science techniques allows us to quantify these trends
New framework in the digital era…
Novel data-driven computational approaches are needed to exploit the opportunities of the new digital era, where data can be used to study the world in real time, at both the micro and the macro level
• New answers to old questions
• Better and faster infrastructure
• New data availability
• Higher computational abilities to face more data granularity
• Combination of historical data with real time data
• Advanced data science techniques and algorithms
… which requires the development of new competences to take advantage of it
New data may end up changing the way in which economists approach empirical questions and the tools they use to answer them
• Asking the right questions
• Developing the data management and programming capabilities to work with large-scale datasets
• Deepening the statistical and econometric skills to analyze and deal with high-dimensional data
• Interpreting the results: summarize, describe and analyze the information
Big Data at BBVA Research
Our work
• We analyze geopolitical, political, social and economic issues using large-scale databases and quantitative data-driven methods rather than just qualitative introspection
Our datasets
• Media data to exploit news intensity, geographic density of events (location intelligence) and emotions across the world (sentiment analysis)
• BBVA aggregate and anonymized data from clients' digital footprint
• Data from the web (Central Banks’ reports among others)
Our results
• We are at the research frontier in the geopolitical and economic area, contributing to innovation and increasing our internal and external reach
Internal and external diffusion
[Diagram: BBVA Research, BBVA and external institutions]
Our working process
• Data sources: GDELT, BBVA data, Google search, Web
• Databases: BigQuery and Amazon Redshift
• Clean, aggregate, transform and model the data
• Fuse, visualize & analyze the data
[Diagram stages: Databases, SaaS, Analysis, Visualization]
Main takeaways working with Big Data
Complement and enrich our traditional databases with high-dimensional data
It helps us to…
• Quantify new trends and exploit new dimensions
• Have timely answers on the impact of different events
• Improve our models' performance at nowcasting
… but there are still some challenges
• Data challenges: missing data, data sparsity, data quality,…
• The time horizon is not yet long enough to improve our models' performance at forecasting
[Figures: Conflict Intensity Map 2017; Refugees Flows Map 2015-17; Barcelona Instability Index 2017 (daily, August to September 2017, with the Barcelona attacks of 17 Aug. 2017 marked); Chinese slowdown: country network]
Data treatment and robustness check
To deal with new and high-dimensional data:
1. Data treatment and analysis: data cleaning, missing values, outlier detection, high heterogeneity, sparsity,… New methodologies to face data challenges: dimensionality reduction, clustering, regularization,…
2. Robustness check: cross-check of Big Data outcomes with traditional data and methodologies
• Ebola outbreak: WHO and GDELT
• Protectionism: GTA and GDELT
• Retail sales: INE and BBVA
! Massive and unstructured datasets: importance of asking the right questions
Our products
• Political, Geopolitical & Social Indexes (Political Indexes)
• Color Maps NAFTA Topics (NAFTA Project)
• Politics & Financial Networks (Political Networks)
• Mix Hard data & Sentiment & VAR models (CBSI and Turkey Sentiment Indexes)
• Geographical Analysis Housing Prices (sentiment on Housing Prices)
• Measuring Sentiments (Sentiment Analysis on Economy and Society)
• Financial Stability & Macroprudential (ECB & FED FS index by FED Board)
• Monetary & Stability tones by Central Banks
02 Big Data & Big Models: Applying Big Data at BBVA Research
Big Data & Models at BBVA Research: Some Examples
1. BBVA Data (“Transactions Geo-referenced Data”): Hard Data
2. International News (GDELT) (Narratives & Sentiment): Hard + Market + Text Data
3. Policy Reports (Topics, Networks & Sentiment): Text Data
1. BBVA Data (“Transactions Geo-referenced Data”): Real Time Indicator of Retail Sales (Spain)
The dynamics of the retail trade sector lead the evolution of aggregate consumption, which represents a high share of GDP
Accurate estimates of the evolution of retail trade activity are of major importance, since it is a key indicator of the current economic situation
[Figure: Spanish case: Retail sales vs. Consumption expenditure of households (%, YoY), Mar-04 to Sep-17. Series: Retail sales, Consumption expenditure of households]
Source: BBVA Research and BBVA Data & Analytics from INE data
The Retail Sales Index: Matching internal and external sources
EXTERNAL TAXONOMY - SPAIN
INE: Instituto Nacional de Estadística
5 distribution classes:
• service stations
• single retail stores (one premise)
• small chain stores (2-24 premises & <50 employees)
• large chain stores (25 or more premises, and 50 or more employees)
• department stores (sales area greater than or equal to 2,500 m2)
INTERNAL TAXONOMY - SPAIN (REDSHIFT)
• CATEGORY (FASHION)
• SUBCATEGORY (FASHION-BIG)
• RAMO / GIRO (TEXTILES AND CLOTHING)
• CIF / RFC (CADENA ZARA)
• FUC / AFILIACIÓN (ZARA, Gran Vía, Madrid)
• POS ID (TPV)
More information can be found in the annex
Data extraction, cleaning and transformation
• Select variables
• Query data
• Data cleaning
• Outlier detection
• Normalize data
• Testing data
• Automate process
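The cleaning, outlier-detection and normalization steps can be illustrated with a minimal Python sketch. It is not the pipeline used by the authors: the robust z-score rule, the threshold of 3.5 and the sample figures are assumptions chosen only to make the steps concrete.

```python
from statistics import median


def clean_and_normalize(daily_totals, mad_threshold=3.5):
    """Drop missing values, filter outliers with a robust z-score, then standardize.

    The modified z-score rule and its threshold are illustrative assumptions.
    """
    # Data cleaning: discard missing observations (None).
    values = [v for v in daily_totals if v is not None]
    # Outlier detection: modified z-score based on the median absolute deviation.
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        kept = values
    else:
        kept = [v for v in values if 0.6745 * abs(v - med) / mad <= mad_threshold]
    # Normalization: zero mean and unit (population) standard deviation.
    mean = sum(kept) / len(kept)
    std = (sum((v - mean) ** 2 for v in kept) / len(kept)) ** 0.5
    return [(v - mean) / std for v in kept]


# Hypothetical daily totals with one missing value and one obvious outlier.
sample = [100.0, 102.0, None, 98.0, 101.0, 5000.0, 99.0]
normalized = clean_and_normalize(sample)
```

A median-based rule is used here because a single extreme value (such as the 5000.0 above) would inflate an ordinary standard deviation and could hide itself.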
A “High Definition” Retail Sales Index (RTI) for Spain* (and Mexico)
(BBVA consumption indicator for the optimal allocation of BBVA’s resources and products)
What “HIGH DEFINITION” means here:
Using BBVA data, we replicate national figures, gaining frequency…
• High granularity: dynamics down to subnational level
• Ultra high frequency: dynamics up to sub-monthly frequency
• Multi-dimensional: more detailed socioeconomic features
[Figure: Comparison Retail Sales (RTI) - INE and BBVA on monthly basis (standardized monthly growth), Jan-13 to Jan-18. Series: BBVA-RTI, INE-RTI; the BBVA & CX data merge is marked]
[Figure: RTI–BBVA Index, in millions of euros and daily basis, Jan-13 to Jan-18. Series: Log(total), Sundays, Saturdays]
* Bodas et al. (2018, forthcoming). Measuring retail trade through card’s transaction data.
Source: BBVA Research and BBVA Data & Analytics
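To make the "standardized monthly growth" comparison concrete, the following Python sketch computes month-on-month growth, standardizes it and correlates two series. The two index series are hypothetical numbers invented for illustration; they are not BBVA or INE data, and the exact transformation used by the authors may differ.

```python
def monthly_growth(levels):
    """Month-on-month growth rates in percent."""
    return [100.0 * (cur / prev - 1.0) for prev, cur in zip(levels, levels[1:])]


def standardize(xs):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / std for x in xs]


def correlation(xs, ys):
    """Pearson correlation as the mean product of standardized values."""
    zx, zy = standardize(xs), standardize(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(zx)


# Hypothetical monthly levels, for illustration only.
card_index = [100.0, 104.0, 101.0, 108.0, 103.0, 110.0]
official_index = [200.0, 207.0, 203.0, 215.0, 207.0, 219.0]

z_card = standardize(monthly_growth(card_index))
z_official = standardize(monthly_growth(official_index))
rho = correlation(monthly_growth(card_index), monthly_growth(official_index))
```

Standardizing both growth series puts them on a common scale, so a card-based index and an official index with different levels and volatilities can be compared directly.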
Going further from national figures, retail sales at the regional level…
[Figure: BBVA transactions December 2017 (% yoy)]
[Figures: Basque Country (BBVA transactions vs. retail sales) and its provinces Álava, Vizcaya and Guipúzcoa (BBVA transactions), Jan-13 to Jan-18]
Source: BBVA Research and BBVA Data & Analytics
… and by sector and “size” of activity
[Figures: BBVA retail sales by distribution classes (standardized monthly growth), Jan-13 to Jan-18. Panels: Single retail stores, Small chain stores, Large chain stores, Department stores]
[Figure: BBVA retail sales index by sector of activity (median ticket in December 2017). Series: Median ticket, 25th percentile, 75th percentile. Sectors: Health; Other services; Bars and restaurants; Books, press and magazines; Food; Leisure and entertainment; Care and beauty; Home; Transport; Travelling; Large stores; Real estate; Fashion; Accommodation; Automotive; Sports and toys; Technology]
Source: BBVA Research and BBVA Data & Analytics
2. International News (GDELT) (Narratives & Sentiment): Vulnerability Sentiment Indicator (CVSI) (China)
Tracking China Vulnerability in Real Time: Mixing Data and Sentiment
• Hard Data Indicators: … provide accurate information… but at lower frequencies and with delays…
• Market Indicators: … real time but limited information… also influenced by global factors…
• Sentiment Indicators: … complementing hard, soft and market data in real time: “sentiment” on special topics, not quotes
A Balanced set of Information in the Database: Hard Data, Markets and Sentiment
[Figure: Chinese Vulnerability Sentiment Index (CVSI): components and evolution]
Source: www.gdelt.org & BBVA Research
Results show the Importance of Sentiment
Chinese Vulnerability Sentiment Index (CVSI): Weights (HD = hard data, M = market, S = sentiment)

SOE Vulnerability
Variable (Type) | Weight
Total Profits (HD) | 19.63
Institutional Reform & SOEs (S) | 12.37
Debt & SOE (S) | 11.92
Liabilities (HD) | 10.62
Local Government & SOE (S) | 9.75
Industry Policy (S) | 9.5
Resource Mis. & P. Failure (S) | 8.18
SOE (S) | 7.15
Industry Laws & Regulation (S) | 5.28
Resource Misalloc. & SOE (S) | 5.61
% of Sentiment in Component (S) | 69.76
% of Variance by 1st PC | 63.2
Weight in the CVSI | 29.18

Housing Bubble
Variable (Type) | Weight
New Construction (HD) | 16.37
Mortgage Loans (HD) | 14.57
Land Reform (S) | 12.62
Housing Price (HD) | 11.6
Housing Construction (S) | 10.59
Housing Prices (S) | 10.05
Housing Policy & Institutions (S) | 8.94
Housing Finance (S) | 7.83
Housing Markets (S) | 6.71
GICS Housing Index (M) | 0.36
Real Estate Investment (HD) | 0.31
% of Sentiment in Component (S) | 56.74
% of Variance by 1st PC | 65.8
Weight in the CVSI | 26.05

Shadow Banking
Variable (Type) | Weight
Wenzhou Index (M) | 16.58
WMP Yields (M) | 13.63
Infrastructure Funds (S) | 10.92
NPL ratio (HD) | 9.46
State & Financial Inst (S) | 8.95
Banking Regulation (S) | 7.28
Financial Vulnerability (S) | 6.82
Asset Management (S) | 5.62
Financial Sector Instability (S) | 5.35
Bank Capital Adequacy (S) | 4.6
Non Bank Financial Inst (S) | 4.35
Monetary & F. Stability (S) | 3.54
Acceptances (HD) | 2.22
TSF Aggregate New (HD) | 0.57
Entrusted Loans (HD) | 0.13
% of Sentiment in Component (S) | 57.43
% of Variance by 1st PC | 59.5
Weight in the CVSI | 23.12

FX Speculative Pressure
Variable (Type) | Weight
Currency Exchange Rate (S) | 19.94
Exchange Rate Policy (S) | 17.95
Macroprudential Policy (S) | 15.33
Hinchon Index (M) | 13.84
CNY Currency (M) | 11.92
Capital Account (S) | 10.05
CNH Currency (M) | 8.73
Illicit Financial Flows (S) | 1.58
Foreign Reserves (S) | 0.6
Currency Reserves (S) | 0.06
% of Sentiment in Component (S) | 48.27
% of Variance by 1st PC | 78.99
Weight in the CVSI | 21.64

Source: BBVA Research. Ortiz et al. (2017). Tracking Chinese Vulnerability in Real Time Using Big Data
Bayesian Model Averaging robustness check confirms the relevance of sentiment in explaining market risk proxies
[Figure: Model inclusion based on the best 1000 models. Rows list the candidate hard data, market and sentiment regressors, e.g. total.profits.yoy, industry policy, wmp.yields, cny.curncy, exchange rate policy]
Bayesian Model Averaging Results: X with PIP>50% (PIP inclusion after 1000 models)
Dependent variable: CDS Spread
# | Variable | Type of data | Component | PIP | Posterior mean | Posterior SD
1 | Total profits | Hard Data | SOE | 1.000 | -0.358 | 0.042
2 | Institutional reform & SOE | Sentiment | SOE | 1.000 | -0.089 | 0.019
3 | Industry policy | Sentiment | SOE | 1.000 | -0.112 | 0.015
4 | Economic transparency | Sentiment | Shadow Banking | 1.000 | -0.069 | 0.015
5 | Mortgage loans | Hard Data | Housing bubble | 1.000 | 1.014 | 0.105
6 | Housing finance | Sentiment | Housing bubble | 1.000 | 0.086 | 0.016
7 | NPL ratio | Financial | Shadow Banking | 1.000 | -0.976 | 0.130
8 | Acceptances | Financial | Shadow Banking | 1.000 | -0.264 | 0.053
9 | CNY curncy | Financial | FX | 1.000 | -0.811 | 0.086
10 | Econ currency reserves | Sentiment | FX | 1.000 | 0.129 | 0.020
11 | Exchange rate policy | Sentiment | FX | 1.000 | -0.113 | 0.017
12 | Illicit financial flows | Sentiment | FX | 1.000 | 0.063 | 0.015
13 | Industry laws and regulations | Sentiment | SOE | 0.995 | 0.062 | 0.016
14 | Hicnhon index | Financial | FX | 0.994 | 0.080 | 0.022
15 | State owned enterprises | Sentiment | SOE | 0.953 | -0.055 | 0.020
16 | Econ housing prices | Sentiment | Housing bubble | 0.927 | 0.058 | 0.025
17 | Financial sector instability | Sentiment | Shadow Banking | 0.908 | 0.089 | 0.038
18 | Wenzhou index | Financial | Shadow Banking | 0.860 | 0.071 | 0.040
19 | Asset management | Sentiment | Shadow Banking | 0.817 | 0.042 | 0.025
20 | Banking regulation | Sentiment | Shadow Banking | 0.791 | 0.032 | 0.020
21 | Macroprudential policy | Sentiment | FX | 0.716 | -0.030 | 0.023
22 | Housing policy and institutions | Sentiment | Housing bubble | 0.706 | 0.030 | 0.023
23 | New construction | Hard Data | Housing bubble | 0.697 | 0.045 | 0.035
24 | Entrusted loans | Financial | Shadow Banking | 0.674 | -0.072 | 0.058
25 | Real Estate Investment | Hard Data | Housing bubble | 0.655 | 0.055 | 0.048
26 | Non bank financial institutions | Sentiment | Shadow Banking | 0.655 | -0.029 | 0.025
Source: BBVA Research
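As a reading aid for the PIP column, the sketch below shows how a posterior inclusion probability is obtained once a set of candidate models has been scored: each model receives a posterior weight, and a variable's PIP is the sum of the weights of the models that include it. The candidate models, their log marginal likelihoods and the uniform model prior are hypothetical assumptions; this is not the estimation carried out in the slides.

```python
import math


def posterior_inclusion_probabilities(models):
    """models: list of (set_of_included_variables, log_marginal_likelihood).

    Assumes a uniform prior over the listed models (an illustrative choice).
    """
    max_log = max(log_ml for _, log_ml in models)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(log_ml - max_log) for _, log_ml in models]
    total = sum(weights)
    posterior = [w / total for w in weights]
    variables = sorted(set().union(*(included for included, _ in models)))
    # PIP of a variable: total posterior weight of the models containing it.
    return {
        v: sum(p for (included, _), p in zip(models, posterior) if v in included)
        for v in variables
    }


# Hypothetical candidate models and scores, for illustration only.
candidate_models = [
    ({"total_profits", "housing_finance"}, -10.0),
    ({"total_profits"}, -11.0),
    ({"housing_finance"}, -14.0),
    (set(), -16.0),
]
pip = posterior_inclusion_probabilities(candidate_models)
```

In this toy setting `total_profits` appears in the two best-scoring models, so its PIP is close to one, while `housing_finance` is lower because one of the models containing it has little posterior weight.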
The index shows that vulnerability sentiment has improved since the authorities implemented some policies
[Figure: Chinese Vulnerability Sentiment Index (CVSI), Mar-15 to Jan-18. Evolution of the “Tone” or “Sentiment”: lower values indicate a deterioration of sentiment and higher vulnerability. Neutral area: +- 0.5 std. Regimes shown: high volatility and negative sentiment; low volatility and positive sentiment. Annotated events include: stock market crash; "Black Monday" stock market crash, trade halted; 3% devaluation; PMI falls to 4Y lows; RMB enters IMF's SDR basket; NPC meeting; NPC accepts lower growth target; policy intervention; Moody’s downgrade; S&P’s downgrade]
Source: www.gdelt.org & BBVA Research
The performance of the components has not been uniform…
[Figures: CVSI: SOEs Component; CVSI: Shadow Banking Component; CVSI: Real Estate Component; CVSI: External Component]
The role of language: “What” & “Where” you read matters
[Figure: China CVSI: English vs Chinese (index), Mar-15 to Jan-18. Series: Chinese Vulnerability Index (news in Chinese), Chinese Vulnerability Index (news in English). Annotations: Amplifier, Divergence]
[Figures: Turkey Econ. Sentiment: English vs Turkish News (% YoY and index), Mar-15 to Mar-17. Panels: BBVA-GB Monthly GDP vs. Economic Sentiment (Turkish Media); BBVA-GB Monthly GDP vs. Economic Sentiment (English Media)]
Source: www.gdelt.org & BBVA Research
Some of the tools developed in the analysis can be used to track vulnerability in a very granular way…
[Figure: China SOE Sentiment Diffusion Map (sentiment on SOE). Rows are individual SOEs (e.g. Dongfang Electric, Sinopec, China Mobile, Cnooc, China Grain Reserves); columns are months across 2015, 2016 and 2017]
[Figure: Geographical Analysis Housing Prices (sentiment on Housing Prices)]
Source: www.gdelt.org & BBVA Research
3. Policy Reports (Topics, Networks & Sentiment): Monetary Policy Topics & Sentiment (Turkey)
Text as Data: “Natural Language Processing (NLP) & Text Mining” for the analysis of the Central Bank of Turkey
• We use communication reports of the Central Bank of the Republic of Turkey (CBRT) from 2006 to October 2017
• We analyze “what” the CBRT is talking about through Latent Dirichlet Allocation (LDA) and Dynamic Topic Models (DTM)
• We apply network analysis to understand monetary policy complexity
• We check “how” the Central Bank talks by using sentiment analysis (dictionary assisted)
• We design some analytical tools to understand the monetary policy of Turkey through the official documents
External databases: web scraping and NLP techniques
Information extraction
• Documents
• Web pages
Pre-processing and text parsing
• Extract words
• Identify parts of speech
• Tokenization and multi-word tokens
• Stopword removal
• Stemming
• Case-folding
Transformation
• Text filtering
• Indexing to quantify text in lists of term counts
• Create the document-term matrix
• Weighting matrix
• Factorization (SVD)
Text mining and NLP
• Analysis and machine learning
• Topics extraction (LDA)
• Clustering
• Modelling (STM and DTM)
Sentiment analysis
• Apply sentiment dictionaries
• Semantic analysis and classification
• Clustering
More information can be found in the annex
First we identify the topics: word clouds help us to understand and identify topics… here there is plenty of room for the researcher's judgment
[Figures: word clouds for the topics Inflation, Global Flows, Monetary Policy and Activity]
Each word cloud represents the probability distribution of words within a given topic. The size and the color of a word indicate its probability of occurring within that topic
What’s the CBRT talking about? We aggregate topics in groups… to see the “dynamics” of Central Bank communication over time…
[Figure: Central Bank of Turkey Topics Evolution (in % of total), 2006-2017. Series: Global Flows, Economic Activity, Labor Market, Fiscal & Structural Policies, Inflation, Monetary Policy, Other]
• Global capital flows are increasingly important
• Economic activity discussion has returned to the fore
• Employment issues have gained relevance since the crisis
• Discussions on structural policies remain at a minimum*
• Inflation remains stable
• Monetary policy maintains its share but increases in “stress” periods
Source: BBVA Research. Ortiz et al (2017). How do the EM Central Bank talk? A Big Data approach to the Central Bank of Turkey
Networks can be very useful to understand how Central Banks elaborate their strategy and the interconnectedness of topics
[Figures: topic networks for three periods: Full Fledge Inflation Target (2006-09); The global financial crisis (2010-15); In search of price stability (2016-17)]
The network of the estimated and correlated topics using STM. The nodes in the graph represent the identified topics. Node size is proportional to the number of words in the corpus devoted to each topic (weight). Node color indicates clusters found with the modularity-based community detection algorithm developed by Blondel et al (2008). Topics whose label is unknown are removed from the graph in the interest of visual clarity. Edges represent words that are common to the topics they connect (co-occurrence of words between topics). Edge width is proportional to the strength of this co-occurrence between topics.
Disentangling communication policy by topic: the monetary policy case
[Figure: Central Bank of Turkey: Evolution of Topics, 2006-2017. Series: Global Flows, Economic Activity, Labor Market, Fiscal & Structural Policies, Inflation, Monetary Policy, Other]
[Figure: Monetary Policy Topics Distribution (% of Total), 2006-2017. Series: Liquidity & FX Policy, Interest Rate Policy, Macroprudential Policy]
Source: BBVA Research
Multiple targets lead to different policies…
[Figure: Standard Monetary & Macroprudential policies (Sentiments), 2007-2017. Series: Monetary Policy, Macroprudential Policy]
[Figure: Deviation from target (reference): Inflation & Credit (FX adj. Loans YoY minus 15% and inflation minus 5%), 2007-2017. Series: Inflation, Credit. Annotations: Requires tight standard & macroprudential; Allows tight standard & eased macroprudential]
Source: BBVA Research
Through sentiment analysis we can check “how” the CBRT is talking and obtain some assessment of the monetary policy stance… (it can differ depending on the documents)
[Figures: Central Bank of Turkey: Monetary Policy Sentiment (standardized, estimated through Big Data LDA and STM techniques from Minutes & Statements), 2006-2018. Axis labels: Tightening, Easing. Panels: Monetary Policy “Statements” (a more formal statement…); Monetary Policy “Minutes” (more extensive and analytical…)]
Source: Iglesias, J, Ortiz, A & Rodrigo, T (2007)
Source: BBVA Research
We can check whether the sentiment affects analysts (and test how machine & dictionary methods compare with experts' analysis)
[Figure: Experts vs Algorithms: Size of Surprises & Sentiments (sentiments from LDA algorithm and MP surprises by Demiralp et al), Jan-06 to Apr-10. Series: Surprise Size CBRT (Demiralp, Kara & Ozlu, EJPE 2011); Monetary Policy Sentiment from Minutes (Normalized); Monetary Policy Sentiment from Statements (Normalized)]
[Figure: Monetary Policy: Experts vs Algorithms (sentiments from LDA algorithm and MP surprises by Demiralp et al; 1 = Hawkish, 0 = Neutral, -1 = Dovish), Jan-06 to Apr-10. Series: Monetary Policy Surprise (Demiralp, Kara & Ozlu, EJPE 2011); Monetary Policy Sentiment from Minutes (Normalized); Monetary Policy Sentiment from Statements (Normalized)]
Source: BBVA Research
A good test is whether monetary policy sentiment can affect markets (a necessary condition to affect the MTM)
[Figures: Sentiment and the Markets: Response to Tightening and Easing (changes in interbank deposit and swap rates after monetary policy sentiment changes higher than 1 STD). Panels: Tightening, Easing]
Source: BBVA Research
But in the end “words” need to be complemented by “deeds” to affect the real economy and prices
[Figures: Monetary Policy Sentiment Effects: Standard Monetary Policy vs Sentiment (changes in interbank deposit and swap rates after monetary policy sentiment changes higher than 1 STD). Panels: Monetary Policy Rate Shock, Monetary Policy “Sentiment” Shock]
Source: BBVA Research
ANNEX
External sources: the case of Spain
Spain: Retail Trade Survey (RTS)
The Retail Trade Index is a business cycle indicator which shows the monthly activity of the retail sector (turnover)
Population scope: stores whose main activity is registered in Division 47 of the NACE-2009, which includes the following groups:
• Retail sale in non-specialized establishments (supermarkets, department stores, etc.)
• Retail sale in specialized establishments (food, beverages and tobacco; fuel; IT equipment and communications; personal goods, such as fabric, clothing and footwear; household items, such as textiles, hardware, electrical appliances and furniture; cultural and recreational items, such as books, newspapers and software; pharmaceutical products; etc.)
• Retail trade not carried out in establishments (eCommerce, home delivery, vending machines, etc.)
Sale of motor vehicles, foodservice, hospitality industry, financial services, etc., are not included in the RTS!
Sample: 12,500 stores (random stratified sampling for <50 employees + exhaustive for >=50)
Dissemination: by Autonomous Community (AA. CC.) or by 5 distribution classes:
• service stations,
• single retail stores (one premises),
• small chain stores (2-24 premises & <50 employees),
• large chain stores (25 or more premises, and 50 or more employees)
• department stores (sales area greater than or equal to 2500 m2)
Emotional indicator and coding system in GDELT
Average Tone: GDELT uses more than 40 tonal dictionaries to build a score ranging from -100 (extremely negative) to +100 (extremely positive) for each piece of news, with common values ranging between -10 (negative) and +10 (positive), and 0 indicating a neutral tone. A neutral sentiment can result from neutral language or from extreme positive sentiments being offset by negative ones. The sentiment variable is the balance between the percentage of words in the article with a positive emotional connotation and the percentage with a negative one, each computed over the total number of words in the article
[Figure: PETRARCH coding system example]
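The balance described above can be written as a small Python sketch: the share of positive words minus the share of negative words, both in percent of all words in the article. The two word lists are hypothetical mini-dictionaries used only for illustration; GDELT's own tonal dictionaries and tokenization are far richer and are not reproduced here.

```python
import re

# Hypothetical mini-dictionaries, for illustration only.
POSITIVE = {"growth", "improve", "recovery", "stable"}
NEGATIVE = {"crisis", "default", "risk", "slowdown"}


def average_tone(text):
    """Positive share minus negative share of words, in percent (-100 to +100)."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    positive_share = 100.0 * sum(w in POSITIVE for w in words) / len(words)
    negative_share = 100.0 * sum(w in NEGATIVE for w in words) / len(words)
    return positive_share - negative_share


tone = average_tone("Growth is stable but default risk and slowdown fears persist")
```

With two positive and three negative words out of ten, the example sentence scores 20 minus 30, a mildly negative tone; a text with equal positive and negative shares would score zero even though it is far from neutral in language, which is the caveat noted above.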
A two-step procedure to extract a common vulnerability factor from hard data, markets and news sentiment
1st step estimation: components
(1) $\mathrm{SOEI} = \gamma_1 x_1 + \gamma_2 x_2 + \dots + \gamma_{10} x_{10} + \epsilon_1$
(2) $\mathrm{HBI} = \delta_1 y_1 + \delta_2 y_2 + \dots + \delta_{11} y_{11} + \epsilon_2$
(3) $\mathrm{SBI} = \beta_1 z_1 + \beta_2 z_2 + \dots + \beta_{15} z_{15} + \epsilon_3$
(4) $\mathrm{FXI} = \rho_1 v_1 + \rho_2 v_2 + \dots + \rho_{10} v_{10} + \epsilon_4$
with $\gamma_i, \delta_i, \beta_i, \rho_i$ being the weight of every variable in the first principal component
2nd step estimation: index
(6) $\mathrm{CVSI} = \mu_1 \mathrm{SOE} + \mu_2 \mathrm{HB} + \mu_3 \mathrm{SB} + \mu_4 \mathrm{FX} + \varepsilon$
with $\mu_1, \mu_2, \mu_3, \mu_4$ being the weight of every component in the first principal component of the four components
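The two-step logic can be sketched in plain Python: extract the first principal component of each block of variables, then extract the first principal component of the four block scores. This is only an illustration under stated assumptions: the data are simulated around a single hypothetical common factor, the block sizes are arbitrary, and power iteration on the correlation matrix stands in for whatever PCA routine the authors used.

```python
import random


def standardize(col):
    mean = sum(col) / len(col)
    std = (sum((x - mean) ** 2 for x in col) / len(col)) ** 0.5
    return [(x - mean) / std for x in col]


def first_pc(columns, iterations=500):
    """Loadings and scores of the first principal component via power iteration."""
    z = [standardize(c) for c in columns]
    n, k = len(z[0]), len(z)
    # Correlation matrix of the standardized variables.
    corr = [[sum(a * b for a, b in zip(z[i], z[j])) / n for j in range(k)]
            for i in range(k)]
    w = [1.0] * k
    for _ in range(iterations):
        w = [sum(corr[i][j] * w[j] for j in range(k)) for i in range(k)]
        norm = sum(x * x for x in w) ** 0.5
        w = [x / norm for x in w]
    scores = [sum(w[i] * z[i][t] for i in range(k)) for t in range(n)]
    return w, scores


# Simulated data: every variable is a hypothetical common factor plus noise.
rng = random.Random(0)
factor = [rng.gauss(0, 1) for _ in range(200)]


def noisy_block(n_vars):
    return [[f + rng.gauss(0, 0.5) for f in factor] for _ in range(n_vars)]


blocks = {"SOE": noisy_block(3), "HB": noisy_block(3),
          "SB": noisy_block(4), "FX": noisy_block(3)}

# 1st step: one component index per block.
component_scores = {name: first_pc(cols)[1] for name, cols in blocks.items()}
# 2nd step: the aggregate index is the first PC of the four component indices.
mu, index = first_pc(list(component_scores.values()))
```

Here `mu` plays the role of the second-step weights and `index` the role of the aggregate index; with real data the sign and scale of a principal component are arbitrary and would need to be fixed by convention.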
Big Data & Big Models at BBVA Research Big Data at BBVA Research
Documents are defined as paragraphs.
Documents with fewer than 200 characters are excluded (titles, contents sections, …)
Words are then stemmed (reduced to their semantic root) to generate tokens
Feature selection is conducted on the tokens: common stopwords and words of length 3 or less are removed, and the remaining words are stemmed. Tokens are then filtered using a term-frequency-inverse-document-frequency (tf.idf) index (Manning and Schutze 1999): words in the lowest quantile are removed. The index combines a term-frequency index (tf) and a document-frequency index (df). tf is the count of a given word in a document; its mean across documents is used to construct the final index. df is the number of documents that contain a given word. The tf.idf used to filter words out is:
$\mathrm{tf.idf}_i = \operatorname{mean}_j(\mathrm{tf}_{ij}) \cdot \log_2\left(\frac{N}{\mathrm{df}_i}\right)$
where i indexes terms, j indexes documents and N is the number of documents in the corpus. The index gives high weight to frequent words through the tf component, but if a word is very prevalent throughout the corpus, its weight is reduced through the idf component. The aim of this filtering procedure is to remove both very infrequent and very frequent words, that is, words with low semantic content
Text mining and NLP: pre-processing and transformation
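A minimal sketch of the tf.idf filter described above, on a toy corpus of already-stemmed tokens. The function names, the toy documents and the cut-off `q = 0.25` are hypothetical (the slide does not state which quantile is used), and the mean of tf is taken over all documents, which is one possible reading of "mean tf".

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """docs: list of token lists. Returns {term: mean_j(tf_ij) * log2(N / df_i)}."""
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    vocab = set().union(*docs)
    scores = {}
    for term in vocab:
        tf_values = [c[term] for c in counts]       # 0 when the term is absent
        mean_tf = sum(tf_values) / n_docs           # assumption: mean over all documents
        df = sum(1 for v in tf_values if v > 0)     # documents containing the term
        scores[term] = mean_tf * math.log2(n_docs / df)
    return scores

def keep_terms(scores, q=0.25):
    """Drop the fraction q of terms with the lowest tf.idf score (q is hypothetical)."""
    ranked = sorted(scores, key=lambda t: (scores[t], t))
    n_drop = int(len(ranked) * q)
    return set(ranked[n_drop:])

docs = [
    ["rate", "inflat", "bank"],
    ["rate", "growth", "bank"],
    ["rate", "inflat", "inflat", "bank"],
    ["rate", "risk"],
]
scores = tfidf_scores(docs)
kept = keep_terms(scores)
for term in sorted(scores, key=scores.get):
    print(f"{term:8s} {scores[term]:.3f} {'kept' if term in kept else 'dropped'}")
```

In this toy corpus "rate" appears in every document, so its idf is $\log_2(4/4) = 0$ and it is the term that gets filtered out.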
Latent Dirichlet Allocation (LDA) (Blei, Ng, and Jordan 2003) is a Bayesian model with a prior distribution on the document-specific mixing probabilities, where the counts of terms within documents are independent and identically distributed given a Dirichlet prior distribution
To introduce time-series dependencies into the data-generating process, we use the dynamic topic model (DTM), a particular case of the Structural Topic Model (STM) in which each time period has a separate topic model and time periods are linked via smoothly evolving parameters
STM (Roberts et al. 2016) explicitly introduces covariates into a topic model, allowing us to estimate the impact of document-level covariates on topic content and prevalence as part of the topic model itself
The process for generating individual words is the same as for plain LDA. However, topic prevalence and topic content can each depend on a potentially different set of document-level covariates: for Topic Prevalence, each document has P attributes that can affect the likelihood of discussing topic k; for Topic Content, each document has an A-level categorical attribute that affects the likelihood of discussing term v overall, and of discussing it within topic k. The generation of the k and d terms is via multinomial logistic regression
Machine learning algorithms on text: LDA, STM and DTM
Words (Tokens): the basic unit of discrete data. Each word is represented as a unit vector with a single entry equal to 1 and 0 elsewhere; the vector has as many entries as there are distinct words under analysis.
Stop Words: words such as “a” or “the”, which are very frequent but don't add value
Document: a sequence of N words
Corpus: a collection of documents
Document-Term-Matrix: matrix where each row is the sum of the word vectors of all the words in a given document. Documents are in the rows, words in the columns, and each entry is the number of occurrences of a word in a given document
“Parsing” through (LDA): Some Basics
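To make these definitions concrete, the sketch below builds a Document-Term-Matrix by summing one-hot word vectors, exactly as the definition reads. The two toy documents and the function names are hypothetical; in practice a sparse count matrix would be built directly rather than through explicit one-hot vectors.

```python
def one_hot(word, vocab):
    """Unit vector with a single 1 at the position of `word` in `vocab`."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

def document_term_matrix(docs):
    """Rows are documents, columns are vocabulary words, entries are counts."""
    vocab = sorted({word for doc in docs for word in doc})
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for word in doc:
            # Each row is the sum of the one-hot vectors of the document's words.
            row = [r + v for r, v in zip(row, one_hot(word, vocab))]
        matrix.append(row)
    return vocab, matrix

docs = [["bank", "rate", "rate"], ["inflation", "rate"]]
vocab, dtm = document_term_matrix(docs)
print(vocab)
for row in dtm:
    print(row)
```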
Latent Dirichlet Allocation (LDA) (Blei et al. 2003) is a generative probabilistic (hierarchical Bayesian) model of a corpus. Documents are represented as mixtures over latent topics, where each topic is characterized by a distribution over words.
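Simplified corpus generative process, for each document:
$N \sim \mathrm{Poisson}(\zeta)$
$\theta \sim \mathrm{Dirichlet}(\alpha)$
for each word $w_n$, $n = 1, \dots, N$:
    topic $z_n \sim \mathrm{Multinomial}(\theta)$
    $w_n \sim p(w_n \mid z_n, \beta)$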
The Latent Dirichlet Allocation (LDA) Model
Source: Blei et al. 2003
Bag of Words assumption: the order of the words is not important; only their occurrence is relevant. This assumption is inherited, as LDA is an extension of the Latent Semantic Indexing algorithm (an SVD on the Document-Term-Matrix)
Words are conditionally independent and identically distributed: needed when working with latent mixtures of distributions, following de Finetti’s theorem (exchangeable observations are conditionally independent given some latent variable to which an epistemic probability distribution would then be assigned)
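The simplified generative process can be simulated directly, which helps to see what the bag-of-words and conditional independence assumptions imply. The sketch below draws one synthetic document; the number of topics, the vocabulary size and the values of `alpha`, `beta` and `zeta` are arbitrary assumptions, and this is a forward simulation only, not an estimation routine.

```python
import numpy as np

def generate_document(alpha, beta, zeta, rng):
    """Draw one document from the simplified LDA generative process.
    alpha: Dirichlet parameter (length K); beta: K x V topic-word probabilities."""
    n_words = rng.poisson(zeta)                        # N ~ Poisson(zeta)
    theta = rng.dirichlet(alpha)                       # theta ~ Dirichlet(alpha)
    # z_n ~ Multinomial(theta): one topic per word position
    topics = rng.choice(len(alpha), size=n_words, p=theta)
    # w_n ~ p(w_n | z_n, beta): one word id drawn from the chosen topic
    words = np.array([rng.choice(beta.shape[1], p=beta[z]) for z in topics],
                     dtype=int)
    return theta, topics, words

rng = np.random.default_rng(42)
K, V = 3, 8                                            # topics, vocabulary size
alpha = np.full(K, 0.5)
beta = rng.dirichlet(np.ones(V), size=K)               # each row sums to 1
theta, topics, words = generate_document(alpha, beta, zeta=50, rng=rng)

print("topic mixture:", np.round(theta, 3))
print("word counts  :", np.bincount(words, minlength=V))
```

Because each word is drawn independently given `theta`, shuffling `words` leaves the word counts, and therefore the document's row in the Document-Term-Matrix, unchanged.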
Structural Topic Model (Roberts et al. 2016) extends the LDA algorithm such that metadata (covariates) can affect the topic distribution. This allows us to introduce time series dependencies, estimating what is known as a Dynamic Topic Model
Topics can depend on 2 classes of covariates:
• Topic Prevalence (each document has P attributes that can affect the likelihood of discussing topic k)
• Topic Content (each document has an A-level categorical attribute that affects the likelihood of discussing term v overall, and of discussing it within topic k)
Extending the LDA: The Dynamic Topic Model
Dynamic Topic Models generative process:
$\gamma_k \sim \mathrm{Normal}(0, \sigma_k^2 I_P)$
$\theta_d \sim \mathrm{LogisticNormal}(\Gamma' x_d', \Sigma)$
$z_{d,n} \sim \mathrm{Multinomial}(\theta_d)$
$w_{d,n} \sim \mathrm{Multinomial}(B_{z_{d,n}})$
Source: Roberts et al. 2016
where $\Gamma$ is a $P \times (K-1)$ matrix of prevalence coefficients, d indexes documents,
n indexes words within documents and k indexes the latent topics
Structural Topic Model (Roberts et al. 2016) extends the LDA algorithm such that metadata (covariates) can affect the topic distribution. This allows us to introduce time series dependencies, estimating what is known as a Dynamic Topic Model (i.e., topics change over time). Topics can depend on 2 classes of covariates:
• Topic Prevalence (each document has P attributes that can affect the likelihood of discussing topic k)
• Topic Content (each document has an A-level categorical attribute that affects the likelihood of discussing term v overall, and of discussing it within topic k)
Extending the LDA: Structural Topic Model & The Dynamic Topic Model
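The prevalence part of the generative process can be illustrated with a short forward simulation. This is a sketch under stated assumptions, not the estimation procedure of Roberts et al.: the logistic-normal draw is implemented by sampling a (K - 1)-dimensional normal vector, fixing the last topic's value at 0 and applying a softmax, and the dimensions, `sigma`, `Sigma` and the covariate vector `x_d` are arbitrary. Topic content covariates are left out, so the topic-word matrix `B` is the same for every document.

```python
import numpy as np

def draw_theta(x_d, Gamma, Sigma, rng):
    """theta_d ~ LogisticNormal(Gamma' x_d, Sigma), via softmax of a normal draw."""
    eta = rng.multivariate_normal(Gamma.T @ x_d, Sigma)   # length K - 1
    eta = np.append(eta, 0.0)                              # reference topic (assumption)
    expd = np.exp(eta - eta.max())                         # numerically stable softmax
    return expd / expd.sum()

def generate_document(x_d, Gamma, Sigma, B, n_words, rng):
    theta = draw_theta(x_d, Gamma, Sigma, rng)
    z = rng.choice(B.shape[0], size=n_words, p=theta)      # z_{d,n} ~ Multinomial(theta_d)
    w = np.array([rng.choice(B.shape[1], p=B[k]) for k in z], dtype=int)
    return theta, z, w

rng = np.random.default_rng(1)
P, K, V = 2, 3, 6                                          # covariates, topics, vocabulary
sigma = np.ones(K - 1)                                     # one sigma_k per column of Gamma
Gamma = rng.normal(0.0, sigma, size=(P, K - 1))            # gamma_k ~ Normal(0, sigma_k^2 I_P)
Sigma = 0.1 * np.eye(K - 1)
B = rng.dirichlet(np.ones(V), size=K)                      # topic-word probabilities
x_d = np.array([1.0, 0.5])                                 # document-level covariates

theta_d, z_d, w_d = generate_document(x_d, Gamma, Sigma, B, n_words=20, rng=rng)
print("theta_d:", np.round(theta_d, 3))
```

Changing `x_d` shifts the mean $\Gamma' x_d$ and therefore the expected topic proportions, which is how a covariate such as time can move topic prevalence across documents.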
Sentiment analysis on text: lexicon approach
We rely on lexicon methods using the Loughran-McDonald dictionary (Loughran and McDonald 2009), a dictionary created specifically to analyze financial texts, and the FED dictionary for financial stability (Correa et al. 2017)
Using the negative and positive words of this dictionary, the average “tone” of a given document
is computed by:
$\text{Average tone} = 100 \times \dfrac{\text{Positive words} - \text{Negative words}}{\text{Total words}}$
The score ranges from -100 (extremely negative) to +100 (extremely positive) but common values range
between -10 and +10, with 0 indicating neutral
To build the final sentiment indices, we use a topic-mixture approach that combines dictionary methods
with the output of LDA to weight word counts by topic, following the approach proposed
by Hansen and McMahon (2015). This allows us to generate different sentiment measures from a set of texts
and to focus that sentiment on the topics of interest
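A minimal sketch of the average tone formula. The word lists below are tiny hypothetical stand-ins, not the Loughran-McDonald or FED dictionaries, and the topic weighting of Hansen and McMahon (2015) is not reproduced here.

```python
# Hypothetical mini-lexicons used only for illustration.
POSITIVE = {"improve", "strong", "gain"}
NEGATIVE = {"loss", "weak", "risk"}

def average_tone(tokens):
    """100 * (positive words - negative words) / total words."""
    if not tokens:
        return 0.0
    positive = sum(token in POSITIVE for token in tokens)
    negative = sum(token in NEGATIVE for token in tokens)
    return 100.0 * (positive - negative) / len(tokens)

tokens = "growth remains strong but downside risk and credit loss weigh".split()
tone = average_tone(tokens)
print(tone)
```

With one positive word and two negative words out of ten, the toy sentence scores 100 × (1 − 2) / 10 = −10, at the negative end of the range of common values.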
February 2018
Big Data at BBVA Research Big Data Workshop on economics and finance.
Bank of Spain
Alvaro Ortiz, Tomasa Rodrigo and Jorge Sicilia