TRANSCRIPT
02 December 2019
Using alternative data sources to produce consumer price indices
Liam and Lefteris
4 December 2019
Liam Greenhough
Consumer Prices Methods Transformation
Overview of the Alternative Data Sources Project
How price statistics are measured
In January, select 700 “items” to track over the year. This is known as the fixed basket. Each year the basket is “refreshed” to account for changing consumer behaviours.
How price statistics are measured
For each item, select a group of products to track over the year.
Each item is an aggregate – but is also a “subset” of higher aggregates.
How price statistics are measured
Collect prices of products each month. These are collected:
• Locally, and
• Centrally
Approximately 180,000 price quotes are collected per month.
How price statistics are measured
Use index formulae to compare prices of products across months. The most common index is the Jevons.
Use weights to aggregate upwards to higher-level indices.
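As a rough illustration of the two steps above, the sketch below computes a Jevons index (the unweighted geometric mean of price relatives) for each item and then combines the item indices using weights. The item names, product names, prices and weights are made up, and the weighted arithmetic mean used for aggregation is an assumption, not a description of the production method.

```python
import math


def jevons_index(base_prices, current_prices):
    """Unweighted geometric mean of price relatives, scaled so the base month = 100."""
    relatives = [
        current_prices[p] / base_prices[p] for p in base_prices if p in current_prices
    ]
    if not relatives:
        raise ValueError("no matched products")
    return 100 * math.exp(sum(math.log(r) for r in relatives) / len(relatives))


def aggregate(item_indices, weights):
    """Weighted arithmetic mean of item indices (assumed aggregation rule)."""
    total_weight = sum(weights[i] for i in item_indices)
    return sum(item_indices[i] * weights[i] for i in item_indices) / total_weight


# Hypothetical prices for two items, each tracked through two products.
january = {
    "bread": {"loaf_a": 1.00, "loaf_b": 2.00},
    "milk": {"pint_a": 0.50, "pint_b": 0.80},
}
june = {
    "bread": {"loaf_a": 1.10, "loaf_b": 2.20},
    "milk": {"pint_a": 0.50, "pint_b": 0.80},
}
weights = {"bread": 3, "milk": 1}  # hypothetical expenditure weights

item_indices = {item: jevons_index(january[item], june[item]) for item in january}
higher_level_index = aggregate(item_indices, weights)
```

Only products present in both months enter the index here; how unmatched products are handled is a separate methodological choice.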
[Chart: CPIH compared to the current Bank of England inflation target. Series: CPIH, Inflation target. X-axis: January 1996 to January 2017. Y-axis: 0 to 5.]
Consumer Price Statistics: Alternative Data Sources
Alternative Data
Looking to implement two new data sources:
• Scanner data – transactional data from large retailers
• Web scraped data – data scraped from online retailers
Aim to use these in conjunction with traditional data sources!
Alternative Data – targeted items
Data dimension      Traditional                Scanner data                                            Web scraping
Data acquisition    Manual                     Automated                                               Automated
Completeness/scope  Sample from all retailers  All transactions (bulk) from medium to large retailers  Bulk or sample from online retailers
Metadata            Item description           Item description + limited attributes                   Item description + attributes
Quantity data       N/A                        Quantities sold                                         N/A
Timing              Single collection day      Daily                                                   Daily
Big data
System needs to process big data.
Traditional data sources: ~180,000 price quotes per month
Scanner data: ~100,000,000 price quotes per month
The Team
(Prices) Data Transformation
(Prices) Methods Transformation
Emerging Platforms
Methodology
Some of the research
Scalability
It is not possible to manually scrutinise big data, e.g. for classification.
Product Churn – synthetic
[Chart: product churn on synthetic data. Y-axis: 0 to 100. X-axis: month pairs from Jan:Jan to Jan:Dec.]
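One way to quantify product churn is the share of January's products that are still on sale in each later month. The sketch below assumes that this is what a Jan:Jan to Jan:Dec comparison measures. The product identifiers are synthetic and the function name `matched_share` is hypothetical.

```python
def matched_share(base_products, month_products):
    """Percentage of base-month products also present in the comparison month."""
    base = set(base_products)
    if not base:
        raise ValueError("base month has no products")
    return 100 * len(base & set(month_products)) / len(base)


# Synthetic product identifiers observed in each month.
products_by_month = {
    "Jan": {"p1", "p2", "p3", "p4"},
    "Feb": {"p1", "p2", "p3", "p5"},
    "Mar": {"p1", "p2", "p6", "p7"},
}

churn = {
    f"Jan:{month}": matched_share(products_by_month["Jan"], ids)
    for month, ids in products_by_month.items()
}
```

A falling matched share means fewer products can be compared like for like against January, which is the difficulty that churn poses for a fixed-basket approach.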
The combination problem
Lots of steps to calculate indices
Different methods at each step
Leads to many potential combinations!
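To give a feel for how quickly the combinations multiply, the sketch below enumerates every way of picking one method per step. The step names and method options are illustrative placeholders only (apart from Jevons, which is mentioned earlier) and are not the team's actual list.

```python
from itertools import product

# Hypothetical steps and candidate methods at each step.
steps = {
    "classification": ["manual rules", "machine learning"],
    "outlier filtering": ["none", "method A", "method B"],
    "index formula": ["Jevons", "formula B", "formula C", "formula D"],
}

# One dictionary per combination: a single method chosen for every step.
combinations = [dict(zip(steps, choice)) for choice in product(*steps.values())]
number_of_combinations = len(combinations)  # 2 * 3 * 4
```

Even three steps with a handful of options each give 24 candidate pipelines, and every extra step multiplies that count again.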
Our plans
may be brought forward with quarterly publications if feasible to do so
item coverage depends on data availability
Lefteris Karachalias
Emerging Platforms Development and Support Team
Consumer Prices Data Transformation: Development
Overview
• The system
o Overall system architecture
o User interaction
• Development framework
o Development project delivery team
o Tools
o Documentation
o Dev&Test
Overall system architecture
[Diagram: overall system architecture. Labels shown: Data supplier; FTP (Raw); Data Engineers; DAP: Landing zone (Raw); DAP: Workspace zone (Staged, Processed, Analysis); Core pipeline; Analysis pipeline; Research/Publication. Further labels: Validation, Staging, Classification, Decision rules, Retailer, Expenditure weights.]
Core pipeline
Multiple configuration scenarios
[Diagram: staged data is run through three pipeline configurations (config 1, config 2, config 3). Each configuration produces a Stage 1, Stage 2 and Stage 3 output. The processed data holds one table per stage, each with scenario and stage columns.]

scenario  stage
1         1
2         1
3         1

scenario  stage
1         2
2         2
3         2

scenario  stage
1         3
2         3
3         3
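The layout above can be sketched in plain Python: the same staged data is run through several configurations, every stage output is tagged with its scenario and stage, and the results are collected into one table per stage. The stage functions and configuration contents are hypothetical stand-ins; the real pipeline runs on PySpark and is not shown here.

```python
def double(value):
    return value * 2


def add_one(value):
    return value + 1


def identity(value):
    return value


def run_pipeline(staged_rows, config, scenario_id):
    """Run each configured stage in order and tag every output row."""
    outputs = {}
    rows = staged_rows
    for stage_id, stage in enumerate(config["stages"], start=1):
        rows = [stage(row) for row in rows]
        outputs[stage_id] = [
            {"scenario": scenario_id, "stage": stage_id, "value": row} for row in rows
        ]
    return outputs


# Three hypothetical configurations, each defining three stages.
configs = {
    1: {"stages": [double, add_one, identity]},
    2: {"stages": [add_one, double, identity]},
    3: {"stages": [identity, identity, double]},
}

staged = [1, 2]
processed = {1: [], 2: [], 3: []}  # one table per stage, holding all scenarios
for scenario_id, config in configs.items():
    for stage_id, rows in run_pipeline(staged, config, scenario_id).items():
        processed[stage_id].extend(rows)
```

Keeping the scenario and stage identifiers on every row means outputs from different configurations can share a table and still be compared side by side.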
User interaction
• UI: CDSW / HUE
• Manual
• Configuration
• Mappers (BAU)
• Dashboard (cannot share VDI)
• Output tables
Development framework (1)
• Project delivery team
• Development phase: Between Discovery and Alpha phase
• Agile, Jira
• DAP, PySpark, HDFS, HIVE (sensitivity)
• Git, GitLab
Project delivery team
Project Specialist
Technical Lead
Methods Specifier
Business Analyst
Configuration Engineer/Developer
Configuration Architect
Tester
Product Owner
Development framework (2)
• Unit testing, CI with Jenkins, UAC
• Documentation: Sphinx, user manuals
• Business Analysis: models, Sparx
• Business Architecture: pushing to the SML
• Synthetic data, Dev&Test, packaging
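As an indication of how unit testing and synthetic data can fit together, the sketch below generates reproducible synthetic prices and checks a small index helper against them with the standard `unittest` module. Both `make_synthetic_prices` and `geometric_mean_relative` are hypothetical names written for this example and are not taken from the project's code base.

```python
import math
import random
import unittest


def make_synthetic_prices(n_products, seed=0):
    """Reproducible synthetic prices, so no real data is needed in Dev&Test."""
    rng = random.Random(seed)
    return {f"product_{i}": round(rng.uniform(0.5, 20.0), 2) for i in range(n_products)}


def geometric_mean_relative(base, current):
    """Geometric mean of current/base price relatives over the base products."""
    relatives = [current[p] / base[p] for p in base]
    return math.exp(sum(math.log(r) for r in relatives) / len(relatives))


class TestGeometricMeanRelative(unittest.TestCase):
    def setUp(self):
        self.base = make_synthetic_prices(50)

    def test_unchanged_prices_give_one(self):
        self.assertAlmostEqual(geometric_mean_relative(self.base, self.base), 1.0)

    def test_uniform_increase_is_recovered(self):
        current = {p: price * 1.05 for p, price in self.base.items()}
        self.assertAlmostEqual(geometric_mean_relative(self.base, current), 1.05)
```

Tests like these can be run with `python -m unittest`, which is also the kind of command a CI job could call on each commit.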
Statistical process model
Data (journey) model
Input Processing Output
Dev and Test environment
[Diagram: Dev and Test environment. Labels shown: Prod; Dev&Test; Real data; Synthetic data (twice); package (twice); CDSW prod; CDSW dev; CI Jenkins; Output.]
Thank you!
Any questions?