Official Statistics
in the New Data Ecosystem
David J. Hand Imperial College, London
March 2015
Pascal’s wager on the existence of God: “You must wager.
It is not optional. You are embarked.”
The same is true for official statistics and the “new data ecosystem”. It’s out there, and one either ignores it, believing it’s inconsequential, or engages with it.
Official statisticians have to bet one way or the other.
Ignore it and you risk becoming irrelevant.
Engage and you are leaping aboard a treadmill which is getting faster and faster:
‐ tomorrow’s IT will be different from today’s (think Twitter, YouTube, etc.)
‐ “The future you have tomorrow won't be the same future you had yesterday.” Chuck Palahniuk
‐ “In times of change, learners inherit the Earth, while the learned find themselves beautifully equipped to deal with a world that no longer exists.” Eric Hoffer
What’s new in the world of data?
‐ source of data: automatic acquisition of data
‐ speed of acquisition of data: “streaming data”
‐ size of data sets: “big data”
‐ diversity of data
‐ complexity of data
Source: Modern data capture technologies
Automatic data collection:
‐ electronic measurements: point of sale credit card terminals, petrol pumps, contactless travel cards, phone records, emails, GPS, CCTV cameras, ...
Internet of things
Administrative data
Social media data – data directly from the web
“Properties” of automatic data collection:
‐ immediate
‐ complete ???
‐ untouched by human hands ???
Speed: Realtime data collection – and analysis
This has several major implications, threatening the position of NSIs?
1: timeliness
The balance of timeliness against accuracy:
Example: UK GDP
‐ 1st estimate: 44% of the data available by 25 days
‐ 2nd estimate: 88% by 55 days
‐ 3rd estimate: 85 days
What about the inflation rate?
Elaborate procedure to collect sample data
Contrast with direct recording from transactions:
“Google has created a new inflation measure – the Google Price Index – based on the cost of goods sold online which could prove more accurate and up‐to‐date than official statistics. Google's mountain of web shopping data could also be used by the online group for economic forecasting ahead of the publication of official statistics, the Financial Times reported.
Google's chief economist Hal Varian said that he is working on ‘predicting the present’ by using real‐time search data to forecast official figures which often are published at least a month after the period they cover.”
http://www.theguardian.com/business/2010/oct/12/google-create-new-inflation-measure
2: new kind of analytic tools needed
“Streaming data”: the data keep on coming, like water from a hose
Permanently executing analytic tools, processing the data as it arrives:
‐ anomalies
‐ changes
‐ summaries (trends, averages, variability, maxima, ...)
Realtime → automatic analysis
Contrast the more familiar: a fixed database
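A minimal sketch of a permanently executing summary, assuming the stream is simply an iterable of numbers. It uses Welford's online update for the mean and variance, which is one common choice and not something the talk prescribes; the class name `RunningSummary` and the helper `summarise` are illustrative.

```python
import math


class RunningSummary:
    """Keeps count, mean, variance and maximum without storing the stream."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean
        self.maximum = -math.inf

    def update(self, x):
        # Welford's update: numerically stable, one pass, constant memory.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)
        if x > self.maximum:
            self.maximum = x

    @property
    def variance(self):
        # Sample variance; defined as 0.0 until two values have arrived.
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


def summarise(stream):
    """Consume an iterable one value at a time and return its summary."""
    summary = RunningSummary()
    for value in stream:
        summary.update(value)
    return summary
```

Each arriving value is folded into the summary and can then be discarded, which is the contrast with querying a fixed database.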
Note: perhaps we cannot store the data once it has been processed
In that case we need to know what questions we will ask
We cannot later ask arbitrary questions, but only those that can be answered from our summary statistics
Summarising a stream
Subsetting a stream: sampling, but requires different approaches from classical survey sampling
Filtering a stream: accept only those cases which meet some criterion
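As an illustration of how stream sampling differs from classical survey sampling, here is a sketch of reservoir sampling (Algorithm R), which draws a uniform random sample of fixed size `k` from a stream whose length is not known in advance. The choice of this algorithm and the function names are assumptions made for the example; the talk does not name a method. A simple filter is shown alongside it.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Uniform random sample of up to k items from a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive at both ends
            if j < k:
                sample[j] = item
    return sample


def filter_stream(stream, predicate):
    """Lazily pass on only those cases that meet the criterion."""
    for item in stream:
        if predicate(item):
            yield item
```

Neither function needs a sampling frame or the full data set in memory: each item is seen once and either kept or dropped.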
Size: Large data sets
Census: the largest set of data collected by NSIs until relatively recently?
But now administrative data, register‐based data
‐ some countries (e.g. Scandinavian) ahead of the field
transaction data
even administrative data sets can be tiny compared with those arising from automatic data collection (e.g. social media, Google searches, Twitter messages, email transaction logs, phone logs, transport logs, ...)
Diversity of data
Survey, census, administrative, transaction, experimental, ...
Numerical tables, image, text, signal, networks, ...
Different kinds of data have different properties
‐ e.g. survey data: answers to the questions you choose, but slow and expensive to collect, response bias?
‐ e.g. transaction data: fine granularity, both spatial and temporal, immediate, but may not address the question you want; complete population coverage in principle, but rarely in practice
→ an opportunity: Perhaps data of different kinds can be combined synergistically, to overcome the problems of each individual kind
Stitching different kinds of data together
Linking
Matching
Merging
Technical challenges have begun to be addressed in different fields
e.g. medical: combining information from scans with traditional numeric and text files
Great potential benefit from cross‐disciplinary collaboration and awareness
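A very small sketch of linking and merging, assuming two hypothetical sources (a survey and an administrative register) that share `name` and `postcode` fields. The field names and the exact-match-after-normalisation rule are assumptions for illustration; real linkage usually has to cope with errors and missing identifiers that this ignores.

```python
def normalise(text):
    """Lower-case and strip everything except letters and digits."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def link_records(survey, admin):
    """Link survey records to admin records on normalised name and postcode.

    Returns (linked, unmatched): merged records, and survey records
    for which no admin record was found.
    """
    index = {}
    for rec in admin:
        key = (normalise(rec["name"]), normalise(rec["postcode"]))
        index[key] = rec

    linked, unmatched = [], []
    for rec in survey:
        key = (normalise(rec["name"]), normalise(rec["postcode"]))
        match = index.get(key)
        if match is None:
            unmatched.append(rec)
        else:
            # Merge: survey values win where both sources hold the same field.
            linked.append({**match, **rec})
    return linked, unmatched
```

The unmatched list matters as much as the linked one: who fails to link is itself a potential source of selection bias.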
“survey and census data is what they say: administrative and transaction data is what they do”
New forms of data are closer to social reality
Complexity of data
Especially, networks and linked data
‐ Social networks
‐ Cybersecurity
‐ Fraud detection
Ping‐Pong Fraud Ring Model (HBOS Data Mining Team, 2007)
Within the Mortgage Industry, in addition to individual fraudulent applications, fraud can also occur in groups often referred to as Fraud Rings. These groups can consist of applicants, brokers and/or other professionals that get together to cheat the system. The purpose of the Ping‐Pong Fraud Ring Model is to assist the Mortgage Fraud team in identifying these Fraud Rings.
This is a 3 step approach:
1. Ping‐Pong clustering: aggregating applications that have common names, addresses, telephone numbers, employers, etc
2. Ranking: inconsistency rules are used to rank the clusters in order of severity.
3. Linkage analysis: identifying links between customers, brokers, employers, and/or solicitors.
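The clustering step can be sketched as follows. This is not the HBOS team's implementation, which the slide does not describe in detail; it is a generic union-find grouping in which two applications land in the same cluster whenever a chain of shared attribute values connects them. The field names and the function `cluster_applications` are hypothetical.

```python
from collections import defaultdict


def cluster_applications(applications,
                         fields=("name", "address", "phone", "employer")):
    """Group application indices that are connected by shared field values."""
    parent = list(range(len(applications)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[max(root_i, root_j)] = min(root_i, root_j)

    first_seen = {}  # (field, value) -> index of first application with it
    for idx, app in enumerate(applications):
        for field in fields:
            value = app.get(field)
            if not value:
                continue  # missing values never link applications
            key = (field, value)
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx

    clusters = defaultdict(list)
    for idx in range(len(applications)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())
```

For example, if application 0 shares a phone number with application 1, and application 1 shares a name with application 2, all three end up in one cluster even though 0 and 2 have nothing directly in common. The ranking and linkage analysis steps would then operate on these clusters.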
OTHER CRITICAL ISSUES
Data quality
“most big data that have received popular attention are not the output of instruments designed to produce valid and reliable data amenable for scientific analysis”
Lazer et al 2014
Caution: No data set is without potential quality issues
Role and reputation of NSIs, as standard bearers of quality
Example: Mistaken idea that admin data/transaction data has no quality issues
Quality Assurance of Administrative Data
Administrative Data Quality Assurance Toolkit
UK Statistics Authority, January 2015
Top Tips
UKSA Administrative Data Quality Assurance Toolkit, 2015
These five tips summarise the main pointers for statistical producers to develop a good understanding of the quality issues of administrative data:
‐ Don’t trust the safeguards. Check if safeguards are functioning effectively
‐ Get involved: work and share with suppliers, such as through secondments and webinars, to develop a common understanding
‐ Raise a red flag: identify potential data quality concerns using input and output quality indicators and investigate anomalies
‐ See the big picture: identify what investigations and audits have been conducted and what they found
‐ Corroborate the evidence: confirm the levels and the trends shown by the stats derived from the admin data
Example: Even transaction data implies a complex sociological selection process
e.g. reject inference in credit scoring
Example: Instrumental failure
Wind speed in metres per second, against time
Example: Proportion of homeowner missing values vs time
The rate of change of technology
Implications:
‐ for series based on a particular technology, what happens when it changes, or disappears altogether (Windows XP)?
‐ surveys have been around forever, but Google Trends?
Statistical subtleties
The danger of “the data speak for themselves”
‐ selection bias
‐ regression to the mean
‐ individual behaviour vs population behaviour
Are wages improving or remaining constant?
A1: calculate the median wage at time 1 and time 2
A2: calculate the change in the wages of individuals in the population
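The two answers can disagree. The sketch below uses invented wage figures (not data from the talk) in which every person present at both times gets a rise, yet the median is unchanged because the highest earner leaves and a lower-paid entrant joins.

```python
from statistics import median


def median_change(wages_t1, wages_t2):
    """A1: change in the population median between the two times."""
    return median(wages_t2.values()) - median(wages_t1.values())


def individual_changes(wages_t1, wages_t2):
    """A2: wage change for each person observed at both times."""
    return {
        person: wages_t2[person] - wages_t1[person]
        for person in wages_t1
        if person in wages_t2
    }


# Hypothetical figures: c leaves the workforce, d enters on a lower wage.
wages_t1 = {"a": 20, "b": 30, "c": 40}
wages_t2 = {"a": 30, "b": 40, "d": 20}
```

Here `median_change(wages_t1, wages_t2)` is 0 while `individual_changes(wages_t1, wages_t2)` shows a rise of 10 for both `a` and `b`, so "remaining constant" and "improving" are each defensible depending on which question is being asked.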
![Page 27: Official Statistics in the New Data Ecosystem · Crime maps “For many years, all forces have mapped crimes and incidents to help them focus investigations, analyse hot spots and](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f78e77e708231d4445538/html5/thumbnails/27.jpg)
27
Google Flu Trends
“In February 2013, ... Nature reported that [Google Flu Trends] was predicting more than double the proportion of doctor visits for influenza‐like illness than the Centers for Disease Control and Prevention, which bases its estimates on surveillance reports from laboratories across the United States ... despite the fact that GFT was built to predict CDC reports.”
Lazer et al 2014
Initial version: find best matches from 50 million search terms for 1152 data points
Overfitting! → part flu detector, part winter detector
Updated, 2009: still consistently overestimated flu prevalence (100 out of 108 consecutive weeks); autocorrelated errors, etc.
Some understanding of classical time series modelling could help
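Why searching a huge pool of candidate terms against few data points overfits can be seen in a small simulation. This is an illustrative sketch on pure random noise, scaled far down from the 50 million terms and 1152 points quoted above; it is not a reconstruction of the Google Flu Trends procedure, and all names and default values are assumptions.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def best_noise_predictor(n_points=20, n_candidates=2000, seed=0):
    """Pick the noise series best correlated with a noise target.

    Returns (in-sample correlation, correlation on fresh holdout points)
    for the winning candidate. No candidate has any real relationship
    with the target.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(2 * n_points)]
    best_in, best_out = 0.0, 0.0
    for _ in range(n_candidates):
        cand = [rng.gauss(0, 1) for _ in range(2 * n_points)]
        r_in = pearson(cand[:n_points], target[:n_points])
        if abs(r_in) > abs(best_in):
            best_in = r_in
            best_out = pearson(cand[n_points:], target[n_points:])
    return best_in, best_out
```

The winning candidate looks strongly related to the target on the points used to select it, but that apparent relationship does not carry over to the holdout points, because it was produced by the search itself.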
Correlation = 0.992
Predicting the financial markets
Preis T, Moat H.S., and Stanley H.E. (2013) ‘Quantifying trading behavior in financial markets using Google trends.’ Scientific Reports, 3, 1684.
‘By analyzing changes in Google query volumes for search terms related to finance, we find patterns that may be interpreted as ‘‘early warning signs’’ of stock market moves.’
Chose 98 search terms related to stock markets
Implemented a strategy which
‐ sold DJIA following an increase in searches
‐ and bought following a decrease in searches
Concluded: "We find that returns from the Google Trends strategies are significantly higher overall than returns from the random strategies (<R>=0.60; t=8.65, df=97, p<0.001, one sample t‐test)."
But this is an inappropriate test, telling us nothing about the effectiveness of the strategy
1) the 98 terms were purposively sampled, not a random sample from a population of possible terms: the authors confused fixed and random effects;
2) the standard deviation used in the denominator of their t‐test is irrelevant to the variability of the mean return: it’s the wrong statistic;
3) they ignored the correlation between the random aspect of the different words’ returns: anyone working in the financial sector will be aware of the critical importance of correlation!
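To see why point 3 matters, consider a simplified setting that is an assumption for illustration only, not the authors' model: the n strategy returns R_1, ..., R_n share a common variance σ² and a common pairwise correlation ρ. The variance of their mean is then

```latex
\operatorname{Var}(\bar{R})
  = \frac{1}{n^{2}} \sum_{i=1}^{n} \sum_{j=1}^{n} \operatorname{Cov}(R_i, R_j)
  = \frac{\sigma^{2}}{n} \bigl[ 1 + (n-1)\rho \bigr]
```

With n = 98 and, say, ρ = 0.5, the bracket is 1 + 97 × 0.5 = 49.5, so the standard error of the mean is about 7 times the value σ/√n that an independence assumption would give, and a t statistic computed on that assumption is inflated by the same factor.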
Google Price Index
My earlier quote about the Google Price Index was from 2010
Where is the GPI now?
“[Hal Varian, Google’s Chief Economist] responded that the GPI was never intended to be a public project, data source, etc. It was simply a project internal to Google that got hyped up by the press”
http://money.stackexchange.com/questions/21629/what-happened-to-the-google-price-index
Data not designed to answer the research or policy question may not be well‐suited to answer that question
‐ administrative data to replace the census?
Crime maps
“For many years, all forces have mapped crimes and incidents to help them focus investigations, analyse hot spots and tackle crime vigorously. The information now on the forces’ websites has a different more community‐focused perspective and means the public can now look at crime levels in their community simply by putting their postcode into their local police force’s website.”
www.crimemaps.org.uk
“More than 5.2 million people have not reported crimes for fear of deterring home buyers or renters since the online crime map was launched in February 2011”
“A quarter (24 per cent) of people would not report a crime for fear it would harm their chances of selling or renting their property”
http://www.directline.com/media/archive-2011/news-11072011
Trust, privacy, ...
‐ people readily give data to supermarkets, travel companies, phone companies, credit card companies, ...
‐ government misuse, targeting subgroups. Formal separation of statistical offices from government?
Anonymisation
De‐identification
Trusted Third Party linkage methods
→ need for statisticians to learn new skills
→ broader data scientists
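One ingredient often used in third-party linkage is replacing direct identifiers with keyed hashes, so that two data holders can be linked on tokens without exchanging the identifiers themselves. The sketch below assumes a secret key held only by the trusted third party and uses Python's standard `hmac` module; the function name and the example identifier are hypothetical. This is pseudonymisation, not full anonymisation, and the talk does not specify any particular method.

```python
import hashlib
import hmac


def pseudonymise(identifier, secret_key):
    """Replace an identifier with a keyed SHA-256 token (hex string).

    The same identifier and key always give the same token, so tokens
    can be used as a linkage key. Without the key, the token cannot be
    recomputed from a guessed identifier.
    """
    normalised = identifier.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalised, hashlib.sha256).hexdigest()


# Hypothetical usage: both data holders apply the same key via the third party.
token_a = pseudonymise("AB123456C", b"key-held-by-third-party")
token_b = pseudonymise(" ab123456c ", b"key-held-by-third-party")
```

The two tokens match, so the records can be linked, while the remaining attributes in each record may still allow re-identification and need separate disclosure control.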
Interaction with the commercial world
The credit bureau model: everyone benefits
NSI takes fine granularity commercial (e.g. transaction) data, combines it with public data, analyses it, and feeds back the results to the commercial body – for a price
NSI retains control, protects public
The skills gap
What skills are needed?
More than the current skills
How many NSI statisticians are at ease with realtime transaction data?
Big data and open data are only useful if you have the skills to extract information
When the judge told barrister Frederick Edwin Smith: ‘I’ve listened to you for an hour and I’m none the wiser’
Smith replied: ‘None the wiser, perhaps, my lord, but certainly better informed.’
But being better informed is not much use if you cannot make good use of that better information
Wisdom is also necessary
We need the skills to extract wisdom from information
Conclusions
‐ huge opportunities
‐ technical and social challenges
‐ which must be grasped: Pascal’s wager: we are embarked; we cannot bury our heads in the sand
‐ but we need the capability to grasp the opportunities
‐ we need to train people with the skills
thank you