June 2007

A vision of Britain through time: online access to “statistical heritage”

Statistics are not just for statisticians: planners, historians, schoolchildren and those researching their ancestry all want access to the records of the past. Humphrey Southall explains how data from our history have been made available to everyone who wants to know how the place that they live in has changed.

Britain has systematically gathered geographically based statistics for over two centuries, since the first census of 1801, but for most statisticians the past runs back only to the 1970s, when computerised data first became available. There are many practical justifications for going further back. The simplest is that people live longer than 40 years: by combining 1931 census data with the Office for National Statistics’s (ONS’s) Longitudinal Study, we showed that children growing up in areas of higher unemployment in the 1930s have worse health today than others of their age.¹

Many other trends work themselves out over decades, not years. The Greater London Authority needed data on industrial structure in the Thames Gateway back to the 1950s; the Environment Agency has funded work with the 1930s Land Utilisation Survey of Great Britain. These uses of more than slightly historical data have immediate policy applications, but there is also a large demand from the public. Local and family historians (“life-long learners”) seek information about particular towns and villages. Schools are the most numerous institutional census users, and all English schoolchildren must do “a study investigating how an aspect in the local area has changed over a long period of time”. “Aspects” that have been suggested include: education, population movement, houses and housing, religious practices, and treatment of the poor and care of the sick (see http://www.nc.uk.net, history key stage 2).

A Vision of Britain through Time is a website funded by the UK National Lottery particularly to serve life-long learners, but the ONS also provided strong backing because it regularly receives requests from schools for historical census data “for their area”, which the website can now provide. Much of the content was funded by earlier grants from the Economic and Social Research Council and research charities. The Higher Education Funding Council for England recently funded the addition of British election statistics since 1832, to be launched in 2009.

The spatial database behind the website brings together data from a huge range of surveys of Britain, including historical mapping and even travel narratives such as Daniel Defoe’s A Tour through the Whole Island of Great Britain (1724–1727), all integrated through geography. However, this article emphasises statistical content, which includes local data from every census between 1801 and 2001, plus extensive vital statistics. The website can be found at www.VisionOfBritain.org.uk.

Anyone needing census data to see “how an aspect of their area has changed” over the last 200 years faces a series of problems. Few libraries hold complete runs of reports; assembling time series means finding area-specific data within the reports of separate censuses, each differently organised. There were drastic changes in reporting geography in 1851, 1911 and 1981, plus a constant trickle of boundary revisions. Classifications also constantly change and, once geographies and classifications have been standardised over time, most users will only see trends when numbers are presented graphically. Our system solves these problems for the user. From the disparate data of many sources it gives graphical results. It is literally “a vision of Britain through time”: an on-line system for time series visualisation. Getting there from mountains of dusty reports in library basements posed several challenges.

The first stage was computerising old reports. If the aim had been just to let people “look at” census reports on the web, computerised images would have been sufficient, and cheap. However, not only would this have failed to address most of the problems listed above, but it is also much harder to browse on-line images than the original pages. We opted instead for full conversion of the scanned pages into text and numbers, combining optical character recognition with manual checking. Selectivity was inevitable, and we focused on tables that appeared in several successive reports and included plenty of geography. For example, we computerised the parish level tables for every census from 1851 to 1961, computing checksums and cross-checking district totals against age-structure tables. The resulting statistics provide population counts for every administrative level up to national totals. The accuracy achieved by the original reports without using computers is remarkable. For example, Tables 16 and 17 in the 1931 Occupations report for England and Wales contain 464 999 data values and only 10 numeric inconsistencies.

The resulting tables, which are easily usable in spreadsheets, are sufficient for much academic research but not for Vision of Britain, or for large-scale research into long-run geographical change. As far as we know, all other large statistical databases are collections of tables, which researchers must download before the numbers within them can be used analytically, but behind Vision of Britain is a database of numbers held in one single column of one table, currently with 11 638 984 rows. Other columns record each number’s meaning, via codes defined in our sub-systems. See Figure 1 for an overview of the architecture used.
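The idea of holding every number in one column can be sketched with a small in-memory SQLite table. This is only an illustration under stated assumptions: the column names, the unit and table codes and the values are all invented here, and the real database schema is not described in this article.

```python
import sqlite3

# Hypothetical schema: one row per number, with codes saying what it means.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE data (
        value   REAL    NOT NULL,  -- every statistic lives in this one column
        unit    TEXT    NOT NULL,  -- where: code for an administrative unit
        year    INTEGER NOT NULL,  -- when: simplified here to a census year
        cell    TEXT    NOT NULL,  -- what: code for the quantity measured
        source  TEXT    NOT NULL   -- source: code for the originating table
    )
    """
)

# Invented values and codes, for illustration only.
rows = [
    (1000.0, "unit_A", 1851, "total_population", "table_1851_x"),
    (1250.0, "unit_A", 1861, "total_population", "table_1861_y"),
    (400.0, "unit_B", 1851, "total_population", "table_1851_x"),
]
conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?, ?)", rows)


def time_series(unit, cell):
    """Return (year, value) pairs for one unit and one measure, oldest first."""
    cursor = conn.execute(
        "SELECT year, value FROM data WHERE unit = ? AND cell = ? ORDER BY year",
        (unit, cell),
    )
    return cursor.fetchall()


series = time_series("unit_A", "total_population")
```

Because the two rows for `unit_A` came from different source tables, the query shows how a time series can be assembled without knowing which report each number was published in.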

Source

Users need to know where numbers come from, and to be able to see them in original context. Our source documentation system holds complete lists of British censuses up to 1961, of the reports each census published and, crucially, of the 5028 tables in those reports. We can therefore identify the table each number came from. We also record the column and row within the table so that we can reconstruct the original tables. Our coverage of census reports is a reference resource in its own right, including a detailed guide² and, as we add them, the texts of the preliminary and general reports plus selected other reports (see http://www.VisionOfBritain.org.uk/census).

When

Our date objects can store anything from a simple year value, for census data, to a period defined by two calendar dates, for decennial mortality data.
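One way to picture such a date object is as a pair of calendar dates, with a plain year treated as a special case. The sketch below is an assumption for illustration: the class name, the choice to expand a year to 1 January to 31 December, and the `overlaps` helper are not taken from the Vision of Britain system.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DateObject:
    """A period bounded by two calendar dates (both inclusive)."""

    start: date
    end: date

    @classmethod
    def from_year(cls, year):
        # Assumption: a bare year is represented as the whole calendar year.
        return cls(date(year, 1, 1), date(year, 12, 31))

    def overlaps(self, other):
        """True if the two periods share at least one day."""
        return self.start <= other.end and other.start <= self.end


census_1861 = DateObject.from_year(1861)
decade_1861_1870 = DateObject(date(1861, 1, 1), date(1870, 12, 31))
```

With a single representation, a census year and a decennial period can be compared directly, for example with `census_1861.overlaps(decade_1861_1870)`.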

Where

A place is a place is a place; however, published census data cover not “places” but administrative units with legally defined boundaries. The 1871 report covered the following: counties; parliamentary divisions and parliamentary boroughs; hides, tythings, hundreds, wapentakes, wards, etc.; lieutenancy sub-divisions; petty sessional divisions; police divisions; highway districts; local board districts; boroughs and towns with improvement commissioners under local acts; civil parishes and townships, and extra-parochial places; military districts and sub-districts; post office districts; inland revenue districts; poor law unions; registration districts and sub-districts; census enumeration districts.³

Figure 1. Overview of Vision of Britain architecture: the data table is linked to the date object (When?), the source documentation system for census reports (Source?), the gazetteer (Where?) and the data documentation system (What?)

For example, the place-name “Reading” refers to 14 distinct units (including Reading St Giles, Reading St Lawrence and Reading St Mary), all based on Berkshire’s county town, many existing simultaneously.

Although we had already computerised boundaries for pre-1911 registration districts, post-1911 local government districts and, at a very detailed level, civil parishes, Vision of Britain is based not on a geographical information system but on an ontology (a “specification of a conceptualisation” but, more practically, see http://www.w3.org/TR/owl-features), which records what units existed, their quasi-hierarchical relationships with other units and the various names each went by. The naming of Welsh parishes by successive censuses is particularly confused. If we have boundaries for a unit, they are linked to this structure and are used in statistical mapping. However, we can still hold data for units not yet mapped, and present time series. This administrative unit gazetteer is based on existing name authorities identified by the National Council on Archives (http://www.ncaonline.org.uk/materials/namingrules.pdf), and is again a reference resource in its own right, currently defining 51 992 units, 64 208 unit names and 179 667 relationships between units (see http://www.VisionOfBritain.org.uk/units).
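The three things the gazetteer records (units, their names and their relationships) can be sketched as three small structures. The identifiers, unit types and the `is_part_of` relation label below are assumptions made for illustration; only the place names come from the Reading example above.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    unit_id: int
    unit_type: str                              # hypothetical type label
    names: list = field(default_factory=list)   # a unit may have several names


# Illustrative entries only; ids and types are invented.
units = {
    1: Unit(1, "county", ["Berkshire"]),
    2: Unit(2, "parish", ["Reading St Giles"]),
    3: Unit(3, "parish", ["Reading St Mary"]),
}

# Relationships are stored separately as (child, relation, parent) triples,
# so a unit can have several parents: the structure is only quasi-hierarchical.
relationships = [
    (2, "is_part_of", 1),
    (3, "is_part_of", 1),
]


def units_named(text):
    """Find every unit with a name containing the text (case-insensitive)."""
    needle = text.lower()
    return [
        unit for unit in units.values()
        if any(needle in name.lower() for name in unit.names)
    ]


def parents(unit_id):
    """Return ids of the units that the given unit is recorded as part of."""
    return [
        parent for child, relation, parent in relationships
        if child == unit_id and relation == "is_part_of"
    ]


reading_units = units_named("reading")
```

Nothing in this structure requires a boundary polygon, which is why data can be held and graphed for units that have not yet been mapped.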

What

The biggest challenge was designing a system to record what each number measured, not as unstructured text but so as to define how data should be presented in maps and time series. Our data documentation system is based on the aggregate data extension to the data documentation initiative standard (http://www.icpsr.umich.edu/DDI), again organised as an ontology. It is highly abstract but not unmanageably complex. In brief, every documented data value is located within an nCube, which is essentially a matrix whose dimensions are defined by variables, each consisting of a set of categories. For example, the most complex nCubes currently held represent data from the Registrar General’s decennial supplements. The 1861–1870 supplement defines a 2 × 12 × 25 nCube: sex by age group by cause of death. Once we mapped the five different categorisations used between 1851 and 1910 to a single simplified system, a single query generated 796 740 new data values charting mortality trends over 60 years (see http://www.VisionOfBritain.org.uk/data).
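A minimal sketch of an nCube, and of recoding one of its variables to a simpler categorisation, might look as follows. The class, the placeholder category labels (`age_0`, `cause_0`, `group_A` and so on) and the counts are all hypothetical; only the 2 × 12 × 25 shape comes from the text.

```python
from itertools import product
from math import prod


class NCube:
    """A matrix whose dimensions are variables, each a list of categories."""

    def __init__(self, variables):
        self.variables = variables  # dict: variable name -> list of categories

    @property
    def shape(self):
        return tuple(len(cats) for cats in self.variables.values())

    def cells(self):
        """Every combination of one category per variable."""
        return list(product(*self.variables.values()))


# Placeholder labels: the real age groups and causes are not listed here.
mortality = NCube({
    "sex": ["male", "female"],
    "age_group": [f"age_{i}" for i in range(12)],
    "cause": [f"cause_{i}" for i in range(25)],
})


def recode(values, variable_index, mapping):
    """Sum cell values after mapping one variable's categories to new ones."""
    totals = {}
    for key, value in values.items():
        new_category = mapping[key[variable_index]]
        new_key = key[:variable_index] + (new_category,) + key[variable_index + 1:]
        totals[new_key] = totals.get(new_key, 0) + value
    return totals


# Two invented counts whose causes fall into the same simplified group.
values = {
    ("male", "age_0", "cause_0"): 3,
    ("male", "age_0", "cause_1"): 4,
}
simplified = recode(values, 2, {"cause_0": "group_A", "cause_1": "group_A"})
```

Here `prod(mortality.shape)` gives 600 cells, and `simplified` holds a single combined count for `group_A`. Applying one such mapping per historical categorisation is the kind of step that lets differently classified decades be compared.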

Two additional columns in the data table record status, allowing us to use semi-confidential data in graphs but exclude them from tables, and precision, identifying numbers that are estimates, derived from 10% samples, for example. This unique architecture enables us to hold any amount of statistical data without adding more database tables. Our software therefore always knows where to find the statistics to be mapped and graphed: in the data column of the data table. It works out what kind of graph to create from nCube characteristics. For example, it created the population pyramids shown in Figure 2, not because the variables were age and sex, but because the values were in a two-dimensional nCube, with one variable having only two categories.
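Choosing a graph from the shape of an nCube, rather than from what its variables mean, can be sketched as a small rule. Only the pyramid rule is taken from the text; the `bar` and `table` fallbacks, the function name and the example shapes are assumptions added to make the sketch complete.

```python
def choose_chart(shape):
    """Pick a chart type from the number of categories in each dimension."""
    # Rule described in the article: two dimensions, one of them with
    # exactly two categories, gives a pyramid.
    if len(shape) == 2 and 2 in shape:
        return "pyramid"
    # Assumed fallbacks, not described in the article.
    if len(shape) == 1:
        return "bar"
    return "table"


chart_for_age_by_sex = choose_chart((2, 18))       # e.g. sex by 18 age bands
chart_for_sectors = choose_chart((6,))             # e.g. six industrial sectors
chart_for_mortality = choose_chart((2, 12, 25))    # sex by age by cause
```

The function never looks at variable names, so any two-category breakdown of a second variable would be drawn the same way as age by sex.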

The 1881 census report complained that:

“… the difficulties increase each Census with the formation of new areas. One great addition to the labour was caused … by the institution in the last decade of Sanitary Districts. … the Urban Sanitary Districts, nearly a thousand in number, with areas defined very frequently without any apparent regard to other administrative areas, have added very materially to the toil of our work.”⁴

They would have complained still more had they known sanitary districts would be abandoned after 1891. The ONS’s new “super output areas” will hopefully last longer, but even registration districts (624 units in England and Wales in 1851) and local government districts (1841 units in 1911 dropping to 1366 in 1971), and the associated kinds of county district, lasted only 60 years or so. Studying long-term trends therefore requires retrospective standardisation of census geographies. Our method is based on knowing not just the boundaries of the original reporting units, which limits us to registration and local government districts, but also the more detailed boundaries of civil parishes, with their populations. Reliable results require that the output units have a generally simpler geography, so we present standardised time series only for the 408 districts and unitary authorities that existed in 2001, and counties and regions based on them. The modern boundaries were supplied by the ONS and the General Register Office (Scotland).

Figure 2. Age structure of Reading (2001 census boundaries): (a) 1851; (b) 2001

For each census from 1801 to 1961 we constructed geography conversion tables⁵ by estimating the proportion of each historical unit’s population falling into each 2001 district via vector overlay of the boundaries, assuming that people were evenly distributed across each historical parish. Using these conversion tables to standardise, say, unemployment means also assuming that the proportion of people unemployed was constant across each historical district. This second assumption is obviously questionable, but most urban historical units fall entirely within a single modern district. Vision of Britain also presents more recent historical data for the period 1971–1991, but these were redistricted separately by using the linking censuses through time system developed by Danny Dorling (see http://census.ac.uk/cdu/software/lct).
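The two assumptions in this method can be made concrete with a toy example. The sketch below assumes the vector overlay has already been done and takes its output (the share of each parish’s area falling in each modern district) as given; all unit codes, populations and counts are invented.

```python
from collections import defaultdict

# parish -> (historical district, population, {modern district: area share}).
# The area shares stand in for the result of a vector overlay.
parishes = {
    "p1": ("H1", 600, {"M1": 1.0}),
    "p2": ("H1", 400, {"M1": 0.5, "M2": 0.5}),
}


def conversion_table(parishes):
    """Estimate the share of each historical district's population that falls
    in each modern district, assuming people are spread evenly within a parish."""
    people_in = defaultdict(float)   # (historical, modern) -> estimated people
    people_total = defaultdict(float)  # historical -> total people
    for historical, population, shares in parishes.values():
        people_total[historical] += population
        for modern, share in shares.items():
            people_in[(historical, modern)] += population * share
    return {
        (historical, modern): people / people_total[historical]
        for (historical, modern), people in people_in.items()
    }


def redistrict(counts, table):
    """Reallocate counts for historical districts to modern districts, assuming
    the rate being measured is constant across each historical district."""
    result = defaultdict(float)
    for (historical, modern), weight in table.items():
        result[modern] += counts.get(historical, 0) * weight
    return dict(result)


table = conversion_table(parishes)
modern_counts = redistrict({"H1": 50}, table)
```

In this example 800 of the 1000 people in `H1` are estimated to live in `M1`, so the weights are 0.8 and 0.2, and a count of 50 for `H1` is split into roughly 40 for `M1` and 10 for `M2`.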

Our presentation of long-run trends extends the key statistics release from the 2001 census backwards.⁶,⁷ Basic population statistics were unproblematic, as were simple vital rates. Some variables are impossible to reconstruct for earlier dates. For example, the only census pre-2001 to study religion was 1851, and that was through a separate survey of church attendance covering kinds of Christian only, not different religions. Statistics on housing and education are mostly only available from 1951 onwards. The largest reclassification task was extracting information on three of our themes, “industry”, “employment and poverty” and “social structure”, from occupational tables back to 1841. That year’s census was the first to gather occupation data but officials did not anticipate the diversity of the results, so no occupational classification existed. The report lists 3649 occupations, ranging from “aurist” via “madder-maker” to “zincographer”, with different lists for each county, which our data documentation system maps to progressively simpler classifications, ending with the six sectors we identify for eight censuses between 1841 and 2001: agriculture, mining, manufacturing, construction, utilities (including transport) and services. For the 1881 census, we used data from the transcription of the original enumerators’ books coordinated by the Mormons, involving about 1.5 million different occupational titles. (The Mormon church publishes a huge amount of genealogical data on the Internet, making a major resource for family historians.) Figure 3 shows one end result, which Vision of Britain can provide for each of 408 local authorities. (Even limiting analysis of long-run employment trends to six sectors is problematic: our 1981 data come from the small area statistics within which mining cannot be distinguished from utilities, which is particularly unfortunate for Easington.)
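Mapping occupational titles to progressively simpler classifications can be sketched as a chain of lookups ending in the six sectors. The intermediate groups, the sector assigned to each title and the counts are assumptions for illustration; they are not the project’s actual classification.

```python
from collections import Counter

SECTORS = (
    "agriculture", "mining", "manufacturing",
    "construction", "utilities", "services",
)

# Step 1 (assumed): original title -> intermediate occupational group.
title_to_group = {
    "aurist": "medical",
    "madder-maker": "dye making",
    "zincographer": "printing",
}

# Step 2 (assumed): intermediate group -> one of the six sectors.
group_to_sector = {
    "medical": "services",
    "dye making": "manufacturing",
    "printing": "manufacturing",
}


def sector_totals(counts_by_title):
    """Add up workers by sector, following both mapping steps."""
    totals = Counter({sector: 0 for sector in SECTORS})
    for title, workers in counts_by_title.items():
        sector = group_to_sector[title_to_group[title]]
        totals[sector] += workers
    return dict(totals)


# Invented counts for one area.
totals = sector_totals({"aurist": 1, "madder-maker": 2, "zincographer": 3})
```

Keeping the steps separate means a county-specific list of titles only needs its own first-step mapping; the later, simpler classifications can be shared.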

The advanced technology underpinning Vision of Britain creates an easy-to-use site. In particular, postcodes or place names typed into our home page lead directly to “location” and “place” pages, listing administrative units covering the point or named after the place, drawing particular attention to the relevant modern authority. These all link to “unit home pages” listing available statistical themes. Each theme begins with graphs of pre-defined rates, which can also be mapped. Users can also directly access the rows of original census tables covering the current unit and view boundary maps.

Existing data libraries follow procedures developed when data storage was scarce and expensive. In comparison, our architecture is wasteful of storage, but the whole system would still fit on an iPod. Such architecture was essential for a system that needs, for example, to take the population totals for each parish in Britain out of the separate census tables in which they originally appeared and to present them to users as a single time series. We believe it can also form the basis for new analytic approaches. In particular, grid computing provides an infrastructure for exploration of very large scale social science datasets by pattern-seeking automata, but this is only possible if individual data values are held in a consistent structure such as in Vision of Britain. However, the immediate problem is simply preserving the website and data structure: construction funding now totals over £2 million and the site attracts 60 000 unique users per month, but running costs are currently covered only until the end of 2008.

References

1. Curtis, S., Southall, H. R., Congdon, P. and Dodgeon, B. (2004) Area effects on health variation over the life-course: analysis of the longitudinal study sample in England using new data on area of residence in childhood. Social Science and Medicine, 58, 57–74.

2. Office of Population Censuses and Surveys (1977) Guide to Census Reports: Great Britain 1801–1966.

3. 1871 Census Report. Appendix C: Territorial Sub-Divisions of England, p. 175.

4. 1881 Census Report. England and Wales, vol. 1, Area, Houses, and Population: Counties, pp. iii–iv.

5. Simpson, L. (2002) Geography conversion tables: a framework for conversion of data between geographical units. International Journal of Population Geography, 8, 1.

6. National Statistics (2003) Key Statistics for Local Authorities in England and Wales. London: Office for National Statistics.

7. General Register Office (Scotland) (2003) Key Statistics for Council Areas and Health Board Areas. Edinburgh: General Register Office (Scotland).

8. Woollard, M. (1999) The Classification of Occupations in the 1881 Census of England and Wales.

Humphrey Southall is a Reader in Geography at the University of Portsmouth, and Director of the Great Britain Historical GIS Project which created the Vision of Britain website.

Figure 3. Industrial structure of Easington district, Durham, 1841–2001
