
Identifying the cash-rich and the cash-poor:

Lessons from the Census Rehearsal

Dr Paul Williamson

Department of Geography

ESRC Census Development Programme

• Most requested addition to 2001 Census

INCOME…

The 2001 Census Geography of income:

Other sources of data on income

• Benefits data

• Government surveys (e.g. GHS, LFS, FES, FRS, NES)

• Commercially-held data [Postcode sector and postcode unit estimates]

• The Census Rehearsal (1999)

Objectives

Evaluation of:

• Extant methods for small-area income estimation

• New approaches

• Utility of non-census information (e.g. council tax; house price; benefits data)

[ • Methods of imputing income band means ]

Definition of ‘income’

• Income ≠ Wealth

• Gross or net income?

• Pre or post housing costs?

• Adult or Household?

• Household?
– Total
– Equivalised [Per capita / OECD / McClements]
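To illustrate the 'Total' versus 'Equivalised' distinction, the sketch below divides total household income by a household size factor. The per capita version divides by head count; the second version assumes the original OECD scale weights (1.0 for the first adult, 0.7 for each further adult, 0.5 for each child). The function names and the example household are hypothetical, and the McClements scale is not shown.

```python
def per_capita_income(total_income, n_adults, n_children):
    """Total household income divided by the number of residents."""
    return total_income / (n_adults + n_children)


def oecd_equivalised_income(total_income, n_adults, n_children):
    """Equivalise using the original OECD scale (assumed weights:
    1.0 first adult, 0.7 each further adult, 0.5 each child)."""
    if n_adults < 1:
        raise ValueError("a household needs at least one adult")
    factor = 1.0 + 0.7 * (n_adults - 1) + 0.5 * n_children
    return total_income / factor


# Hypothetical household: two adults, two children, 480 pounds per week in total.
print(per_capita_income(480.0, 2, 2))
print(round(oecd_equivalised_income(480.0, 2, 2), 2))
```

The choice of scale matters for ranking: a large household can sit in a high total-income band and still fall well down the equivalised distribution.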

Surrogates

• Univariate
– % unemployed
– % 2+ car households
– % residents in Social Classes I + II
– % owner-occupation

• Multivariate (deprivation indices)
– Carstairs
– Townsend
– Breadline
– DTLR Index of Multiple Deprivation 2000
– Green (Wealth) [owning 2+ cars; NS-SEC I or II; High qualifications]

• Geodemographic
– SuperProfiles
– MOSAIC
– GB Profiles

• Model

Individual income
– Dale [SOC2000; Economic activity; age; sex; Region]
– Lee [SOC2000; Economic activity]
– Regression (individual and/or ecological)

Household income
– Regression (household and/or ecological)
– Bramley & Smart (H/h comp.; earners; tenure; area level deprivation)

The 1999 Census Rehearsal

Key features

• full census questionnaire + INCOME

• Large achieved sample
– c. 65,000 households
– c. 140,000 individuals

• Spatially contiguous

Clustered sampling strategy:
– 7 part districts [Excluding NI]
– 38 wards
– 650 EDs

Potential problems

• non-response rate
– overall (~50%)
– income (~15%)
– other variables (5-20%)
– full responses for ~55% of achieved sample [individuals and households]

• non-response bias

Income band      All        No data missing
£0               20.8       16.9
<£60             13.2       11.0
£60-£119         20.5       18.7
£120-£199        15.5       16.1
£200-£299        13.3       15.9
£300-£479        11.3       14.6
£480+            5.5        6.8
Total            100.0      100.0
N                125138     67283

Social Class (1991)   All        No data missing
None                  28.2       25.4
I                     4.0        4.9
II                    18.9       21.4
III(N)                17.5       19.3
III(M)                11.5       10.8
IV                    14.7       13.7
V                     5.0        4.3
Army                  0.2        0.2
Total                 100.0      100.0
N                     117010     67283

Correlation coefficient

Indicators (calculated for 1991 Enumeration districts)   Original   Ideal
Townsend index                                            0.82       0.79
% households with
  No car                                                  0.89       0.86
  2+ cars                                                 0.87       0.83
% households
  Owner-occupied                                          0.90       0.87
  Social rented                                           0.94       0.92
  Detached                                                0.98       0.97
  Flats                                                   0.92       0.92
% of economically active
  Unemployed                                              0.57       0.55
  Social Class I+II                                       0.58       0.56

Rehearsal sub-set

• Banding of income question

What is your total current gross income from all sources?

Per week          or   Per year (approximately)
Nil                    Nil
Less than £60          Less than £3,000
£60 to £119            £3,000 to £5,999
£120 to £199           £6,000 to £9,999
£200 to £299           £10,000 to £14,999
£300 to £479           £15,000 to £24,999
£480 or more           £25,000 or more

– Only 10% of adults in top band

– but problem compounded when individual incomes aggregated to estimate household income

– band mid-point ≠ band mean
– value of band means area sensitive?

Income band mean, by Council Tax band

Income band (£)   National average   Council Tax Band A   Council Tax Band H
0                 0                  0                    0
1-60              34                 35                   27
61-120            91                 93                   86
121-200           156                155                  156
201-300           245                241                  242
301-480           375                364                  391
481+              765                652                  1353

Source: FRS 1998/9 (Crown Copyright)
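The sensitivity of an area's estimated mean income to the choice of band means can be sketched as follows. The band shares are the 'All' column of the Rehearsal income distribution shown earlier, and the band means are the FRS 1998/9 values in the table above. Applying one council tax band's means to the whole distribution is an assumption made purely to show the size of the effect, and the names `SHARE_ALL`, `BAND_MEANS` and `mean_income` are hypothetical.

```python
# Percentage of individuals in each income band ('All' column, Census Rehearsal).
SHARE_ALL = [20.8, 13.2, 20.5, 15.5, 13.3, 11.3, 5.5]

# Mean income within each band, lowest band first (FRS 1998/9 table above).
BAND_MEANS = {
    "national": [0, 34, 91, 156, 245, 375, 765],
    "band_a": [0, 35, 93, 155, 241, 364, 652],
    "band_h": [0, 27, 86, 156, 242, 391, 1353],
}


def mean_income(shares, band_means):
    """Weighted mean of band means; shares need not sum exactly to 100."""
    weighted = sum(s * m for s, m in zip(shares, band_means))
    return weighted / sum(shares)


for name, means in BAND_MEANS.items():
    print(name, round(mean_income(SHARE_ALL, means), 1))
```

Almost all of the difference between the three results comes from the open top band, which is why the treatment of that band receives separate attention.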

Digression: modelling income band means

Alternative modelling strategies include:

• National mean

• Sub-group mean (e.g. by council tax band)

• Statistical distributions (log-normal; pareto)

• New variant of log-normal approach with addition of modelled median etc.
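A minimal sketch of the statistical-distribution strategy for the open top band, assuming standard textbook results rather than the new variant proposed here (whose details are not given on the slide). The log-normal parameters are fitted from two known points on the cumulative distribution, and the top-band mean is the conditional expectation above the band's lower limit. All function names and the example figures are hypothetical.

```python
from math import exp, log
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for cdf and inverse cdf


def fit_lognormal_from_quantiles(b1, p1, b2, p2):
    """Solve ln(b) = mu + sigma * z_p for two (income, cumulative share) points."""
    z1, z2 = _STD.inv_cdf(p1), _STD.inv_cdf(p2)
    sigma = (log(b2) - log(b1)) / (z2 - z1)
    mu = log(b1) - sigma * z1
    return mu, sigma


def lognormal_top_band_mean(mu, sigma, lower):
    """E[X | X > lower] for a log-normal with log-scale parameters mu, sigma."""
    numerator = exp(mu + sigma ** 2 / 2) * _STD.cdf((mu + sigma ** 2 - log(lower)) / sigma)
    denominator = _STD.cdf((mu - log(lower)) / sigma)
    return numerator / denominator


def pareto_top_band_mean(alpha, lower):
    """E[X | X > lower] for a Pareto tail with shape alpha (finite only if alpha > 1)."""
    if alpha <= 1:
        raise ValueError("mean is infinite for alpha <= 1")
    return alpha * lower / (alpha - 1)


# Hypothetical area: median 150 pounds per week, 90% of incomes below 480.
mu, sigma = fit_lognormal_from_quantiles(150.0, 0.5, 480.0, 0.9)
print(round(lognormal_top_band_mean(mu, sigma, 480.0)))
print(pareto_top_band_mean(2.0, 480.0))
```

Because the fitted parameters can differ between areas, this approach lets the top-band mean vary spatially, which a single national mean cannot do.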

Results

• For all bands sub-group mean best
– if possible

• For closed-bands, national mean is next best

• For open (top) band, new proposed log-normal approach is best, particularly where there is evidence of strong spatial clustering

• Spatial scale
– At what scale does income vary most?

• MAUP (modifiable areal unit problem)
– 1991 vs 1998/9 boundaries
– zones with <10 households or 25 residents excluded from analysis

• SOC 2000 / NS-SEC
– Lack of alternative SOC2000 coded data
– Therefore have to use Census Rehearsal data
– Use partitioned data to avoid unduly advantaging SOC2000 based approaches

Results

[Figure: Census Rehearsal Income Distribution. Horizontal axis: Annual Gross Income (£ 000s), bands Nil, <3, 3-5, 6-9, 10-14, 15-24, 25+; vertical axis 0 to 30.]

• At ward level the % household reps. in top income-band averaged 9.1%
– but ranged from 2.8% to 21.6%

• 89% of EDs contained one or more household reps. in top income-band
– i.e. in top income-decile of the population

Heterogeneity rules OK!

[Figure: Income distribution of household representative (Person 1 on Census Rehearsal form). Six panels: All EDs; EDs in lowest income quintile; EDs in second income quintile; EDs in middle income quintile; EDs in fourth income quintile; EDs in top income quintile. Horizontal axis: Income bracket (£000 p.a.), bands Nil, <3, 3-5, 6-9, 10-14, 15-24, 25+.]

Missing data

• Missing data have minimal impact on results
– From ‘Raw’ to ‘Ideal’ data, most correlations change by <0.02
– Very few values change by >0.05
– Exception is NS-SEC 8 [by definition!]
– Correlations lower for ‘Ideal’ than ‘Raw’

• Surrogates calculated direct from Rehearsal
– circumvents data response bias?

Scale

• Higher correlations at higher geographies

• District effect small but significant
– BUT none of districts in SE England

Overfitting

• No significant impact

MAUP

• Correlations vary by up to 0.1 between alternative boundaries at same spatial scale

BUT

• No detectable effect on rankings of surrogate income measures

Adult income (r²)

Surrogate               Ward    ED      Post-code
Univariate
  NS-SEC I+II           0.81    0.81    0.64
Multivariate
  Townsend              0.36    0.46    0.38
  Green (wealth)        0.57    0.55    0.50
Geodemographic
  PCA_96                na      0.82    0.69
  Voas                  0.83    0.59    0.48
Model
  Dale                  0.91    0.89    0.90
  Lee                   0.90    0.87    0.88
  Voas (individual)     0.91    0.80    0.83

[See final slide for definition of ‘surrogates’]

Caveats

• ‘Best’ performing surrogates in danger of over-fitting?
– For Dale, Lee and Voas mean occupational income calculated directly from Census Rehearsal dataset (no other SOC2000 sources available at time of analysis)

BUT
– No significant difference if SOC minor or unit codes used
– No significant difference if data partitioned

Household income (r²)

Surrogate               Ward    ED      Post-code
Univariate
  NS-SEC I+II           0.82    0.81    0.64
Multivariate
  Townsend              0.48    0.46    0.44
  Green (wealth)        0.61    0.50    0.56
Geodemographic
  PCA_96                na      0.81    0.67
  Voas                  0.81    0.60    0.48
Model
  Dale                  0.90    0.85    0.86
  Lee                   0.87    0.83    0.83
  Voas (household)      0.76    0.74    0.74

[See final slide for definition of ‘surrogates’]

Accuracy

• For many purposes relative, rather than absolute, accuracy is most important, i.e. ranking

a) NS-SEC based income surrogate [NSSEC12]
[Figure: % of economically active in NS-SEC 1+2 (0% to 100%) plotted against observed mean individual income (£ week, 0 to 600).]

b) Regression based estimate [VOASIND]
[Figure: predicted mean individual income (£ week, 0 to 600) plotted against observed mean individual income (£ week, 0 to 600).]

c) Sub-group mean based estimate [LEEINCM]
[Figure: predicted mean individual income (£ week, 0 to 600) plotted against observed mean individual income (£ week, 0 to 600).]

Surrogate/Estimate                  % NSSEC 1+2   Individual Regression   Sub-group mean   Ecological Regression
                                    [NSSEC12]     [VOASIND]               [LEEINCM]
% ranked in same decile as income
  Overall                           36            42                      50               46
  Within ± 1 decile                 82            84                      89               92
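The ranking comparison can be sketched as follows: areas are ranked into deciles on observed income and on the surrogate, and the percentage falling in the same decile, or within one decile, is counted. This is an illustrative reconstruction, not the code behind the figures above; the function names are hypothetical and ties are broken by input order.

```python
def decile_ranks(values):
    """Assign each value a decile (0 = lowest tenth, 9 = highest) by rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    deciles = [0] * n
    for rank, i in enumerate(order):
        deciles[i] = rank * 10 // n
    return deciles


def decile_agreement(observed, surrogate):
    """Return (% of areas in same decile, % of areas within +/- 1 decile)."""
    obs = decile_ranks(observed)
    sur = decile_ranks(surrogate)
    n = len(obs)
    same = sum(o == s for o, s in zip(obs, sur))
    near = sum(abs(o - s) <= 1 for o, s in zip(obs, sur))
    return 100 * same / n, 100 * near / n


# Hypothetical check: a surrogate that preserves the income ranking exactly.
incomes = [float(x) for x in range(20)]
print(decile_agreement(incomes, [2 * x for x in incomes]))
```

Only the rank order of the surrogate matters here, so a surrogate measured in percentages (such as % NS-SEC 1+2) can be compared directly with estimates measured in pounds.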

• < 1% of unexplained spatial variation in income attributable to area level effects

• House price has no significant impact
– could be due to data problems

• Council tax band has small but significant effect [for areas of enumeration district size and below]

• Lack of utility counter-intuitive?
– current value ≠ purchase price
– purchase income ≠ current income

Other data sources

Conclusions (I)

• Best approaches capture 80-90% of spatial variation in income, even for smallest spatial units

• But considerable within-area heterogeneity

• Best approaches are regression or sub-group mean based

• Conventional deprivation indices a poor second to % social class / NS-SEC I+II

Conclusions (II)

• Geodemographic classifications at best perform as well as % NS-SEC I+II, and perform best for areas of ward size and above

• Qualified support for use of statistical distributions in modelling top income band means

Implications

Moral for marketers:

• Target people, not places

Moral for policy makers:

• Deprivation indices not the best proxy for income

• ONS ward income estimates (based on ecological regression) likely to perform well

Longer term

• Consider external correlates (e.g. IMD 2000; benefits data)

• Lobby for Census Office to create small-area income estimate
– by imputing income on Census microdata
– include non-census information (?)

Acknowledgements

• House price data were taken from the Experian Limited Postal Sector Data, ESRC/JISC Agreement.

• Grateful thanks are due to the Census Custodians of England, Wales and Scotland for granting permission to access the Census Rehearsal dataset.

• A debt of gratitude is also owed to a number of staff at the Office for National Statistics, in particular Keith Whitfield and Philip Clarke.

• Finally, thanks are due to David Voas for undertaking some of the preparatory work for this project.

• All analyses and conclusions remain my sole responsibility.

Definitions (I)

• NS-SEC I+II: % persons aged 16-74 in NS-SEC I or II

• Townsend: Multiple deprivation indicator based on % economically active unemployed; % overcrowded households; % households with no car and % of households not owner occupied

• Green (Wealth): Affluence indicator based on % households with 2+ cars; % persons aged 16-74 in NS-SEC I and % adults with high educational qualifications

• PCA_96: Geodemographic classification based on principal components analysis of 20 normalised census variables, individuals in each of 96 area types assumed to have mean income of all persons in area type

• Voas: Alternative geodemographic classification, in which five census variables are divided into above or below median, one variable into thirds; with all cross-tabulated to give a total of 96 discrete area types
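As an illustration of how a multivariate index such as Townsend combines its four inputs, the sketch below sums standardised (z) scores across areas. It assumes the construction as it is commonly described, with unemployment and overcrowding log-transformed as ln(x + 1) before standardising; the slide itself only lists the four variables. The function names and the three example areas are hypothetical.

```python
from math import log
from statistics import mean, pstdev


def z_scores(values):
    """Standardise values to mean 0 and (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def townsend_index(unemployed, overcrowded, no_car, not_owner_occupied):
    """Sum of z-scores of four area percentages; higher = more deprived.

    Assumption: ln(x + 1) is applied to the first two inputs before
    standardising, following the usual description of the index.
    """
    components = [
        z_scores([log(v + 1) for v in unemployed]),
        z_scores([log(v + 1) for v in overcrowded]),
        z_scores(no_car),
        z_scores(not_owner_occupied),
    ]
    return [sum(area) for area in zip(*components)]


# Three hypothetical areas, least to most deprived (all inputs are percentages).
scores = townsend_index([3, 8, 20], [1, 4, 10], [15, 35, 60], [20, 45, 75])
print([round(s, 2) for s in scores])
```

Because each component is standardised against the set of areas supplied, scores are relative to that set and are not comparable across different study areas.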

Definitions (II)

• Dale: Income imputed given mean income for population sub-group defined by sex, SOC 2000 minor group, economic activity (missing; employed full-time; employed part-time; self-employed; other), age (missing; 0-15; 16-19; 20-29; 30-49; 50+) [Maximum of 4860 valid sub-groups]

• Lee: Income imputed given mean income for population sub-group defined by SOC 2000 minor group, economic activity (child; not applicable; employed full-time; employed part-time; self-employed; unemployed; retired; other inactive) [maximum of 649 valid sub-groups]
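The sub-group mean approach used by the Dale and Lee surrogates can be sketched as: compute mean income for each sub-group from a sample with known incomes, assign that mean to every person in the sub-group, then average the imputed values within each area. The sketch uses a Lee-style key of (SOC 2000 minor group, economic activity); the codes, incomes and function names are hypothetical, and the fallback value for unseen sub-groups is an assumption.

```python
from collections import defaultdict


def subgroup_means(records):
    """records: iterable of (group_key, income). Returns {group_key: mean income}."""
    totals = defaultdict(lambda: [0.0, 0])
    for group, income in records:
        totals[group][0] += income
        totals[group][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


def impute_area_mean(area_groups, means, fallback):
    """Mean imputed income for an area, given each resident's sub-group key.

    Residents whose sub-group was not seen in the sample receive `fallback`.
    """
    imputed = [means.get(group, fallback) for group in area_groups]
    return sum(imputed) / len(imputed)


# Hypothetical sample with known weekly incomes.
sample = [
    (("111", "employed full-time"), 500.0),
    (("111", "employed full-time"), 700.0),
    (("913", "employed part-time"), 100.0),
]
means = subgroup_means(sample)

# Hypothetical area with one resident from each sub-group.
area = [("111", "employed full-time"), ("913", "employed part-time")]
print(impute_area_mean(area, means, fallback=0.0))
```

Partitioning the data, so that the sub-group means come from one part of the sample and the evaluation areas from another, is the guard against over-fitting mentioned in the caveats.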

Definitions (III)

• Voas (individual): Regression model for adult income (children assumed to have 0 income); INCOME^0.5 predicted given: mean income by SOC2000 unit; mean income by Industry category, age, age^2, residents, residents^2, rooms and cars plus dummy variables for sex, white, full-time student, married, Single/Widowed/Divorced, Long-term ill, No qualifications, GCSE or equivalent, A levels or equivalent, Undergraduate degree or equivalent, employed full-time, employed part-time, self-employed, unemployed, retired, permanently sick, other economically inactive excluding pensioners and students, Semi-detached, terrace, flat, caravan, privately rented, social rented, employed manager or supervisor and district of residence

• Voas (household): Regression model for total household income; HHINC^0.5 predicted given same set of predictors as for Voas (individual), but based only upon head of household’s characteristics
