philip clarke and denise silva

Post on 05-Jan-2016

1

Philip Clarke and Denise Silva

Development of Small Area Estimation at ONS

2

Outline

1. Small Area Estimation Problem

2. History and current provision

3. Development in progress

4. Wider research

5. Consultancy service

3

1. Small Area Estimation Problem

• “Official statistics provide an indispensable element in the information system of a democratic society” (Fundamental Principles of Official Statistics, UNSD)

• Sample surveys are used to provide estimates of target parameters at the population (or national) level and also for subpopulations or domains of study

• However, implementation in a small area context is challenging

4

Small Area Estimation Problem

• In small areas/domains, sample sizes are usually not large enough to provide reliable estimates using classical design-based methods.

• The small area estimation problem refers to SMALL SAMPLE SIZES (or none at all) in the domain or area of interest.

5

2. History

• Small Area Estimation in the UK began as a research project in the late 1990s.

• In response to calls for locally focussed information in many different areas:

– Environmental

– Business

– Social, e.g. health, housing, deprivation, unemployment.

• Also calls for more general domain estimation:

– e.g. cross classifications by age/sex, occupation.

• Initial experimental studies on mental health estimation for DoH.

6

Developing alternative methodology

• Purpose:

– To enable production of reliable estimates of characteristics of interest for small areas or domains based on very small or no sample.

– To assess the quality (precision) of estimates.

• Several years of research and development (since 1995)

– Partnership work with universities and Statistics Finland

– The EURAREA project: research programme funded by Eurostat to ‘enhance techniques to meet European needs’ (2001-2004)

7

Basis of Approach: Relax the Survey Restriction

• ‘Borrow strength’ by removing the isolation of depending solely on the survey and solely on respondents in a given area.

– Widen the class of respondents for a given area by pooling together similar areas.

– Widen the class of respondents by taking past period respondents into account.

– Take advantage of other related data sources which are not sample survey based. These are known as auxiliary data, e.g. administrative data or census data which are available for all areas/domains.

8

Model based estimation

• All approaches detailed are based on an implicit or explicit model.

• Using auxiliary data together with survey data from all areas is the approach currently adopted in the UK.

– Borrows strength nationally.

– Uses an explicit statistical model to represent the relationship between the survey variable of interest and auxiliary data.

– Dependent variable is the survey variable of interest.

– Independent variables are certain auxiliary data variables known as covariates.

– Model fitted using sample data and assumed to apply generally.

– Model then used to obtain area/domain estimates.

9

Outline of a model structure

• Suppose variable of interest, Y, in an area j is linearly related to a single covariate X

• A possible model structure is given by:

$$\bar{Y}_j = \alpha + \beta X_j$$

where $\bar{Y}_j$ is the mean of Y in area j

• This is a deterministic structure, so we need to add some random variability

10

• Obtain

$$\bar{Y}_j = \alpha + \beta X_j + u_j, \qquad u_j \sim N(0, \sigma_u^2)$$

• $u_j$ represent random area differences from the deterministic value.

• $\sigma_u^2$ represents variability between areas.

11

Model fitting

• Fit the model using direct survey estimates for each area.

• This introduces additional sampling variability:

$$\bar{y}_j = \alpha + \beta X_j + u_j + e_j$$

• Unit level sampling variability

$$e_{ij} \sim N(0, \sigma_e^2)$$

giving rise to additional area level sampling variability

$$e_j \sim N(0, \sigma_e^2 / n_j)$$

where $n_j$ is the sample size in area j
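The area-level structure on this slide (a linear term in the covariate, a between-area effect with variance σ_u², and an area-level sampling error with variance σ_e²/n_j) can be illustrated with a small simulation. This is only a sketch under stated assumptions: the parameter values, the number of areas and the sample sizes are invented for illustration, and ordinary least squares stands in for the mixed-model fitting that would be used in practice.

```python
import random

def simulate_direct_estimates(alpha, beta, sigma_u, sigma_e, x, n, rng):
    """Return one simulated direct estimate per area."""
    y = []
    for x_j, n_j in zip(x, n):
        u_j = rng.gauss(0.0, sigma_u)               # between-area effect
        e_j = rng.gauss(0.0, sigma_e / n_j ** 0.5)  # sampling error, variance sigma_e^2 / n_j
        y.append(alpha + beta * x_j + u_j + e_j)
    return y

def least_squares_line(x, y):
    """Ordinary least squares intercept and slope for a single covariate."""
    m = len(x)
    x_bar = sum(x) / m
    y_bar = sum(y) / m
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

# Illustrative (assumed) values, not taken from the presentation.
rng = random.Random(0)
n_areas = 2000
x = [rng.uniform(0.0, 10.0) for _ in range(n_areas)]
n = [rng.randint(5, 50) for _ in range(n_areas)]
y = simulate_direct_estimates(1.0, 0.5, 0.5, 2.0, x, n, rng)
alpha_hat, beta_hat = least_squares_line(x, y)
print(alpha_hat, beta_hat)
```

With many areas the fitted intercept and slope land close to the values used to generate the data, even though each individual direct estimate is noisy.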

12

Estimating from the model

• Once the model is fitted, estimate for area j by using parameter estimates:

$$\hat{y}_j = \hat{\alpha} + \hat{\beta} X_j + \hat{u}_j$$

• Estimate of mean squared error given by

$$\widehat{MSE}(\hat{y}_j) = \hat{\sigma}_u^2 + Var(\hat{\alpha}) + X_j^2 \, Var(\hat{\beta}) + 2 X_j \, Cov(\hat{\alpha}, \hat{\beta})$$

• Modelling success measured by obtaining estimates with high precision based on low mean squared errors.
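As a worked illustration, the area estimate and its estimated mean squared error can be computed directly from the fitted quantities. The sketch below assumes the slide's formulas read as: estimate = α̂ + β̂·X_j + û_j, and MSE estimate = σ̂_u² + Var(α̂) + X_j²·Var(β̂) + 2·X_j·Cov(α̂, β̂). The function name and the numbers are hypothetical.

```python
def area_estimate_and_mse(x_j, alpha_hat, beta_hat, u_hat_j,
                          sigma_u2_hat, var_alpha, var_beta, cov_alpha_beta):
    """Area estimate and estimated MSE for a single-covariate area-level model."""
    estimate = alpha_hat + beta_hat * x_j + u_hat_j
    mse = (sigma_u2_hat
           + var_alpha
           + x_j ** 2 * var_beta
           + 2.0 * x_j * cov_alpha_beta)
    return estimate, mse

# Hypothetical fitted values for one area.
estimate, mse = area_estimate_and_mse(
    x_j=2.0, alpha_hat=1.0, beta_hat=0.5, u_hat_j=0.1,
    sigma_u2_hat=0.04, var_alpha=0.01, var_beta=0.0025, cov_alpha_beta=-0.001,
)
print(estimate, mse)
```

A negative covariance between the intercept and slope estimates, as in this example, reduces the contribution of parameter uncertainty for positive covariate values.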

15

Current provision

• SAEP – a generic methodology for application to variables from household based surveys.

– Mean household income based on Family Resources Survey published as Experimental Statistics for wards in 1998/99, 2001/02 and for middle layer super output areas 2004/05

• Specialised methodology for labour market estimation of unemployment from Labour Force Survey.

– Unemployment levels and rates routinely published quarterly as National Statistics for Local Authority Districts in Great Britain.

16

SAEP methodology and income estimation

SAEP methodology is:

• derived from the outlined model-based approach,

BUT is

• based on a unit (household)/area multilevel model;

• borrows strength across areas using multivariate area level auxiliary data (covariates);

• can model transformation of variable of interest if required;

• adapted for estimating at ward/middle layer super output area (MSOA) level from customary ONS clustered design household sample surveys.

17

Application to income estimation- Response Variable

• Income value for each household sampled in Family Resources Survey (FRS).

– ~ 3,300 MSOAs in England and Wales with sample in 2004/05,

– ~ 21,500 total responding households.

• But not a simple random sample.

– Clustered design with primary sampling units as postcode sectors,

– ~ 1,500 sampled postcode sectors.

18

Coping with design clustering

• Samples are random samples of postcode sectors;

– So random terms are around postcode sectors, indexed by j

• Estimation is required for geographically distinct wards or middle layer super output areas;

– So covariates are for these areas, indexed by d

– For estimation, covariates must be known for all areas, not just sampled areas.

19

SAEP model and estimator structure for income estimation

• Multilevel structure gives rise to unit level random term $e_{ij}$ replacing the area level sampling variability term

• Logarithmic transformation of income taken because of positive skewness of income distribution

• Model:

$$\log(y_{id}) = X_d^T \beta + u_j + e_{ij}$$

20

SAEP model fitting procedure

• Create a dataset containing:

– Variable of interest from individual household responses to survey.

– Values of a large number of administrative and census variables for the particular household's area of residence which we believe could impact on the variable of interest, e.g. census variables, DWP social benefit claimant rates, council tax band proportions

21

SAEP model fitting procedure (cont.)

• Starting with a null model, fit covariates in a stepwise manner in order of significance by using specialised multilevel software, e.g. MLwiN or SAS PROC MIXED.

• In this way select a set of significant covariates and fit an accepted model.

• Use diagnostic techniques to check the model against its assumptions, e.g. randomness of residuals, unbiasedness of predictions.

22

Estimator and mean squared error

• Estimator on log income scale: a synthetic estimator is used, omitting the random area terms:

$$\widehat{\log y}_d = X_d^T \hat{\beta}$$

• Mean squared error:

$$X_d^T \, Var(\hat{\beta}) \, X_d + \hat{\sigma}_u^2$$

24

Converting to raw income scale

• Need to make allowance for

mean(log) ≠ log(mean)

• Area estimate

$$\hat{y}_d = \exp\left( X_d^T \hat{\beta} + \frac{\hat{\sigma}_u^2 + \hat{\sigma}_e^2}{2} \right)$$

• Confidence interval

$$\exp\left( X_d^T \hat{\beta} + \frac{\hat{\sigma}_u^2 + \hat{\sigma}_e^2}{2} \pm 1.96 \left( X_d^T \, Var(\hat{\beta}) \, X_d + \hat{\sigma}_u^2 \right)^{1/2} \right)$$
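The conversion from the log scale can be sketched in a few lines. This example assumes the slide's formulas read as: area estimate = exp(X_dᵀβ̂ + (σ̂_u² + σ̂_e²)/2), with interval limits obtained by adding ±1.96 times the square root of (X_dᵀ Var(β̂) X_d + σ̂_u²) inside the exponential. The function name and all numerical values are hypothetical.

```python
import math

def income_estimate_with_interval(x_d, beta_hat, var_beta, sigma_u2, sigma_e2, z=1.96):
    """Back-transform a log-scale synthetic estimate to the raw income scale.

    x_d and beta_hat are equal-length sequences; var_beta is the covariance
    matrix of beta_hat given as a list of rows.
    """
    k = len(x_d)
    linear = sum(x_d[i] * beta_hat[i] for i in range(k))
    # Quadratic form x_d^T Var(beta_hat) x_d
    quad = sum(x_d[i] * var_beta[i][j] * x_d[j] for i in range(k) for j in range(k))
    mse_log = quad + sigma_u2                      # MSE on the log scale
    centre = linear + 0.5 * (sigma_u2 + sigma_e2)  # allowance for mean(log) != log(mean)
    half_width = z * math.sqrt(mse_log)
    return (math.exp(centre),
            math.exp(centre - half_width),
            math.exp(centre + half_width))

# Hypothetical inputs: an intercept plus one covariate.
estimate, lower, upper = income_estimate_with_interval(
    x_d=[1.0, 0.5],
    beta_hat=[6.0, 0.4],
    var_beta=[[0.01, 0.0], [0.0, 0.04]],
    sigma_u2=0.02,
    sigma_e2=0.18,
)
print(round(estimate, 1), round(lower, 1), round(upper, 1))
```

Because the interval is built on the log scale and then exponentiated, it is asymmetric around the estimate on the raw income scale.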

26

Actual model for ward estimation of income in 2004/05

Fitted model for $\widehat{\log(\text{income}_d)}$: intercept 6.01 plus the covariate terms below (coefficient magnitudes as shown on the slide; the signs of the terms and any further terms marked "..." are not recoverable from the transcript)

• 0.76 × phrpman

• 0.18 × lnphrpecac

• 0.13 × lnphhtype1

• 0.58 × engegh

• 0.72 × pcgeo

• ...

where

• phrpman = proportion of household reference persons aged 16-74 who are in professional or managerial occupations.

• lnphrpecac = logit of proportion of household reference persons aged 16-74 who are economically active.

• lnphhtype1 = logit of proportion of one person households.

• engegh = proportion of council tax band G&H dwellings for England.

• pcgeo = proportion of people aged 60 and over claiming pension credit (guarantee element only).

27

28

Income estimation outputs

• Estimates obtained of sufficient precision for publication and acceptable to user community.

• Accredited as Experimental Statistics

• Placed on Neighbourhood Statistics website together with user guides and technical documentation.

29

Estimation of unemployment at local authority level

BACKGROUND

• Unemployment is a key indicator and is used for policy making and resource allocation

• Official UK measure of unemployment follows the International Labour Organisation (ILO) definition

• ILO unemployment is estimated via the Labour Force Survey (national level)

• Small (local) sample sizes in the LFS for some areas

30

Features of Labour Force Survey

• A rotating panel survey

– Roughly 60,000 households surveyed each quarter

– Each household remains in sample for 5 quarters (waves 1 to 5) then drops out

• Waves 1 and 5 respondents for last four quarters used to obtain an annual ‘local labour force survey’ dataset of about 90,000 independent households.

• Unclustered survey design – giving a sample in each LAD.
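The size of the annual dataset and the reason its households are distinct can be checked with a little arithmetic. The sketch below assumes the five waves are of equal size and ignores non-response, which is why it gives a figure somewhat above the "about 90,000" quoted; both function names are hypothetical.

```python
def annual_dataset_size(households_per_quarter, n_waves, waves_used, n_quarters):
    """Households contributed by the chosen waves over the chosen quarters."""
    per_wave = households_per_quarter / n_waves   # assumes equal-sized waves
    return per_wave * len(waves_used) * n_quarters

def entry_quarters(wave, quarters):
    """Quarter in which each cohort observed at `wave` first entered the sample."""
    return {q - (wave - 1) for q in quarters}

size = annual_dataset_size(60_000, n_waves=5, waves_used=(1, 5), n_quarters=4)
print(size)  # 96000.0

# Wave 1 households in quarters 0..3 entered in quarters 0..3;
# wave 5 households in the same quarters entered in quarters -4..-1.
wave1 = entry_quarters(1, range(4))
wave5 = entry_quarters(5, range(4))
print(wave1.isdisjoint(wave5))  # True
```

Since the two sets of entry cohorts do not overlap, no household appears twice in the pooled file, which is what allows the respondents to be treated as independent households.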

31

Features of unemployment modelling

• Unclustered LFS design means

– direct estimates available for each LAD

– availability of estimated random area terms in LAD estimation

• However

– low precision of direct survey estimates due to small sample sizes

– need for better precision model-based estimates

• Availability of a highly correlated covariate

– number of claimants of unemployment benefit/job seekers allowance

– Eliminates need for model fitting to a range of possible covariates on each occasion.

32

The small area estimation model

A LOGISTIC multilevel model by local authority (d) and six age/sex classes (i). It models the probability $p_{id}$ that an individual in age/sex class i of local authority d is unemployed.

Response variable: proportion of unemployed individuals in LFS in each age/sex class of each local authority (logit transformed).

Covariate data

• Benefit data: the logit of the claimant proportion of job seekers allowance in each age/sex class within each local authority and also over all age/sex classes;

• The age/sex class: male/female for age groups (16 to 24; 25 to 49; 50 and over)

• Geographical region: the 12 government office regions (GOR)

• ONS area classification: 7 categories under the National Statistics Area Classification for Local Authorities

33

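• The model used to link $p_{id}$ with the auxiliary data is a Binomial linear mixed model with a logistic link function

$$\text{logit}(p_{id}) = \ln\left( \frac{p_{id}}{1 - p_{id}} \right) = x_{id}^T \beta + u_d, \qquad u_d \sim N(0, \sigma_u^2)$$

where $u_d$ is the area random effect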

34

Estimator from model

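• The model-based estimator of proportion unemployed in each age/sex group of each LAD is then given after fitting the model by:

$$\hat{p}_{id} = \text{antilogit}\left( x_{id}^T \hat{\beta} + \hat{u}_d \right) = \frac{\exp\left( x_{id}^T \hat{\beta} + \hat{u}_d \right)}{1 + \exp\left( x_{id}^T \hat{\beta} + \hat{u}_d \right)}$$

• Note the use of the term $\hat{u}_d$ in the estimator as it is now available for each LAD.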
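The antilogit step can be written out as follows. This is a minimal sketch: it assumes the linear predictor is the covariate vector times the fitted coefficients plus the estimated area effect, and the function names and inputs are hypothetical.

```python
import math

def antilogit(eta):
    """Inverse of the logit link, arranged to avoid overflow for large |eta|."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

def predicted_proportion(x_id, beta_hat, u_hat_d):
    """Model-based proportion for one age/sex class in one area."""
    eta = sum(x * b for x, b in zip(x_id, beta_hat)) + u_hat_d
    return antilogit(eta)

# Hypothetical covariates, coefficients and area effect.
p_hat = predicted_proportion([1.0, -3.0], [0.5, 1.0], u_hat_d=0.0)
print(p_hat)
```

A positive estimated area effect pushes the predicted proportion above what the covariates alone would give, and a negative one pulls it below.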

35

Model-based estimate for Unemployment

• Model has estimated a proportion at each age/sex group

• This is converted into an estimate of unemployment level at each LAD by:

– multiplying each proportion estimate by the LFS estimate of population unsampled

– adding those sampled and found unemployed

– summing the age/sex group estimates

Final Estimator for unemployment level for area d is:

$$\hat{Y}_d = \sum_{i=1}^{6} \hat{Y}_{id} = \sum_{i=1}^{6} \left[ y_{sid} + (\hat{N}_{id} - n_{id}) \, \hat{p}_{id} \right]$$

where the sum is over the 6 age-sex groups
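The three conversion steps (scale the proportion to the unsampled population, add the sampled unemployed, sum over groups) can be sketched as below. The reading of the estimator as the sum over groups of y_sid + (N̂_id − n_id)·p̂_id is assumed from the slide, and all the figures are invented for illustration.

```python
def unemployment_level(sampled_unemployed, population_estimates, sample_sizes, proportions):
    """Unemployment level for one area, summed over its age/sex groups."""
    total = 0.0
    for y_s, n_hat, n, p_hat in zip(sampled_unemployed, population_estimates,
                                    sample_sizes, proportions):
        unsampled = n_hat - n              # estimated population not in the sample
        total += y_s + unsampled * p_hat   # observed unemployed plus predicted unemployed
    return total

# Hypothetical inputs for the six age/sex groups of one local authority.
level = unemployment_level(
    sampled_unemployed=[3, 5, 2, 4, 6, 1],
    population_estimates=[1000, 2000, 1500, 1000, 2500, 1200],
    sample_sizes=[50, 100, 60, 40, 110, 50],
    proportions=[0.10, 0.04, 0.03, 0.08, 0.03, 0.02],
)
print(round(level, 1))  # 406.7
```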

36

LAD Estimation of unemployment rate

• The estimate of unemployment rate is obtained using the model-based estimate of unemployment level and the direct estimate of employment:

$$\hat{r}_d = \frac{\hat{Y}_d}{\hat{Y}_d + \hat{E}_d}$$

where $\hat{Y}_d$ is the model-based estimate of unemployment and $\hat{E}_d$ is the direct survey estimate of employment
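In code the rate calculation is a one-liner: the model-based unemployment level divided by the sum of that level and the direct employment estimate. The function name and the figures below are hypothetical.

```python
def unemployment_rate(unemployment_model_based, employment_direct):
    """Unemployment rate: unemployed divided by unemployed plus employed."""
    return unemployment_model_based / (unemployment_model_based + employment_direct)

rate = unemployment_rate(400.0, 7600.0)
print(rate)  # 0.05
```

Note that the numerator and part of the denominator come from the model, while the employment figure is a direct survey estimate, so the two sources of uncertainty differ.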

37

Precision of Estimates

• The mean squared error (MSE) for the unemployment level estimates in LAD d is given by several components

$$MSE(\hat{Y}_d) = G_1 + G_2 + G_3 + G_4 + G_5$$

• G1 and G2 come from the uncertainty in estimating the coefficients $\beta$ and $u$ in the model

• G3 arises because we have estimated the variance of u, $\sigma_u^2$

• G4 is necessary because the model estimates actual values rather than means

• G5 is the additional variance component due to the estimation of population size $(\hat{N}_d)$ in each LAD

38

Unemployment estimates publication

• The standard errors of the model-based estimates were found to be smaller than the corresponding direct standard errors in each LAD.

• Model-based estimates have been accredited as National Statistics and are now published quarterly in Labour Market statistics releases.

(http://www.statistics.gov.uk/StatBase/Product.asp?vlnk=14160)

39

3. Developments in progress

Labour Market area

– Consistent estimation of all three labour market states: employed, not economically active, unemployed

– Currently, Local Authority labour market estimates are:

• Model-based estimates for unemployment

• Direct survey estimates for economic inactivity and employment figures

• Now developing a multivariate model to estimate concurrently the number of unemployed, employed and economically inactive people by local authority

40

Compositional data

• The proportions of individuals classified in each category are bounded between 0 and 1 and subject to a unity-sum constraint.

• A Multinomial Logistic model relating labour market probabilities to auxiliary data for all three categories is therefore defined with only 2 equations.

41

Multinomial Logistic Model

$$\text{mlogit}(p_{id1}) = \ln\left( \frac{p_{id1}}{p_{id3}} \right) = x_{id}^T \beta_1 + u_{d1}$$

$$\text{mlogit}(p_{id2}) = \ln\left( \frac{p_{id2}}{p_{id3}} \right) = x_{id}^T \beta_2 + u_{d2}$$

$$p_{id1} = \frac{\exp\left( x_{id}^T \beta_1 + u_{d1} \right)}{1 + \sum_{j=1}^{2} \exp\left( x_{id}^T \beta_j + u_{dj} \right)}
\qquad
p_{id2} = \frac{\exp\left( x_{id}^T \beta_2 + u_{d2} \right)}{1 + \sum_{j=1}^{2} \exp\left( x_{id}^T \beta_j + u_{dj} \right)}$$

Then:

$$p_{id3} = \frac{1}{1 + \sum_{j=1}^{2} \exp\left( x_{id}^T \beta_j + u_{dj} \right)}$$
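The way two equations determine three probabilities can be seen in a short sketch. It assumes the two linear predictors (covariate term plus area effect for categories 1 and 2) have already been evaluated, with category 3 as the reference; the function name and input values are hypothetical, and the slides do not say which labour market state corresponds to which category.

```python
import math

def three_category_probabilities(eta1, eta2):
    """Probabilities for categories 1, 2 and reference category 3.

    eta1 and eta2 are the log-odds of categories 1 and 2 relative to category 3.
    Subtracting the maximum before exponentiating avoids overflow and leaves
    the resulting probabilities unchanged.
    """
    m = max(eta1, eta2, 0.0)
    e1 = math.exp(eta1 - m)
    e2 = math.exp(eta2 - m)
    e3 = math.exp(0.0 - m)   # the reference category has linear predictor 0
    total = e1 + e2 + e3
    return e1 / total, e2 / total, e3 / total

p1, p2, p3 = three_category_probabilities(1.2, -0.4)
print(p1, p2, p3)
```

The three values always lie between 0 and 1 and sum to one, so the unity-sum constraint on the labour market proportions is satisfied by construction.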

43

The Model

• Relates the probabilities of labour market states to the following predictors:

– Age/sex group, geographical region and ONS area classification

– Benefit data: claimant proportions (JSA) and incapacity benefit

• Other variables will be tested (e.g. income support)

46

Developments in progress (cont.)

Labour Market area

– Unemployment estimation at Parliamentary constituency level

• Non-nested geography but with certain matching areas

• Issue here is to ensure consistency with local authority estimates at comparable areas

• Model developed and estimates likely to become available in the coming year

47

Developments in progress (cont.)

Income estimation

– Estimation at local authority level

• Clustered survey design entails a modification of the SAEP framework to cater for it

• Currently in development

– Estimation of poverty: proportion of households below threshold

• Currently being developed for MSOA/local authority level

48

4. Wider research activities

In conjunction with academic partners

– Estimation of change over time

Current work is confined to single point-in-time estimation but users would like an indication of progress over time, particularly in relation to funding

– Estimation of poverty using M-quantile modelling

Research using FRS data by Nikos Tzavidis

– Models incorporating spatial relationships

Preliminary investigation of spatial relationship in unemployment model in conjunction with Ayoub Saei at Southampton University

Link with work at Imperial College by Nicky Best and Virgilio Gomez-Rubio

49

5. Methodology Consultancy Service

ONS is currently establishing a methodology consultancy service

– To undertake and support statistical work by other government departments and public sector organisations.

– Resource for assessment/quality improvement

– Currently working with Health and Safety Executive on small area estimation of incidence of work related illness at local authority level.

50

References

• Small Area Estimation Project Report. Model-Based Small Area Estimation Series No.2, ONS, January 2003

• Developments in small area estimation in UK with focus in current research. Clarke, P., Mcgrath K., Chandra, H., Tzavidis, N. (2007). IASS Satellite Meeting on Small Area Estimation, Pisa.

• Model Based Estimates of Income for Middle Layer Super Output Areas 2004/05 Technical Report, ONS, September 2007

http://neighbourgood.statistics.gov.uk/HTMLDocs/images/Technical Report 2004_05 v2 - Final_tcm97-53513.pdf

http://neighbourhood.statistics.gov.uk/dissemination/MetadataDownloadPDF.do?downloadId=21704

• Development of improved estimation methods for local area unemployment levels and rates. Labour Market Trends, vol. 111, no 1

www.statistics.gov.uk/cci/article.asp?id=372

• Summary publication accompanying the publication of the 2003 unemployment estimates, November 2004

http://www.statistics.gov.uk/downloads/theme_labour/ALALFS/AnnexA.pdf
