danila filipponi simonetta cozzi istat, italy

Danila Filipponi

Simonetta Cozzi

ISTAT, Italy

Outlier Identification Procedures for Contingency Tables in Longitudinal Data

Roma, 8-11 July 2008

► Since December 2006, ISTAT has released a statistical register of the local units (LU) of enterprises (ASIA-LU), which supplies yearly information on local units; until 2001 this information was available only every ten years (Industry and Services Census).

► The register has been set up starting from an administrative/statistical information base of addresses, using statistical models to estimate the activity status and other attributes of the local units.

► ASIA-LU provides (mainly) the number of local units and of local unit employees by municipality and economic activity.

What is the problem?
Outlier in contingency tables
Non-parametric approach: Median Polish
Correlated count data
Correlated count data - REGEE
Correlated count data - GEE
Simulation of correlated count data
Outlier detection in ASIA-UL
Results

What is the problem?

► Because of the nature of the available information, selective editing is indispensable to identify possibly anomalous counts (LU/employees) in some combinations of the classification variables.

► The objective is to identify anomalous numbers of employees and/or local units classified by municipality and economic activity, taking into account the longitudinal information on LU, i.e. the local units registers (2004-2005) and the Census surveys (1991-1996-2001).

Outlier in contingency tables

The contingency table is:

province code   municipality code   NACE 2002   Year: 1991   ..   ..   2006
001             001                 15          $y_{11}$     ..   ..   $y_{1J}$
001             001                 17          …            …    …    …
107             23                  85          $y_{I1}$     ..   ..   $y_{IJ}$



► Outlying observations in a set of data are generally viewed as deviations from a model assumption:
the majority of the observations (inliers) are assumed to come from a selected model (null model);
a few units (outliers) are thought of as coming from a different model.

► The outlier identification problem is then translated into the problem of identifying those observations that lie in an outlier region defined according to the selected null model:

$$out(\alpha_i, P_i) = \{ x \in supp(P_i) : f_i(x) < K(\alpha_i) \}$$

$$K(\alpha_i) = \sup\{ K > 0 : P_i(\{ x : f_i(x) < K \}) \le \alpha_i \}$$

where $\mathcal{P}$ is a distribution family such that $P_i \in \mathcal{P}$ has density $f_i$.



Outliers in Contingency Tables

Consider $T$ categorical variables with possible outcomes $1, \dots, I_t$, $t = 1, \dots, T$. Each combination $i = (i_1, \dots, i_T)$, with $i_t = 1, \dots, I_t$, defines a cell of a contingency table; $I$ denotes the set of all cells.

Given a set of data, each observation belongs to one combination $i$, and the frequency counts of the cells can be denoted as $y = (y_i)_{i \in I}$.

Under a loglinear Poisson model, the cell counts $y_i$ are considered as realizations of independent Poisson variables $Y = (Y_i)_{i \in I}$ with expected values $\lambda = (\lambda_i)_{i \in I}$.

In a contingency table a cell count $y_i$ is viewed as an outlier if it occurs with a small probability under the null model.








Some Notation



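The values $\alpha_i$ should be chosen so that the probability that one or more outliers occur in the contingency table under the null model does not exceed a given value $\alpha$. Assuming all the $\alpha_i$ to be the same, it can be shown that

$$\alpha_i = 1 - (1 - \alpha)^{1/|I|}$$

where $|I|$ is the number of cells of the table.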
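As a quick numerical check of this adjustment, the sketch below evaluates $\alpha_i = 1 - (1 - \alpha)^{1/|I|}$ for a table with 80 cells (the size of a 4x4x5 table). The function name `cell_level` and the value 0.05 are illustrative assumptions, not taken from the original slides.

```python
def cell_level(alpha, n_cells):
    """Common per-cell level alpha_i such that, with n_cells independent cells,
    the probability of at least one outlier under the null model equals alpha."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_cells)


alpha_i = cell_level(0.05, 80)           # 80 cells, e.g. a 4x4x5 table
overall = 1.0 - (1.0 - alpha_i) ** 80    # recovers the table-wide level
print(alpha_i, overall)
```

The per-cell level is slightly larger than the simple division 0.05 / 80, because the formula accounts exactly for the product of the 80 cell-wise probabilities.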

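► Assuming a loglinear Poisson model, the outlier region for each cell count $y_i$ is defined as

$$out(\alpha_i, Poi(\lambda_i)) = \left\{ y \in N : \frac{\lambda_i^{y}}{y!} e^{-\lambda_i} < K(\alpha_i) \right\}$$

where $N$ is the set of all non-negative integers and

$$K(\alpha_i) = \sup\left\{ K > 0 : \sum_{y \in N} \frac{\lambda_i^{y}}{y!} e^{-\lambda_i} \, 1_{[0,K)}\!\left( \frac{\lambda_i^{y}}{y!} e^{-\lambda_i} \right) \le \alpha_i \right\}$$

► The cell count $y_i$ is then an $\alpha_i$-outlier if it lies in the $\alpha_i$-outlier region of the Poisson distribution with parameter $\lambda_i$.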
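A minimal sketch of how such a Poisson outlier region could be computed: counts are collected in order of increasing probability for as long as their total probability stays within the level, and whatever remains is the inlier region. The function names, the truncation point `y_max` and the choice to ignore exact ties between probabilities are assumptions made for this illustration.

```python
import math


def poisson_pmf(y, lam):
    """Poisson probability of count y, computed on the log scale for stability."""
    return math.exp(y * math.log(lam) - lam - math.lgamma(y + 1))


def poisson_inlier_bounds(lam, alpha):
    """Smallest and largest counts that are NOT in the alpha-outlier region."""
    y_max = int(lam + 10 * math.sqrt(lam) + 30)    # truncation point (assumption)
    pmf = [poisson_pmf(y, lam) for y in range(y_max + 1)]
    mass = max(0.0, 1.0 - sum(pmf))                # counts beyond y_max count as outliers
    outliers = set()
    # Add the least probable counts first while the total mass stays <= alpha.
    for y in sorted(range(y_max + 1), key=lambda v: pmf[v]):
        if mass + pmf[y] > alpha:
            break
        mass += pmf[y]
        outliers.add(y)
    inliers = [y for y in range(y_max + 1) if y not in outliers]
    return inliers[0], inliers[-1]


def is_outlier(y, lam, alpha):
    """True if the count y lies in the alpha-outlier region of Poisson(lam)."""
    low, high = poisson_inlier_bounds(lam, alpha)
    return y < low or y > high


print(poisson_inlier_bounds(10.0, 0.05))
```

Because the Poisson distribution is unimodal, the inlier region is an interval of consecutive counts, so two bounds are enough to describe it.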









► Loglinear models for contingency tables are Generalized Linear Models (GLM) where the expected cell count is

$$\lambda = (\lambda_i)_{i \in I} = \exp(X\beta)$$

with $X$ a full rank design matrix and $\beta$ a parameter vector.

► In practice, to define the $\alpha_i$-outlier region and identify the outlying cells, it is necessary to estimate the vector of parameters $\beta$.

► In the situation with only one measurement for each subject, i.e. without a correlation structure, the classical estimator for GLM is the maximum likelihood (ML) estimator. Because of the nature of the ML estimator, the regression parameter estimates can be highly influenced by the presence of outlying cells. Some robust alternatives have been proposed in the literature.



► A procedure that supplies robust estimates in the analysis of

contingency tables is the median polish method (Mosteller &

Tukey, 1977; Emerson & Hoaglin, 1983).


► Given a contingency table with two factors, if an additive model is assumed, the value $y_{ij}$ can be expressed as the sum of a constant term $m$, an effect $a_i$ for level $i$ of the row factor, an effect $b_j$ for level $j$ of the column factor, and a random term $e_{ij}$:

$$y_{ij} = m + a_i + b_j + e_{ij}$$

► The median polish procedure operates iteratively on the table, calculating and subtracting row and column medians, and ends when all the rows and columns have a median equal to zero.
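The iteration described above can be sketched as follows for a two-way table stored as a list of rows. The function name, the stopping tolerance and the small example table are illustrative assumptions; the sweep order (rows first, then columns) follows the usual formulation of median polish.

```python
from statistics import median


def median_polish(table, max_iter=20, tol=1e-9):
    """Return (overall, row_effects, column_effects, residuals) of an additive fit."""
    n_rows, n_cols = len(table), len(table[0])
    resid = [[float(v) for v in row] for row in table]
    overall, row_eff, col_eff = 0.0, [0.0] * n_rows, [0.0] * n_cols
    for _ in range(max_iter):
        change = 0.0
        for i in range(n_rows):                      # subtract row medians
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
            change = max(change, abs(m))
        m = median(col_eff)                          # move their centre to the overall term
        overall += m
        col_eff = [c - m for c in col_eff]
        for j in range(n_cols):                      # subtract column medians
            m = median(resid[i][j] for i in range(n_rows))
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
            change = max(change, abs(m))
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
        if change <= tol:                            # all row and column medians are zero
            break
    return overall, row_eff, col_eff, resid


# Additive table (overall 10, rows -1/0/1, columns -2/0/2) with one planted value.
table = [[7, 9, 11], [8, 50, 12], [9, 11, 13]]
overall, row_eff, col_eff, resid = median_polish(table)
print(overall, row_eff, col_eff, resid[1][1])
```

In this example the planted cell keeps a large residual while the fitted effects are those of the underlying additive table, which is the resistance property that motivates the method.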



► There are several ways to extend GLMs to take into account the correlation between repeated responses: the marginal modeling approach (GEE), random effects models for categorical responses (GLMM), and transitional models.

In longitudinal studies, repeated data look like:

$$Y_{it}, \quad i = 1, \dots, K, \quad t = 1, \dots, n_i$$

where

$$E(Y_{it}) = \mu_{it}(\beta), \quad Var(Y_{it}) = \phi\, v(\mu_{it}) \quad \text{and} \quad Corr(Y_{it}, Y_{it'}) = R_i(\rho)_{tt'}$$

► Repeated responses on the same subject tend to be more alike (generally positively correlated) than responses on different subjects. Standard statistical procedures that ignore this correlation may produce invalid results.


Correlated count data - GEE

► A reasonable alternative to ML estimation for longitudinal count data is a multivariate generalization of the quasi-likelihood.

Let
$Y_i = (Y_{i1}, \dots, Y_{in_i})'$, $i = 1, \dots, K$, be the $n_i$ vector of outcomes;
$X_i = (x_{i1}, \dots, x_{in_i})'$, $i = 1, \dots, K$, be the $n_i \times p$ matrix of covariates.

► Rather than assuming a distribution for the response variable $Y$, in the quasi-likelihood method only the moments are specified:
$E(Y_{it}) = \mu_{it}$, $g(\mu_{it}) = x_{it}'\beta$: the mean, which is a function of the linear predictor;
$Var(Y_{it}) = \phi\, v(\mu_{it})$: the variance, which depends on the mean and on a scale parameter $\phi$.


In the quasi-likelihood method, the estimates of the regression and nuisance parameters are obtained by setting the generalized quasi-score function to zero, which gives the Generalized Estimating Equation (GEE):

$$\sum_{i=1}^{K} \frac{\partial \mu_i'}{\partial \beta}\, V_i^{-1}(\beta, \rho)\,(Y_i - \mu_i) = 0$$

The covariance matrix is $V_i = \phi\, A_i^{1/2} R_i(\rho) A_i^{1/2}$, where:
$A_i$ is an $n_i \times n_i$ diagonal matrix with $v(\mu_{ij})$ as the $j$th diagonal element;
$R_i(\rho)$ is an $n_i \times n_i$ correlation matrix.
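To make the structure of the working covariance concrete, the sketch below builds $V_i = \phi\, A_i^{1/2} R_i(\rho) A_i^{1/2}$ for one subject. It assumes the Poisson variance function $v(\mu) = \mu$ and an AR(1) working correlation; the function names and the example means are illustrative only.

```python
import math


def ar1_correlation(n, rho):
    """n x n AR(1) correlation matrix: entry (s, t) equals rho ** |s - t|."""
    return [[rho ** abs(s - t) for t in range(n)] for s in range(n)]


def working_covariance(mu, rho, phi=1.0):
    """V = phi * A^(1/2) R(rho) A^(1/2), with A = diag(v(mu)) and v(mu) = mu (assumed)."""
    n = len(mu)
    corr = ar1_correlation(n, rho)
    sd = [math.sqrt(m) for m in mu]      # square roots of the diagonal of A
    return [[phi * sd[s] * corr[s][t] * sd[t] for t in range(n)] for s in range(n)]


V = working_covariance([4.0, 9.0, 16.0], rho=0.5)
for row in V:
    print(row)
```

The diagonal of `V` reproduces the variances $\phi\, v(\mu_{ij})$, and the off-diagonal entries shrink with the time lag, as expected under AR(1).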



Correlated count data - REGEE

► Because the QL estimators have properties similar to the ML

estimators, the regression and the nuisance parameters can be

influenced by outliers.

► In order to provide a robust estimate of $\beta$, Preisser and Qaqish (1999) introduced a generalization of GEE that includes weights in the estimating equations to downweight influential observations.

► They define the resistant generalized estimating equation (REGEE) as:

$$\sum_{i=1}^{K} \frac{\partial \mu_i'}{\partial \beta}\, V_i^{-1}(\beta, \rho)\,\left[\, W_i(X_i, Y_i, \beta, \phi)\,(Y_i - \mu_i) - c_i \,\right] = 0$$



where:

$W_i(X_i, y_i, \beta, \phi)$ is an $n_i \times n_i$ diagonal weight matrix containing the robustness weights $w_{it}$. The weights have been chosen as a function of the Pearson residuals, $r_{it} = (y_{it} - \mu_{it}) / v(\mu_{it})^{1/2}$, to ensure robustness with respect to outlying points in the y-space. We use as weight function $w(r) = \exp(-(r/a)^2)$.

$c_i = E(\psi_i)$ is a bias-eliminating constant determined by the marginal distribution of $Y$, where $\psi_i = W_i (y_i - \mu_i)$.
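A small sketch of the weighting step, assuming the Poisson variance function $v(\mu) = \mu$. The tuning constant `a` is not given in the slides, so the value used in the example is purely hypothetical, as are the function names.

```python
import math


def pearson_residual(y, mu):
    """Pearson residual (y - mu) / v(mu)^(1/2) with v(mu) = mu (Poisson, assumed)."""
    return (y - mu) / math.sqrt(mu)


def robustness_weight(r, a):
    """Weight function w(r) = exp(-(r / a)^2); a is a tuning constant."""
    return math.exp(-((r / a) ** 2))


a = 3.0                                   # hypothetical tuning constant
for y in (9, 12, 30):                     # observed counts for a cell with mean 9
    r = pearson_residual(y, 9.0)
    print(y, round(r, 3), round(robustness_weight(r, a), 3))
```

A count equal to its mean keeps weight 1, while counts far from the mean receive weights close to 0 and therefore contribute little to the estimating equations.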








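► Robust estimators are also needed for the nuisance parameters $\phi$ and $\rho$, to avoid consequences on the regression parameter estimates.

► If $r_{it} = (\psi_{it} - c_{it}) / v(\mu_{it})^{1/2}$, with $\psi_{it} = w_{it}(y_{it} - \mu_{it})$ and $c_{it}$ the $t$th element of $c_i$, the moment estimates of $\phi$ and $\rho$ are:

$$\hat\phi = \sum_{i=1}^{k} \sum_{t=1}^{n_i} r_{it}^2 \Big/ \Big( \sum_{i=1}^{k} \sum_{t=1}^{n_i} w_{it} - p \Big)$$

and

$$\hat\rho = \sum_{i=1}^{k} \sum_{t=1}^{n_i - 1} r_{it}\, r_{i,t+1} \Big/ \Big( \hat\phi \Big( \sum_{i=1}^{k} \sum_{t=1}^{n_i - 1} w_{it} - p \Big) \Big)$$

where an autoregressive AR(1) working correlation matrix has been specified (i.e. $Corr(Y_{ij}, Y_{i,j+t}) = \rho^{t}$, $t = 0, \dots, n_i - j$).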
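The sketch below shows one reading of these moment estimators: $\hat\phi$ as the sum of squared residuals over (sum of weights minus $p$), and $\hat\rho$ as the sum of lag-one residual products over $\hat\phi$ times (sum of the weights of the first $n_i - 1$ occasions minus $p$). This reading, the function name and the toy inputs are assumptions made for illustration.

```python
def moment_estimates(resid, weights, p):
    """Moment estimates (phi, rho) for an AR(1) working correlation.

    resid   -- list of per-subject lists of residuals r_it
    weights -- list of per-subject lists of robustness weights w_it
    p       -- number of regression parameters
    """
    sum_sq = sum(r * r for subject in resid for r in subject)
    sum_w = sum(w for subject in weights for w in subject)
    phi = sum_sq / (sum_w - p)

    # Products of residuals at consecutive occasions within each subject.
    lag_products = sum(subject[t] * subject[t + 1]
                       for subject in resid
                       for t in range(len(subject) - 1))
    sum_w_lag = sum(w for subject in weights for w in subject[:-1])
    rho = lag_products / (phi * (sum_w_lag - p))
    return phi, rho


resid = [[1.0, -1.0, 1.0], [2.0, 0.0, -2.0]]     # toy residuals, two subjects
weights = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]     # all weights equal to 1
print(moment_estimates(resid, weights, p=1))
```

With all weights equal to 1 the expressions reduce to the usual unweighted moment estimators of the scale and of the lag-one correlation.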








Outlier identification procedures, based on the parameters previously estimated with the three different estimation methods, have been compared in a simulation study.

In the study, 4x4x5 tables are simulated:

$$Y_{ijt}, \quad i = 1, \dots, 4, \quad j = 1, \dots, 4, \quad t = 1, \dots, 5$$

where

$$E(Y_{ijt}) = \mu_{ijt} = \exp(x_{ij}'\beta), \quad Var(Y_{ijt}) = \mu_{ijt}, \quad Corr(Y_{ijt}, Y_{ij,t+h}) = (R_{ij})_{t,t+h}$$

The parameter vector is $\beta = (0.4, 0.6, -1, -0.3, 0.4, 1)$, and $x_{ij}$ is a row of the design matrix $X$ obtained as a dummy coding.








► Correlated Poisson variables are simulated using the overlapping sum (OS) algorithm (Park and Shin, 1998).

► If $Y$ is a random vector with mean $\mu_y$ and covariance matrix $\Sigma_y$, in the OS method $Y$ is decomposed as

$$Y = TX$$

where $T$ is an $n \times l$ matrix of 0's and 1's and $X$ is an $l$-vector of independent Poisson variables. The dimension $l$ depends on the structure of the covariance matrix, and the matrix $T$ is defined in such a way that $Y$ has the proper mean vector and covariance matrix.

► Once $T$ is defined, the means $\mu_x$ of $X$ can be obtained by solving the equation

$$\mu_y = T \mu_x$$
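The idea can be illustrated with the simplest overlapping-sum layout, in which every component of $Y$ shares one common Poisson term: $Y_t = X_0 + X_t$. This layout, the function names and the numerical values are assumptions chosen for the example; the general construction of $T$ in Park and Shin (1998) is not reproduced here. Since the components of $X$ are independent Poisson variables, $Y = TX$ has mean $T\mu_x$ and covariance $T\,diag(\mu_x)\,T'$.

```python
import math
import random


def os_design(n):
    """T for the common-component layout: column 0 is shared, then an identity block."""
    return [[1] + [1 if s == t else 0 for s in range(n)] for t in range(n)]


def os_means(mu_y, common_cov):
    """Solve mu_y = T mu_x for that layout; X_0 carries the common covariance."""
    if not 0 <= common_cov <= min(mu_y):
        raise ValueError("common covariance must lie between 0 and min(mu_y)")
    return [common_cov] + [m - common_cov for m in mu_y]


def implied_moments(T, mu_x):
    """Mean T mu_x and covariance T diag(mu_x) T' of Y = T X."""
    n, l = len(T), len(mu_x)
    mean = [sum(T[t][k] * mu_x[k] for k in range(l)) for t in range(n)]
    cov = [[sum(T[s][k] * T[t][k] * mu_x[k] for k in range(l)) for t in range(n)]
           for s in range(n)]
    return mean, cov


def poisson_draw(lam, rng):
    """Poisson draw by multiplying uniforms (Knuth's method); fine for small means."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate(T, mu_x, rng):
    """One draw of the correlated count vector Y = T X."""
    x = [poisson_draw(m, rng) for m in mu_x]
    return [sum(T[t][k] * x[k] for k in range(len(x))) for t in range(len(T))]


T = os_design(3)
mu_x = os_means([5.0, 6.0, 7.0], common_cov=2.0)
print(implied_moments(T, mu_x))
print(simulate(T, mu_x, random.Random(0)))
```

In this layout all pairs share the same covariance; other correlation structures, such as AR(1), need more columns in $T$, which is why the dimension $l$ depends on the covariance structure.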








Simulation scheme

rho_Y   proportion of outliers   type of replaced value   number of simulated tables   table dimension
0.1     0.01                     low                      100                          4x4x5
0.1     0.01                     medium                   100                          4x4x5
0.1     0.01                     high                     100                          4x4x5
0.1     0.05                     low                      100                          4x4x5
0.1     0.05                     medium                   100                          4x4x5
0.1     0.05                     high                     100                          4x4x5
0.8     0.01                     low                      100                          4x4x5
0.8     0.01                     medium                   100                          4x4x5
0.8     0.01                     high                     100                          4x4x5
0.8     0.05                     low                      100                          4x4x5
0.8     0.05                     medium                   100                          4x4x5
0.8     0.05                     high                     100                          4x4x5

Outliers in the simulated tables are produced by replacing the selected cell $Y_{ijt}$ by

Max(inl($\alpha$, $\mu_{ij}$)) + 1 or Min(inl($\alpha$, $\mu_{ij}$)) - 1

where inl($\alpha$, $\mu_{ij}$) is the inlier region (the complement of the outlier region) and $\alpha$ has been chosen as ($10^{-2}$, $10^{-4}$, $10^{-8}$).









Results

ESTIMATE-REGEE

                      rho=0.1, %outlier=0.01  rho=0.1, %outlier=0.05  rho=0.8, %outlier=0.01  rho=0.8, %outlier=0.05
Parameter  Simulated  10^-2   10^-4   10^-8   10^-2   10^-4   10^-8   10^-2   10^-4   10^-8   10^-2   10^-4   10^-8
Intercept             1.988   2.027   1.992   2.011   2.040   2.080   1.987   2.025   2.002   2.029   1.999   2.007
var1 - 1   -0.300     -0.274  -0.297  -0.295  -0.264  -0.286  -0.319  -0.280  -0.325  -0.291  -0.298  -0.271  -0.269
var1 - 2   0.400      0.421   0.394   0.392   0.404   0.379   0.359   0.415   0.412   0.421   0.357   0.407   0.408
var1 - 3   1.000      1.008   0.988   0.993   0.978   0.959   0.944   0.998   1.002   1.002   0.939   0.973   0.984
var2 - 1   0.400      0.389   0.395   0.409   0.379   0.383   0.394   0.394   0.368   0.396   0.384   0.392   0.395
var2 - 2   0.600      0.587   0.586   0.603   0.591   0.581   0.569   0.610   0.566   0.574   0.599   0.596   0.585
var2 - 3   -1.000     -1.001  -0.989  -0.952  -0.950  -0.956  -0.906  -1.008  -0.986  -0.966  -0.977  -0.955  -0.907
rho                   -0.077  -0.087  -0.079  -0.084  -0.084  -0.105  0.427   0.291   0.267   0.112   0.153   0.016

ESTIMATE-GEE

                      rho=0.1, %outlier=0.01  rho=0.1, %outlier=0.05  rho=0.8, %outlier=0.01  rho=0.8, %outlier=0.05
Parameter  Simulated  10^-2   10^-4   10^-8   10^-2   10^-4   10^-8   10^-2   10^-4   10^-8   10^-2   10^-4   10^-8
Intercept             1.975   2.002   1.978   1.968   1.996   1.979   1.981   1.995   1.984   1.997   1.955   1.921
var1 - 1   -0.300     -0.276  -0.305  -0.304  -0.293  -0.311  -0.339  -0.287  -0.325  -0.304  -0.349  -0.270  -0.324
var1 - 2   0.400      0.424   0.403   0.395   0.419   0.394   0.395   0.415   0.422   0.425   0.361   0.417   0.425
var1 - 3   1.000      1.017   1.005   1.004   1.008   0.987   1.015   1.005   1.021   1.013   0.961   1.012   1.044
var2 - 1   0.400      0.393   0.400   0.412   0.397   0.409   0.420   0.396   0.385   0.405   0.409   0.391   0.431
var2 - 2   0.600      0.593   0.594   0.607   0.608   0.610   0.607   0.608   0.581   0.582   0.610   0.615   0.604
var2 - 3   -1.000     -1.023  -1.037  -0.980  -1.031  -1.030  -1.036  -1.032  -1.031  -0.989  -1.038  -1.026  -1.002
rho                   -0.066  -0.083  -0.068  -0.080  -0.066  -0.085  0.506   0.529   0.346   0.302   0.587   0.390

Proportion of tables whose outliers are correctly identified

[Figure: four bar-chart panels (ρ=0.1, %outliers=0.05; ρ=0.1, %outliers=0.01; ρ=0.8, %outliers=0.05; ρ=0.8, %outliers=0.01). Each panel compares the methods mp, gee and regee for p = 10^-2, 10^-4 and 10^-8; vertical axis from 0 to 100; legend categories: 50-70, >70.]



The outlier identification procedures have been applied in the control process of the Statistical Register of the Local Units (ASIA-UL).







Number of outlying cells identified by estimation methods

          number of cells         % of cells
outlier   MP     GEE    REGEE     MP      GEE     REGEE
0         952    927    983       84.25   82.04   86.99
1         178    203    147       15.75   17.96   13.01
total     1130   1130   1130      100     100     100

Concordances/discordances in the outlier identification procedures (% of cells)

MP (rows) by GEE (columns)
          0       1       total
0         77.08   7.16    84.24
1         4.96    10.8    15.76
total     82.04   17.96   100.00

MP (rows) by REGEE (columns)
          0       1       total
0         82.12   2.13    84.25
1         4.87    10.88   15.75
total     86.99   13.01   100.00

province code   municipality code   NACE 2002   MP   REGEE   1991   1996   2001   2004   2005
055             004                 36          1    0       27     15     10     9      9
055             004                 70          1    0       1      4      14     12     12
055             004                 74          1    0       108    105    108    138    148
055             023                 45          1    0       816    833    948    1005   1068
055             032                 17          1    0       190    233    267    273    266
055             032                 33          0    1       229    202    229    160    162
055             032                 52          0    1       4135   4135   4129   3986   4299
055             022                 14          0    1       7      18     1      5      9
055             022                 25          0    1       97     70     66     45     56
055             023                 15          0    1       223    274    196    236    211
