anatomy of a rare event model: sampling, estimation, and ...files.meetup.com/1772780/anatomy of a...

Anatomy of a rare event model: sampling, estimation, and inference

Jagat Sheth

St. Louis RUG, November 2012

Abstract

This presentation will discuss typical stages in the workflow of modeling rare events. It will include a little theory along with R code for sampling, estimation, prediction, and fast grouping for ad hoc diagnostics.

Contents

1 Introduction

2 Workflow

3 R DBI

3.1 RODBC

3.2 Sampling queries

4 Design Weights

4.1 R code

5 Modeling

5.1 Specification

5.2 Estimation

5.3 Prediction etc.

6 Fast grouping

A Code

B References

1 Introduction

• Rare events: Rare event data is characterized by an abundance of nonevents that far exceeds the number of events, e.g., 10,000 events and 1,990,000 nonevents yielding a tiny 0.5% event rate

• Issues: Processing and analyzing all nonevents is costly in terms of computer time and resources, and in terms of the efficiency of the resulting rare event model

• Three-step solution: Use a designed sample

1. sample all events

2. sample down the nonevents to a multiple of the number of events

3. weight the reduced sample to reflect the population event rate

The three-step approach is easy to understand; however, it is only useful with appropriate statistical corrections.

Here we’ll discuss a little theory along with simple R code for

• rare event sampling via RODBC + sample design weighting

• estimation and inference for modeling and prediction

– non-linear models via nlm and Jim Lindsey’s gnlm

– robust standard errors via sandwich

– frequentist vs. Bayesian

• fast grouping for ad hoc diagnostics via the R package data.table

http://cran.r-project.org/package=RODBC

http://www.commanster.eu/rcode.html

http://cran.r-project.org/package=sandwich

http://cran.r-project.org/package=data.table

2 Data + modeling workflow

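[Figure: workflow diagram. Database -> (R DBI, e.g. RODBC) -> stratified random sample -> small subset (TRAIN) and big subset (TEST). TRAIN -> (R's nlm) -> Model -> (predict) -> TEST -> (R's data.table fast grouping) -> X1 ... Xn, with an "update" arrow closing the loop.]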

E.g. Company XYZ currently uses an IBM DB2 V9.2 database server

Table      Total     Events     Rate
Big data   848 Mil   13.6 Mil   1.6%

Table has 200+ columns holding monthly event history data since Jan 2000.

On a typical day, it takes approx. 1 hr to pull all events, and 2 hr to randomly sample twice the number of non-events.

3 R Database Interface (DBI)

• Several options exist for communication between R and relational database management systems

– RODBC, RMySQL, ROracle, RPostgreSQL, RSQLite, RJDBC, RMonetDB, etc.

• Here, we used RODBC to talk to an IBM DB2 database server

Note: Installation of RODBC requires an ODBC Driver Manager. Windows normally comes with one. Mac OS X since 10.2 has shipped with iODBC. For other systems, the driver manager of choice is unixODBC, with sources downloadable from http://www.unixodbc.org.

3.1 RODBC

## Open connection to your ODBC database

require(RODBC)

con <- odbcConnect(dsn, uid = "xxx", pwd = "xxx", ...)

## Submit an SQL query

pop <- sqlQuery(con, getTotals)

cases <- sqlQuery(con, getStrata(1))

controls <- sqlQuery(con, getStrata(0,frac=0.03))

res <- rbind(cases, controls)

close(con)

Build up SQL query using paste in R


3.2 Sampling queries

getTotals <- paste("SELECT COUNT(*) as total,",

"SUM(event) as cases",

"FROM ... WHERE ...")

getStrata <- function(val=1, frac=NULL){

myQuery <- paste("SELECT", ...,

"FROM", ...,

"WHERE", "event=", val, "AND ...")

if(!is.null(frac))

myQuery <- paste(myQuery, "AND RAND() <=", frac)

myQuery

}

RAND() can be replaced with something more efficient if available on your DB.
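As a minimal sketch of the paste-based approach above, the block below fills in the elided pieces with placeholder values so the query strings can be inspected without a database. The table name loans, the column list "*", and the function name get_strata are hypothetical; only the event filter and the RAND() clause come from the slides.

```r
## Build a stratum query as a character string (no database needed).
## 'loans' and '*' are hypothetical placeholders for the real table/columns.
get_strata <- function(val = 1, frac = NULL, table = "loans", cols = "*") {
  q <- paste("SELECT", cols, "FROM", table, "WHERE event =", val)
  ## Optionally keep each row with probability 'frac'
  if (!is.null(frac))
    q <- paste(q, "AND RAND() <=", frac)
  q
}

q_cases    <- get_strata(1)               # all events
q_controls <- get_strata(0, frac = 0.03)  # roughly 3% of nonevents
print(q_cases)
print(q_controls)
```

Printing the strings before passing them to sqlQuery is a cheap way to catch spacing or quoting mistakes in the generated SQL.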


R function to help

require(sqlsurvey)

sqlsubst("SELECT %%vars%% FROM %%table%% GROUP BY %%strat%%",

list(vars=varnames, table=tablequery, strat=strata))

from package sqlsurvey (experimental on R-forge)


http://sqlsurvey.r-forge.r-project.org/

4 Design Weights

The design weights represent the inverse probability of unit selection into the sample derived from stratified simple random sampling. If a population of size N is stratified by a binary variable Y with K ones, and the sample size is n with k ones, then

$$ s_1 = P(\text{unit in case-control} \mid Y = 1) = 1 $$

$$ s_0 = P(\text{unit in case-control} \mid Y = 0) = \frac{n - k}{N - K} $$

and

$$ w_1 = 1/s_1 = 1 \tag{1} $$

$$ w_0 = 1/s_0 = \frac{N - K}{n - k} \tag{2} $$

The design weights can be generated by a single line of code

$$ w_i = w_1 Y_i + w_0 (1 - Y_i) \tag{3} $$

4.1 Simple R code

## Design weights for case-control

N <- pop$TOTAL

K <- pop$CASES

n <- nrow(res)

k <- sum(res$event)

res$wt <- with(res,1*event+(N-K)/(n-k)*(1-event))
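To make the weights concrete, here is a small worked sketch using the population counts from the introduction (10,000 events among 2,000,000 rows). The choice of sampling nonevents down to twice the number of events is an assumption for illustration, and the vector event stands in for res$event.

```r
## Population counts from the introductory example
N <- 2000000   # population size
K <- 10000     # events in the population

## Assumed design: keep all events, sample nonevents down to 2x the events
k <- K
n <- k + 2 * k
event <- rep(c(1, 0), c(k, n - k))

## Design weights: 1 for events, (N - K)/(n - k) for nonevents
w0 <- (N - K) / (n - k)
wt <- 1 * event + w0 * (1 - event)

w0                         # weight of each sampled nonevent
sum(wt)                    # weighted sample size, should equal N
weighted.mean(event, wt)   # weighted event rate, should equal K/N
```

The weighted sample reproduces both the population size and the population event rate, which is the property the weighted estimation relies on.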


Alternatively, one can use normalized weights (which sum to n)

$$ w_1 = \frac{\tau}{\bar{y}} \quad \text{and} \quad w_0 = \frac{1 - \tau}{1 - \bar{y}} $$

where $\tau = K/N$ and $\bar{y} = k/n$ are the fraction of events in the population and sample, respectively.

tau <- K/N; ybar <- k/n

See (King and Zeng, 2001) or (Breslow, 1996) for further details.
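A short sketch of the normalized version, again with the introductory counts and an assumed 2:1 ratio of nonevents to events in the sample (the vector event is a stand-in for res$event). It checks that the normalized weights sum to n and keep the same nonevent-to-event weight ratio as the unnormalized design weights.

```r
## Same illustrative design: all events plus twice as many nonevents (assumed)
N <- 2000000; K <- 10000
k <- K; n <- 3 * k
event <- rep(c(1, 0), c(k, n - k))

tau  <- K / N   # event fraction in the population
ybar <- k / n   # event fraction in the sample

## Normalized weights
w1 <- tau / ybar
w0 <- (1 - tau) / (1 - ybar)
wt_norm <- w1 * event + w0 * (1 - event)

sum(wt_norm)   # should equal n
w0 / w1        # same ratio as (N - K)/(n - k)
```

Because the two weighting schemes differ only by a constant factor, they give the same point estimates; the scale matters mainly for software that interprets the weight total as a sample size.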


5 Modeling Workflow

• Recall we are using the stratified case-control sample in conjunction with its associated design weights as an efficient representation of the (much larger) population

• Regressions, aggregations, and summary statistics are now WEIGHTED by the design weight

• Iterative process of model development

– an initial model specification is estimated or updated on a (relatively) small TRAINING set

– score models on a larger (out-of-sample and possibly out-of-time) TEST set

– group (aggregate) scores and actuals to assess performance by several key driver dimensions X1, . . . , Xn

– iterate these steps until satisfied, making sure model strengths and weaknesses are well understood

[Figure: modeling loop from the workflow diagram. TRAIN -> (R's nlm) -> Model -> (predict) -> TEST -> (R's data.table fast grouping) -> X1 ... Xn, with an "update" arrow closing the loop.]

ALL MODELS ARE WRONG BUT SOME MAY BE USEFUL

George E.P. Box


5.1 Model specification

Weighted Log-likelihood

Let f(y; µ, θ) = probability density for Y with location parameter µ. Binomial or Poisson are typical for rare events. Model the location parameter as

$$ \mu_i = g(\beta, x_i) \tag{4} $$

for some (parametric) function g, drivers x_i, and coefficients β. Then use maximum likelihood to estimate β (and θ if f needs it; not required for binomial or Poisson).

Maximize the weighted log-likelihood

$$ \log L_w(\beta, \theta \mid y) = \sum_i w_i \log f(y_i; \mu_i, \theta) $$

where the weights w_i are given in Equation 3 and log is the natural log.
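As a worked special case (added for illustration; it follows directly from the Poisson density and is not taken from the slides), with a Poisson response of mean µ_i the weighted log-likelihood becomes

$$ \log L_w(\beta \mid y) = \sum_i w_i \left( y_i \log \mu_i - \mu_i - \log y_i! \right), \qquad \mu_i = g(\beta, x_i). $$

The last term does not depend on β, so it can be dropped when maximizing.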


Advantages

• All that needs to be done is calculate w_i in Equation 3, choose it as the weight in your computer program, and then run a model, e.g. logistic regression.

• Weighting can outperform alternative procedures such as prior correction when both a large sample is available and the functional form is misspecified.

Disadvantages

• The usual method of computing standard errors is severely biased.

One way to address this disadvantage is by using empirical design-based standard errors, e.g. the Huber-White sandwich covariance estimator or a complex survey design package.
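The recipe and the sandwich correction can be sketched end to end on simulated data. Everything below is an illustrative assumption rather than the presentation's model: the population is simulated, the coefficients (-5, 0.8) are made up, a weighted logistic regression via glm stands in for the nonlinear model, and the sandwich is computed by hand in base R.

```r
set.seed(7)

## Simulated population with a rare binary event (hypothetical coefficients)
N <- 100000
x <- rnorm(N)
y <- rbinom(N, 1, plogis(-5 + 0.8 * x))

## Designed sample: all events, about 3% of nonevents
keep <- y == 1 | runif(N) < 0.03
s <- data.frame(x = x[keep], y = y[keep])

## Design weights as in Equation 3
K <- sum(y); n <- nrow(s); k <- sum(s$y)
s$wt <- ifelse(s$y == 1, 1, (N - K) / (n - k))

## Weighted logistic regression (warning about non-integer weights suppressed)
fit <- suppressWarnings(glm(y ~ x, family = binomial, weights = wt, data = s))

## Huber-White sandwich: bread %*% meat %*% bread
X     <- model.matrix(fit)
score <- s$wt * (s$y - fitted(fit)) * X   # per-observation weighted scores
bread <- vcov(fit)                        # inverse weighted information
V     <- bread %*% crossprod(score) %*% bread

cbind(estimate  = coef(fit),
      se.model  = sqrt(diag(bread)),
      se.robust = sqrt(diag(V)))
```

The weighted fit should land close to the population coefficients, while the two standard error columns differ, which is the reason for reporting the empirical version.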


http://cran.r-project.org/package=survey

Example spec of a mortgage prepayment model

Two-component model for an FHA portfolio, with each component modeled as a multiplicative hazard assuming a Poisson distribution.

spec <- function(p){

HT <- p[1] + ll.haz(age, p[2]) + log(seasfix30) +

mono.haz(chgequity,p[7],p[8]) + p[15]*lockin

RF <- p[3] + log(expit(eit, p[4],p[5])) +

ll.haz(age,p[6]) + p[9]*(vint==1) + p[10]*(vint==2) +

p[11]*dr12 + p[12]*log(zato) +

log(expit(fico,p[13],p[14])) +

p[16]*burnout + p[17]*back

exp(HT) + exp(RF)

}

HT = ‘housing turnover’, RF = ‘refinance’

[Figure: four panels.
RF: S-curve. x: economic incentive at time t (EIT = NOTE − MKT); y: prepay probability (SMM).
HT: Aging curve. x: loan age (loan term of 360 months); y: prepay probability (SMM).
RF: Aging multiplier. x: loan age; y: multiplier (multipliers normalized at age=30).
HT: Chgequity multiplier. x: chgequity = oltv/cltv; y: multiplier (multipliers normalized at chgequity=1.25).]

## log hazard for log-logistic distribution

ll.haz <- function(t,m,s){

-(log(t) + log(1 + exp(m/s)*t^(-1/s)))

}

## monotonic hazard

mono.haz <- function(t,a,b){

- exp(a)*exp(-exp(b)*t)

}

## inverse-logit (S-curve)

expit <- function(x,a,b) {

1/(1 + exp(-(a + b*x)))

}


5.2 Estimation via nlm

• The model spec is highly nonlinear, so neither linear nor generalized linear model methods of estimation apply.

• Here, spec is parametric and relatively parsimonious, so maximum likelihood estimation was feasible.

• The basic procedure for obtaining maximum likelihood estimates of the unknown parameters in the likelihood function is to solve the score equations obtained by setting the first derivative of the (weighted) log-likelihood function, with respect to these parameters, equal to zero.

• The most common iterative procedure to solve them is some variation of Newton-Raphson, which requires the second derivatives of the log-likelihood function.

• R’s nlm function carries out a minimization using a Newton-type algorithm and calculates both sets of derivatives numerically, so that only the (negative) log-likelihood need be supplied.
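As a minimal sketch of the last point (not the presentation's model), the block below hands nlm only a negative weighted log-likelihood for a simple Poisson regression. The simulated data, the arbitrary weights w, and the name negll are illustrative assumptions; glm with the same weights is used as a cross-check.

```r
set.seed(1)

## Simulated data with arbitrary positive weights (illustration only)
n <- 500
x <- rnorm(n)
y <- rpois(n, exp(-1 + 0.5 * x))
w <- runif(n, 1, 3)

## Negative weighted Poisson log-likelihood with mu = exp(b1 + b2 * x)
negll <- function(beta) {
  mu <- exp(beta[1] + beta[2] * x)
  -sum(w * dpois(y, mu, log = TRUE))
}

## nlm computes gradient and Hessian numerically
fit <- nlm(negll, p = c(0, 0), hessian = TRUE)

## Cross-check against weighted glm
ref <- glm(y ~ x, family = poisson, weights = w)
cbind(nlm = fit$estimate, glm = coef(ref))
```

For a model this simple glm is the better tool; the point is that the same negll pattern still works when mu is an arbitrary nonlinear function of the parameters.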

We used Jim Lindsey’s gnlm package, which provides a variety of functions to fit linear and nonlinear regression with a large selection of distributions. Also see (Lindsey, 1999).

attach(train)

zp <- gnlr(y,"Poisson",mu=spec,pmu=...,wt=...)

cov.empirical <- sandwich(zp)

se.empirical <- sqrt(diag(cov.empirical))

detach(train)


cbind(estimate=zp$coef, se.model=zp$se, se.empirical)

estimate se.model se.empirical

p[1] 0.33801 0.58616 0.168391

p[2] 2.39070 0.71086 0.172019

p[3] -0.64416 0.80647 0.282599

p[4] -2.42126 0.76439 0.274128

p[5] 2.32347 0.95267 0.309260

p[6] 2.81824 0.29312 0.088307

p[7] 5.55685 2.18565 0.618194

p[8] 1.43363 0.49618 0.137823

p[9] 1.96005 0.41008 0.153742

p[10] 1.09278 0.41019 0.154422

p[11] -2.43484 2.12904 0.757065

p[12] 1.01519 0.26680 0.108856

p[13] -8.41495 6.55427 2.239078

p[14] 9.88181 8.23264 2.835693

p[15] 0.27840 0.67309 0.162903

p[16] -0.01251 0.01226 0.004041

p[17] -0.15452 0.37606 0.128069


5.3 Prediction, Validation, and Inference

Prediction

• Generate model’s predictions on newdata, e.g. a TEST set

• Unlike objects of class ‘lm’ or ‘glm’, there is no predict.gnlm method

• Do this from scratch by evaluating an R call or expression of this model spec, in the environment newdata (a data frame), along with an optional enclosure

encl <- list2env(list(p=coef(zp)))

pred <- eval(body(spec), envir=newdata, enclos=encl)

Note: eval copies newdata into a temporary environment (with enclosure ‘enclos’), and the temporary environment is used for evaluation.
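The eval pattern above can be tried on a toy model. In this sketch toy_spec, the column x, and the coefficient values are hypothetical stand-ins for spec, the real covariates, and coef(zp).

```r
## Toy model spec: covariate 'x' is looked up in the data, 'p' in the enclosure
toy_spec <- function(p) exp(p[1] + p[2] * x)

newdata <- data.frame(x = c(0, 1, 2))
encl <- list2env(list(p = c(0.5, -1)))   # hypothetical coefficient values

## Evaluate the function body with newdata as the environment
pred <- eval(body(toy_spec), envir = newdata, enclos = encl)
pred
```

Variables are resolved in newdata first and then in encl, so the same spec can be scored on any data frame that has the required columns.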


• Alternatively, try fnenvir from package rmutil

func <- fnenvir(spec,.envir=newdata)

pred <- func(p=coef(zp))

fnenvir finds the covariates and parameters in a function and can modify it so that the covariates used in it are found in the data object specified by .envir.

This avoids copying newdata into a temporary environment for evaluation.


fnenvir(spec, .envir=newdata)

model function:

structure({

HT <- p[1] + ll.haz(newdata$age, p[2]) + log(newdata$seasfix30) +

mono.haz(newdata$chgequity, p[7], p[8]) + p[15] * lockin

RF <- p[3] + log(expit(newdata$eit, p[4], p[5])) + ll.haz(newdata$age,

p[6]) + p[9] * (newdata$vint == 1) + p[10] * (newdata$vint ==

2) + p[11] * newdata$dr12 + p[12] * log(newdata$zato) +

log(expit(newdata$fico, p[13], p[14])) + p[16] * newdata$burnout +

p[17] * newdata$back

exp(HT) + exp(RF)

}), srcfile = <environment>, wholeSrcref = structure(c(1L, 0L,

6L, 0L, 0L, 0L, 1L, 6L), srcfile = <environment>, class = "srcref")

covariates:

age seasfix30 chgequity eit vint dr12 zato fico burnout back

parameters:

p[1] p[2] p[3] p[4] p[5] p[6] p[7] p[8] p[9] p[10] p[11] p[12] p[13] p[14] p[15] p[16] p[17]


Validation

This includes

1. predicting on newdata (a TEST set)

2. grouping predictions and actuals by key driver dimensions to assess performance

Here, newdata has 1 million rows including both out-of-sample and out-of-time observations.

The TRAIN set used only 10K observations.

[Figure: five validation panels, each plotting SMM with series actual, vint, and clpm.
Asset Date. x: Time.
Age. x: Loan Age.
LTV. x: Current LTV.
FICO. x: Current FICO.
EIT. x: EIT = note − market.]

Inference

• In addition to model selection criteria like AIC, BIC, and all that, it's useful to compute standard errors for model predictions.

• Standard errors are useful if TRAIN was not “too large” (bias vs. variance trade-off; “too large” sample => variance almost 0, leaving only bias)

• How to compute them for a nonlinear function f(β)? The delta method is a common approach, which gives a first-order approximation to the variance of a nonlinear function of the estimated coefficient vector β as

$$ h C h^T $$

where h = J_f(β) is the Jacobian of f and C = Cov(β) is the estimated covariance matrix (possibly robust). The standard error is the square root of this quantity (of its diagonal when f is vector-valued).

• Here, we used numerical derivatives to compute the Jacobian, although an analytic form is possible, albeit messy.
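A compact sketch of the delta method with a numerical Jacobian, on a toy Poisson regression rather than the presentation's model. The simulated data, the function f, and the step size eps are assumptions for illustration; predict.glm with se.fit = TRUE serves as a reference, since it applies the same first-order approximation on the response scale.

```r
set.seed(42)

## Toy fit (illustration only)
x <- rnorm(200)
y <- rpois(200, exp(0.2 + 0.4 * x))
fit <- glm(y ~ x, family = poisson)

## Nonlinear function of the coefficients: predicted mean at new x values
f <- function(beta, xnew) exp(beta[1] + beta[2] * xnew)
xnew <- c(-1, 0, 1)

b <- coef(fit)
C <- vcov(fit)

## Numerical Jacobian h (one column per coefficient), forward differences
eps <- 1e-6
h <- sapply(seq_along(b), function(j) {
  bj <- b
  bj[j] <- bj[j] + eps
  (f(bj, xnew) - f(b, xnew)) / eps
})

## diag(h C h^T) without forming the full matrix
se_delta <- sqrt(rowSums((h %*% C) * h))

## Reference standard errors from predict.glm
se_ref <- predict(fit, data.frame(x = xnew), type = "response",
                  se.fit = TRUE)$se.fit
cbind(se_delta, se_ref)
```

Computing rowSums((h %*% C) * h) avoids building the full matrix h C h^T, which matters when predicting on a large TEST set.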

[Figure: Actuals with prediction intervals. Delta-method implementation for prediction uncertainty. x: Time; y: SMM; series: pred ± 2se, actual, pred.]

Are conventional statistics anything other than misleading?

. . . conventional [frequentist] confidence limits correspond to posterior (Bayesian) limits under the absurd fantasy that the data are from a perfect randomized trial conducted on a random sample of the target population, and that any deviations from this ideal are inconsequential . . .

. . . These problems can be addressed by examining the entire distribution produced by a simulation, not just a pair of limits (since the choice of 95% limits is nothing more than a social convention), and by asking what other sources of uncertainty and information ought to be incorporated into the distribution before taking it seriously. (Greenland, 2004)

Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful. (Box and Draper, 1987)

6 Fast Grouping

Try the R package data.table for fast grouping, as an alternative to tapply

Weighted averages example

Using tapply

system.time(with(TEST,

tapply(y*zw,date,sum)/tapply(zw,date,sum)))

user system elapsed

2.198 0.097 2.296

Using data.table

system.time(TEST[, list(act=weighted.mean(y,zw)), by=date])

user system elapsed

0.078 0.010 0.089

TEST had 1 million rows and 48 columns. This is about 25 times faster.

tapply is inconvenient for weighted averages; see the error below

dat <- data.frame(x=rnorm(100),

w=rgamma(100,2),

fac=gl(3,1,len=100))

w1 <- with(dat,tapply(x,fac,weighted.mean,x,w))

## Error: ’x’ and ’w’ must have the same length

w1 <- with(dat,tapply(x*w,fac,sum)/tapply(w,fac,sum))

w1

1 2 3

0.2199 -0.1541 0.1462


data.table allows more natural syntax

require(data.table)

dat <- as.data.table(dat)

w2 <- dat[, list(avg=weighted.mean(x,w)), by=fac]

w2

fac avg

1: 1 0.2199

2: 2 -0.1541

3: 3 0.1462


Fast grouping

more generally . . .

dat <- as.data.table(dat)

## sort by V1,V2,...; no quotes

setkey(dat, V1, V2, ...)

## fast grouping

agg <- dat[, list(actual=weighted.mean(y,wt),

fit1=weighted.mean(fit1,wt),

...,

fitn=weighted.mean(fitn,wt)),

by=list(V1,...)]

More data.table details and examples are available at

http://cran.r-project.org/web/packages/data.table/vignettes/datatable-faq.pdf

A Code

Sandwich covariance matrix estimator

require(sandwich)  ## assumed: supplies the bread() and estfun() generics used below

sandwich <- function(x, ...) UseMethod("sandwich")

cheese <- function(x, ...) UseMethod("cheese")

sandwich.gnlm <- function (x, bread. = bread, cheese. = cheese, ...) {

bread. <- bread.(x)

cheese. <- cheese.(x, ...)

n <- length(x$fitted)

1/n * (bread. %*% cheese. %*% bread.)

}

bread.gnlm <- function(x) {

n <- length(x$fitted)

x$cov * n

}

cheese.gnlm <- function(x, ...) {

psi <- estfun(x,...)

n <- NROW(psi)

crossprod(as.matrix(psi))/n

}


## sandwich workhorse: estimating functions

estfun.gnlm <- function(x,...) {

wt <- x$prior.weights

p <- coef(x)

y <- eval(parse(text=x$respname))

distribution <- x$distribution

if (distribution == "binomial" || distribution == "double binomial" ||

distribution == "beta binomial" || distribution == "mult binomial") {

y <- cbind(y, 1 - y)

nn <- y[, 1] + y[, 2]

}

if (distribution == "double Poisson" || distribution == "mult Poisson")

my <- 3 * max(y)

n <- length(fitted(x))

mu1 <- function(p) x$mu(p)

if(!is.null(x$shape)) {

sh1 <- if(x$nps ==1)

function(p) p[(x$npl+1)]*rep(1,n)

else

function(p) x$shape(p[(x$npl+1):(x$npl+x$nps)])

}

## get the log-likelihood function used at estimation

loglik <- eval(parse(text=sub("sum\\(", "(",deparse(x$likefn))))

psi <- simpleJacobian(loglik,p,...)

psi

}


Simple Jacobian

## first derivatives using simple epsilon difference

simpleJacobian <- function(func,beta,eps=0.0001){

## evaluate 'func' at the coefficient vector 'beta'

f <- func(beta)

n <- length(beta)

h <- matrix(NA, length(f), n)

for(i in 1:n) {

p <- beta

p[i] <- p[i] + eps

h[,i] <- (func(p) - f)/eps

}

return(h)

}


Predict.gnlm and delta method for se.fit

predict.gnlm <- function(object,newdata=NULL,.name=NULL,se.fit=FALSE,empirical.cov=NULL,...){

require(gnlm)

if(missing(newdata)) return(object$fitted)

.ndata <- if(!is.null(.name)) .name else deparse(substitute(newdata))

newdata <- as.data.frame(newdata)

modl <- attr(object$mu, "model")

if(!is.expression(modl))

stop("attr(object$mu, 'model') must be an expression")

mu <- function(p) {}; body(mu) <- modl

mu <- fnenvir(mu, newdata, .name=.ndata)

pred <- mu(coef(object))

if(se.fit || !is.null(empirical.cov)) {

Sigma <- if(!is.null(empirical.cov)) empirical.cov else object$cov

Delta <- simpleJacobian(mu,coef(object),...)

## se <- sqrt(diag(Delta %*% Sigma %*% t(Delta)))

## to avoid huge NxN matrices, use this version instead

se <- drop((((Delta %*% Sigma) * Delta) %*% rep(1, ncol(Delta)))^0.5)

pred <- list(fit=pred,se.fit=se)

}

return(pred)

}


B References

G. E. P. Box and N. R. Draper. Empirical Model-Building and Response Surfaces. John Wiley, New York, 1987.

N. E. Breslow. Statistics in epidemiology: The case-control study. Journal of the American Statistical Association, 91(433):14–28, 1996. URL http://www.jstor.org/stable/2291379.

S. Greenland. Interval estimation by simulation as an alternative to and extension of confidence intervals. International Journal of Epidemiology, 33(6):1389–1397, 2004. URL http://ije.oxfordjournals.org/content/33/6/1389.full.pdf+html.

G. King and L. Zeng. Logistic regression in rare events data. Political Analysis, 9(2):137–163, 2001. URL http://gking.harvard.edu/files/abs/0s-abs.shtml.

J. K. Lindsey. Models for Repeated Measurements. Oxford University Press, 2 edition, 1999.
