Anatomy of a rare event model: sampling, estimation, and inference Jagat Sheth St. Louis RUG November, 2012


Anatomy of a rare event model: sampling, estimation, and inference

Jagat Sheth

St. Louis RUG

November, 2012


Abstract

This presentation will discuss typical stages in the workflow of modeling rare events. It will include a little theory along with R code for sampling, estimation, prediction, and fast grouping for ad hoc diagnostics.


Contents

1 Introduction

2 Workflow

3 R DBI
3.1 RODBC
3.2 Sampling queries

4 Design Weights
4.1 R code

5 Modeling
5.1 Specification
5.2 Estimation
5.3 Prediction etc.

6 Fast grouping

A Code

B References


1 Introduction

• Rare events: Rare event data is characterized by an abundance of nonevents that far exceeds the number of events, e.g., 10,000 events and 1,990,000 nonevents, yielding a minuscule 0.5% event rate

• Issues: Processing and analyzing all nonevents is costly in terms of computer time, resources, and the efficiency of the resulting rare event model

• Three-step solution: Use a designed sample

1. sample all events

2. sample down the nonevents to a multiple of the number of events

3. weight the reduced sample to reflect the population event rate
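The three steps can be sketched in R on simulated data. This is a minimal illustration under stated assumptions, not the production workflow: the data frame `pop_df`, its column `event`, the population size, and the 2:1 nonevent-to-event ratio are all made up for the example.

```r
## Minimal sketch of the three-step designed sample (simulated data;
## 'pop_df', 'event', and the 2:1 ratio are assumptions for illustration)
set.seed(1)
N <- 200000
pop_df <- data.frame(id = seq_len(N),
                     event = rbinom(N, 1, 0.005))   # roughly 0.5% event rate
K <- sum(pop_df$event)

## 1. sample all events
cases <- pop_df[pop_df$event == 1, ]

## 2. sample down the nonevents to a multiple (here 2x) of the events
nonevents <- pop_df[pop_df$event == 0, ]
controls  <- nonevents[sample(nrow(nonevents), 2 * K), ]
smp <- rbind(cases, controls)

## 3. weight the reduced sample to reflect the population event rate
w0 <- (N - K) / nrow(controls)     # each sampled nonevent stands for w0 nonevents
smp$wt <- ifelse(smp$event == 1, 1, w0)

## the weighted event rate of the sample matches the population event rate
rate_pop <- K / N
rate_smp <- weighted.mean(smp$event, smp$wt)
```

The sample holds only three times the number of events, yet the weighted event rate reproduces the population rate.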


The three-step approach is easy to understand; however, this approach can only be useful with appropriate statistical corrections.

Here we’ll discuss a little theory along with simple R code for

• rare event sampling via RODBC + sample design weighting

• estimation and inference for modeling and prediction

– non-linear models via nlm and Jim Lindsey’s gnlm

– robust standard errors via sandwich

– frequentist vs. Bayesian

• fast grouping for ad hoc diagnostics via R package data.table


2 Data + modeling workflow

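[Figure: workflow diagram. Database → (R DBI, e.g. RODBC) → stratified random sample → small subset (TRAIN) and big subset (TEST). TRAIN → (R's nlm) → Model; Model → (predict) → TEST; TEST → (R's data.table fast grouping) → X1, ..., Xn; an "update" arrow leads back to the Model.]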


E.g. Company XYZ currently uses an IBM DB2 V9.2 database server

Table      Total     Events     Rate
Big data   848 Mil   13.6 Mil   1.6%

Table has 200+ columns holding monthly event history data since Jan 2000.

On a typical day, it takes approx. 1 hr to pull all events, and 2 hr to randomly sample twice that number of non-events.


3 R Database Interface (DBI)

• Several options exist for communication between R and relational database management systems

– RODBC, RMySQL, ROracle, RPostgreSQL, RSQLite, RJDBC, RMonetDB, etc.

• Here, we used RODBC to talk to an IBM DB2 database server

Note: Installation of RODBC requires an ODBC Driver Manager. Windows normally comes with one. Mac OS X since 10.2 has shipped with iODBC. But for other systems the driver manager of choice is unixODBC with sources downloadable from http://www.unixodbc.org.


3.1 RODBC

## Open connection to your ODBC database

require(RODBC)

con <- odbcConnect(dsn, uid = "xxx", pwd = "xxx", ...)

## Submit an SQL query

pop <- sqlQuery(con, getTotals)

cases <- sqlQuery(con, getStrata(1))

controls <- sqlQuery(con, getStrata(0,frac=0.03))

res <- rbind(cases, controls)

close(con)

Build up SQL query using paste in R


3.2 Sampling queries

getTotals <- paste("SELECT COUNT(*) as total,",

"SUM(event) as cases",

"FROM ... WHERE ...")

getStrata <- function(val=1, frac=NULL){

myQuery <- paste("SELECT", ...,

"FROM", ...,

"WHERE", "event=", val, "AND ...")

if(!is.null(frac))

myQuery <- paste(myQuery, "AND RAND() <=", frac)

myQuery

}

RAND() can be replaced with something more efficient if available on your DB.
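As an illustration, the query builder can be filled in with concrete names. The table `loans`, the columns `id`, `event`, and `asof`, and the date filter are hypothetical placeholders; only the `paste` pattern and the `RAND() <=` filter come from the slide.

```r
## Hypothetical, filled-in version of the query builder above.
## Table and column names (loans, id, event, asof) are made up.
getStrataDemo <- function(val = 1, frac = NULL) {
  myQuery <- paste("SELECT id, event, asof",
                   "FROM loans",
                   "WHERE event =", val,
                   "AND asof >= '2000-01-01'")
  ## keep each row of the stratum with probability 'frac'
  if (!is.null(frac))
    myQuery <- paste(myQuery, "AND RAND() <=", frac)
  myQuery
}

q_cases    <- getStrataDemo(1)
q_controls <- getStrataDemo(0, frac = 0.03)
cat(q_controls, "\n")
```

Because `RAND() <= frac` keeps each row independently, the number of sampled nonevents is random (about `frac` times the stratum size). The design weights use the realized counts n and k, so the sample size does not need to be exact.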


R function to help

require(sqlsurvey)

sqlsubst("SELECT %%vars%% FROM %%table%% GROUP BY %%strat%%",

list(vars=varnames, table=tablequery, strat=strata))

from package sqlsurvey (experimental on R-forge)


4 Design Weights

The design weights represent the inverse probability of unit selection into the sample derived from stratified simple random sampling. If a population of size N is stratified by a binary variable Y with K ones, and sample size is n with k ones, then

$$s_1 = P(\text{unit in case-control} \mid Y = 1) = 1$$

$$s_0 = P(\text{unit in case-control} \mid Y = 0) = \frac{n - k}{N - K}$$

and

$$w_1 = 1/s_1 = 1 \tag{1}$$

$$w_0 = 1/s_0 = \frac{N - K}{n - k} \tag{2}$$

The design weights can be generated by a single coding line

$$w_i = w_1 Y_i + w_0 (1 - Y_i) \tag{3}$$


4.1 Simple R code

## Design weights for case-control

N <- pop$TOTAL

K <- pop$CASES

n <- nrow(res)

k <- sum(res$event)

res$wt <- with(res,1*event+(N-K)/(n-k)*(1-event))
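A quick numeric check of the weights, using the counts from the Introduction (10,000 events, 1,990,000 nonevents) and assuming, for illustration only, that nonevents are sampled down to twice the number of events:

```r
## Worked check of the design weights with the Introduction's counts.
## The 2:1 sampling ratio is an assumption for this example.
N <- 2000000; K <- 10000          # population size and number of events
k <- K                            # all events are sampled
n <- k + 2 * k                    # events plus twice as many nonevents

w1 <- 1
w0 <- (N - K) / (n - k)           # 1,990,000 / 20,000

## weighted event rate of the sample
rate_w <- (w1 * k) / (w1 * k + w0 * (n - k))
c(w0 = w0, weighted_rate = rate_w, population_rate = K / N)
```

Each sampled nonevent stands in for 99.5 population nonevents, and the weighted event rate is back at the population's 0.5%.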


Alternatively, one can use normalized weights (which sum to n)

$$w_1 = \frac{\tau}{\bar{y}} \quad \text{and} \quad w_0 = \frac{1 - \tau}{1 - \bar{y}}$$

where $\tau = K/N$ and $\bar{y} = k/n$ are the fractions of events in the population and sample, respectively.

tau <- K/N; ybar <- k/n

See (King and Zeng, 2001) or (Breslow, 1996) for further details.
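A short sketch of the normalized weights in R, in the notation of this section. The counts and the stand-in `event` vector are illustrative assumptions, not values from the slides.

```r
## Normalized weights; counts below are illustrative assumptions.
N <- 2000000; K <- 10000
n <- 30000;   k <- 10000
event <- rep(c(1, 0), times = c(k, n - k))   # stand-in for res$event

tau  <- K / N                                # population event fraction
ybar <- k / n                                # sample event fraction
wt_norm <- (tau / ybar) * event + ((1 - tau) / (1 - ybar)) * (1 - event)

sum(wt_norm)                                 # equals n

## same as the design weights rescaled by n/N (when all events are sampled)
wt_design <- 1 * event + (N - K) / (n - k) * (1 - event)
all.equal(wt_norm, wt_design * n / N)
```

The two weightings differ only by the constant factor n/N, so they give the same point estimates; the normalized version keeps the weighted sample size equal to the actual sample size.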


5 Modeling Workflow

• Recall we are using the stratified case-control sample in conjunction with its associated design weights as an efficient representation of the (much larger) population

• Regressions, aggregations, and summary statistics are now WEIGHTED by the design weight

• Iterative process of model development

– an initial model specification is estimated or updated on a (relatively) small TRAINING set

– score models on a larger (out-of-sample and possibly out-of-time) TEST set

– group (aggregate) scores and actuals to assess performance by several key driver dimensions X1, . . . , Xn

– iterate these steps until satisfied, making sure model strengths and weaknesses are well understood


[Figure: modeling loop. TRAIN → (R's nlm) → Model; Model → (predict) → TEST; TEST → (R's data.table fast grouping) → X1, ..., Xn; an "update" arrow leads back to the Model.]

ALL MODELS ARE WRONG BUT SOME MAY BE USEFUL

George E.P. Box


5.1 Model specification

Weighted Log-likelihood

Let $f(y; \mu, \theta)$ be the probability density for Y with location parameter $\mu$. Binomial or Poisson are typical for rare events. Model the location parameter as

$$\mu_i = g(\beta, x_i) \tag{4}$$

for some (parametric) function g, drivers $x_i$, and coefficients $\beta$. Then use maximum likelihood to estimate $\beta$ (and $\theta$ if f needs it; not required for binomial or Poisson).

Maximize the weighted log-likelihood

$$\log L_w(\beta, \theta \mid y) = \sum_i w_i \log f(y_i; \mu_i, \theta)$$

where the weights $w_i$ are given in Equation 3 and log is the natural log.
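For concreteness, this is how the weighted log-likelihood specializes for the two distributions named above, using the standard Poisson and Bernoulli densities with $\mu_i = g(\beta, x_i)$ (neither needs $\theta$). It is a worked illustration, not taken from the slides.

```latex
% Poisson: f(y;\mu) = e^{-\mu}\mu^{y}/y!
\log L_w(\beta \mid y) = \sum_i w_i \left( y_i \log \mu_i - \mu_i - \log y_i! \right)

% Bernoulli (binomial with one trial): f(y;\mu) = \mu^{y}(1-\mu)^{1-y}
\log L_w(\beta \mid y) = \sum_i w_i \left( y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i) \right)
```

With all $w_i = 1$ these reduce to the ordinary log-likelihoods.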


Advantages

• All that needs to be done is calculate wi in Equation 3, choose it as the weight in your computer program, and then run a model, e.g. logistic regression.

• Weighting can outperform alternative procedures such as prior correction when both a large sample is available and the functional form is misspecified.

Disadvantages

• The usual method of computing standard errors is severely biased.

One way to address this disadvantage is by using empirical design-based standard errors, e.g. the Huber-White sandwich covariance estimator or a complex survey design package.
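The Huber-White idea can be sketched in base R for a weighted logistic regression on simulated data. This is not the `sandwich.gnlm` code from the appendix; the data, the weights, and the use of `glm` are assumptions for illustration.

```r
## Sketch: model-based vs. sandwich standard errors for a weighted logit.
set.seed(42)
n <- 3000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.8 * x))
w <- ifelse(y == 1, 1, 5)                 # illustrative weights only

## quasibinomial gives the same point estimates as binomial but does not
## warn about non-integer weights
fit <- glm(y ~ x, family = quasibinomial, weights = w)
X   <- model.matrix(fit)
p   <- fitted(fit)

## bread: inverse of the weighted information matrix X' diag(w p (1-p)) X
bread <- solve(crossprod(X * sqrt(w * p * (1 - p))))
## meat: sum of outer products of the weighted score contributions
score <- X * (w * (y - p))
meat  <- crossprod(score)

cov.robust <- bread %*% meat %*% bread
se.robust  <- sqrt(diag(cov.robust))
se.model   <- sqrt(diag(bread))           # treats weights as frequencies
cbind(estimate = coef(fit), se.model, se.robust)
```

The model-based column treats each weight as a count of replicated observations, which is why it can be badly off; the sandwich column uses the observed score variability instead.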


Example spec of a mortgage prepayment model

Two-component model for an FHA portfolio, with each component modeled as a multiplicative hazard assuming a Poisson distribution.

spec <- function(p){

HT <- p[1] + ll.haz(age, p[2]) + log(seasfix30) +

mono.haz(chgequity,p[7],p[8]) + p[15]*lockin

RF <- p[3] + log(expit(eit, p[4],p[5])) +

ll.haz(age,p[6]) + p[9]*(vint==1) + p[10]*(vint==2) +

p[11]*dr12 + p[12]*log(zato) +

log(expit(fico,p[13],p[14])) +

p[16]*burnout + p[17]*back

exp(HT) + exp(RF)

}

HT = ‘housing turnover’, RF = ‘refinance’


[Figure: four panels of model component shapes.
RF: S-curve. x-axis: economic incentive at time t (EIT = NOTE − MKT); y-axis: prepay probability (SMM).
HT: Aging curve. x-axis: loan age (loan term of 360 months); y-axis: prepay probability (SMM).
RF: Aging multiplier. x-axis: loan age; y-axis: multiplier (normalized at age=30).
HT: Chgequity multiplier. x-axis: chgequity = oltv/cltv; y-axis: multiplier (normalized at chgequity=1.25).]


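## log hazard for log-logistic distribution

## t = time, m = location, s = scale; s has no default, so it must be

## supplied whenever ll.haz is called

ll.haz <- function(t,m,s){

-(log(t) + log(1 + exp(m/s)*t^(-1/s)))

}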

## monotonic hazard

mono.haz <- function(t,a,b){

- exp(a)*exp(-exp(b)*t)

}

## inverse-logit (S-curve)

expit <- function(x,a,b) {

1/(1 + exp(-(a + b*x)))

}


5.2 Estimation via nlm

• The model spec is highly nonlinear, therefore neither linear nor generalized linear model methods of estimation will apply.

• Here, spec is parametric and relatively parsimonious, so maximum likelihood estimation was feasible.

• The basic procedure for obtaining maximum likelihood estimates of the unknown parameters in the likelihood function is to solve the score equations obtained by setting the first derivative of the (weighted) log likelihood function, with respect to these parameters, equal to zero.

• The most common iterative procedure to solve them is some variation of Newton-Raphson, which requires the second derivatives of the log likelihood function.

• R’s nlm function carries out a minimization using a Newton-type algorithm and calculates both sets of derivatives numerically, so that only the (negative) log likelihood need be supplied.
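The points above can be illustrated with a small weighted Poisson regression fitted by `nlm`. The simulated data, the weights, and the log-linear form of the mean are assumptions for the sketch; the actual model spec in this talk is far more nonlinear.

```r
## Sketch: weighted maximum likelihood with nlm() on simulated data.
## Only the negative weighted log-likelihood is supplied; nlm()
## approximates the derivatives numerically.
set.seed(7)
n <- 2000
x <- rnorm(n)
y <- rpois(n, exp(-1 + 0.5 * x))
w <- runif(n, 0.5, 2)                     # illustrative weights

negloglik <- function(beta) {
  mu <- exp(beta[1] + beta[2] * x)        # mu_i = g(beta, x_i)
  -sum(w * dpois(y, mu, log = TRUE))      # minus the weighted log-likelihood
}

fit.nlm <- nlm(negloglik, p = c(0, 0), hessian = TRUE)
fit.nlm$estimate                          # maximum likelihood estimates
cov.model <- solve(fit.nlm$hessian)       # model-based covariance

## cross-check against glm(), which maximizes the same weighted likelihood
fit.glm <- glm(y ~ x, family = poisson, weights = w)
```

Because this toy mean function is log-linear, `glm` can serve as a check on the `nlm` estimates; for a spec like the prepayment model no such shortcut exists, which is the reason for the general-purpose optimizer.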


Used Jim Lindsey’s gnlm package, which provides a variety of functions to fit linear and nonlinear regression with a large selection of distributions. Also see (Lindsey, 1999).

attach(train)

zp <- gnlr(y,"Poisson",mu=spec,pmu=...,wt=...)

cov.empirical <- sandwich(zp)

se.empirical <- sqrt(diag(cov.empirical))

detach(train)


cbind(estimate=zp$coef, se.model=zp$se, se.empirical)

estimate se.model se.empirical

p[1] 0.33801 0.58616 0.168391

p[2] 2.39070 0.71086 0.172019

p[3] -0.64416 0.80647 0.282599

p[4] -2.42126 0.76439 0.274128

p[5] 2.32347 0.95267 0.309260

p[6] 2.81824 0.29312 0.088307

p[7] 5.55685 2.18565 0.618194

p[8] 1.43363 0.49618 0.137823

p[9] 1.96005 0.41008 0.153742

p[10] 1.09278 0.41019 0.154422

p[11] -2.43484 2.12904 0.757065

p[12] 1.01519 0.26680 0.108856

p[13] -8.41495 6.55427 2.239078

p[14] 9.88181 8.23264 2.835693

p[15] 0.27840 0.67309 0.162903

p[16] -0.01251 0.01226 0.004041

p[17] -0.15452 0.37606 0.128069


5.3 Prediction, Validation, and Inference

Prediction

• Generate model’s predictions on newdata, e.g. a TEST set

• Unlike objects of class ‘lm’, ‘glm’ there is no predict.gnlm

• Do this from scratch by evaluating an R call or expression of this model spec, in the environment newdata (a data frame), along with an optional enclosure

encl <- list2env(list(p=coef(zp)))

pred <- eval(body(spec), envir=newdata, enclos=encl)

Note: eval copies newdata into a temporary environment (with enclosure ‘enclos’), and the temporary environment is used for evaluation.
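A toy version of the same mechanism shows how the lookup works. The two-parameter `spec`, the `newdata` values, and the coefficients are made up for illustration; in the real workflow the coefficients would come from the fitted model.

```r
## Toy version of the eval()-based prediction; 'spec', 'newdata', and the
## coefficient values are made up for illustration.
spec <- function(p) {
  eta <- p[1] + p[2] * age            # 'age' is not an argument: it is
  exp(eta)                            # looked up in the data at eval time
}
newdata <- data.frame(age = c(12, 24, 36))
phat    <- c(-4, 0.01)                # stand-in for the fitted coefficients

encl <- list2env(list(p = phat))      # enclosure holding the coefficients
pred <- eval(body(spec), envir = newdata, enclos = encl)
pred
```

Names are resolved first in the columns of `newdata`, then in the enclosure, so `age` comes from the data and `p` from `encl`.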


• Alternatively, try fnenvir from package rmutil

func <- fnenvir(spec,.envir=newdata)

pred <- func(p=coef(zp))

fnenvir finds the covariates and parameters in a function and can modify it so that the covariates used in it are found in the data object specified by .envir.

This avoids copying newdata into a temporary environment for evaluation.


fnenvir(spec, .envir=newdata)

model function:

structure({

HT <- p[1] + ll.haz(newdata$age, p[2]) + log(newdata$seasfix30) +

mono.haz(newdata$chgequity, p[7], p[8]) + p[15] * lockin

RF <- p[3] + log(expit(newdata$eit, p[4], p[5])) + ll.haz(newdata$age,

p[6]) + p[9] * (newdata$vint == 1) + p[10] * (newdata$vint ==

2) + p[11] * newdata$dr12 + p[12] * log(newdata$zato) +

log(expit(newdata$fico, p[13], p[14])) + p[16] * newdata$burnout +

p[17] * newdata$back

exp(HT) + exp(RF)

}), srcfile = <environment>, wholeSrcref = structure(c(1L, 0L,

6L, 0L, 0L, 0L, 1L, 6L), srcfile = <environment>, class = "srcref")

covariates:

age seasfix30 chgequity eit vint dr12 zato fico burnout back

parameters:

p[1] p[2] p[3] p[4] p[5] p[6] p[7] p[8] p[9] p[10] p[11] p[12] p[13] p[14] p[15] p[16] p[17]


Validation

This includes

1. predicting on newdata (a TEST set)

2. grouping predictions and actuals by key driver dimensions to assess performance

Here, newdata has 1 million rows including both out-of-sample and out-of-time observations.

TRAIN set used only 10K observations.


[Figure: SMM by key driver in five panels, with legend entries actual, vint, and clpm. Panels: Asset Date (x-axis: Time), Age (x-axis: Loan Age), LTV (x-axis: Current LTV), FICO (x-axis: Current FICO), EIT (x-axis: EIT = note − market).]


Inference

• In addition to model selection criteria like AIC, BIC, and all that, it's useful to compute standard errors for model predictions.

• Standard errors are useful if TRAIN was not “too large” (bias vs. variance trade-off; “too large” sample => variance almost 0, leaving only bias)

• How to compute for a nonlinear function f(β)? The delta method is a common approach, which gives a first-order approximate covariance for a nonlinear function of the estimated coefficient vector β as follows

$$h C h^T$$

where $h = J_f(\beta)$ is the Jacobian of f and $C = \mathrm{Cov}(\beta)$ is the estimated covariance matrix (possibly robust); the standard errors are the square roots of its diagonal.

• Here, we used numerical derivatives to compute the Jacobian, although an analytic one is possible, albeit messy.
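A compact sketch of the delta method with a numerical Jacobian. The function `f`, the coefficient values, the covariance matrix, and the helper name `num.jacobian` are made up for illustration.

```r
## Sketch of the delta method with a numerical Jacobian (toy inputs).
f <- function(beta, X) drop(exp(X %*% beta))   # nonlinear function of beta

num.jacobian <- function(func, beta, eps = 1e-6, ...) {
  f0 <- func(beta, ...)
  J  <- matrix(NA_real_, length(f0), length(beta))
  for (j in seq_along(beta)) {
    b <- beta
    b[j] <- b[j] + eps
    J[, j] <- (func(b, ...) - f0) / eps        # forward difference
  }
  J
}

beta.hat <- c(-2, 0.5)                          # made-up estimates
C <- matrix(c(0.04, -0.01, -0.01, 0.09), 2, 2)  # made-up covariance
X <- cbind(1, c(0, 1, 2))

h <- num.jacobian(f, beta.hat, X = X)
se.full <- sqrt(diag(h %*% C %*% t(h)))         # builds an N x N matrix
se.fast <- sqrt(rowSums((h %*% C) * h))         # same values, row by row
cbind(fit = f(beta.hat, X), se.full, se.fast)
```

Only the diagonal of the N x N product is needed, so the row-wise form avoids building the full matrix when there are many prediction rows.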


[Figure: Actuals with prediction intervals. x-axis: Time; y-axis: SMM; series: actual, pred, pred ± 2se. Delta-method implementation for prediction uncertainty.]


Are conventional statistics anything other than misleading?

. . . conventional [frequentist] confidence limits correspond to posterior (Bayesian) limits under the absurd fantasy that the data are from a perfect randomized trial conducted on a random sample of the target population, and that any deviations from this ideal are inconsequential . . .

. . . These problems can be addressed by examining the entire distribution produced by a simulation, not just a pair of limits (since the choice of 95% limits is nothing more than a social convention), and by asking what other sources of uncertainty and information ought to be incorporated into the distribution before taking it seriously. (Greenland, 2004)

Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful. (Box and Draper, 1987)


6 Fast Grouping

Try R package data.table for fast grouping and an alternative to tapply

Weighted averages example

Using tapply

system.time(with(TEST,

tapply(y*zw,date,sum)/tapply(zw,date,sum)))

user system elapsed

2.198 0.097 2.296

Using data.table

system.time(TEST[, list(act=weighted.mean(y,zw)), by=date])

user system elapsed

0.078 0.010 0.089

TEST had 1 million rows and 48 columns. This is about 25 times faster.

tapply is inconvenient for weighted averages; see the error below

dat <- data.frame(x=rnorm(100),

w=rgamma(100,2),

fac=gl(3,1,len=100))

w1 <- with(dat,tapply(x,fac,weighted.mean,x,w))

## Error: ’x’ and ’w’ must have the same length

w1 <- with(dat,tapply(x*w,fac,sum)/tapply(w,fac,sum))

w1

1 2 3

0.2199 -0.1541 0.1462


data.table allows more natural syntax

require(data.table)

dat <- as.data.table(dat)

w2 <- dat[, list(avg=weighted.mean(x,w)), by=fac]

w2

fac avg

1: 1 0.2199

2: 2 -0.1541

3: 3 0.1462


Fast grouping

more generally . . .

dat <- as.data.table(dat)

## sort by V1,V2,...; no quotes

setkey(dat, V1, V2, ...)

## fast grouping

agg <- dat[, list(actual=weighted.mean(y,wt),

fit1=weighted.mean(fit1,wt),

...,

fitn=weighted.mean(fitn,wt)),

by=list(V1,...)]

More data.table details and examples available at

http://cran.r-project.org/web/packages/data.table/vignettes/datatable-faq.pdf


A Code

Sandwich covariance matrix estimator

## S3 generics; bread() and estfun() are assumed to come from package sandwich

sandwich <- function(x, ...) UseMethod("sandwich")

cheese <- function(x, ...) UseMethod("cheese")

sandwich.gnlm <- function (x, bread. = bread, cheese. = cheese, ...) {

bread. <- bread.(x)

cheese. <- cheese.(x, ...)

n <- length(x$fitted)

1/n * (bread. %*% cheese. %*% bread.)

}

bread.gnlm <- function(x) {

n <- length(x$fitted)

x$cov * n

}

cheese.gnlm <- function(x, ...) {

psi <- estfun(x,...)

n <- NROW(psi)

crossprod(as.matrix(psi))/n

}


## sandwich workhorse: estimating functions

estfun.gnlm <- function(x,...) {

wt <- x$prior.weights

p <- coef(x)

y <- eval(parse(text=x$respname))

distribution <- x$distribution

if (distribution == "binomial" || distribution == "double binomial" ||

distribution == "beta binomial" || distribution == "mult binomial") {

y <- cbind(y, 1 - y)

nn <- y[, 1] + y[, 2]

}

if (distribution == "double Poisson" || distribution == "mult Poisson")

my <- 3 * max(y)

n <- length(fitted(x))

mu1 <- function(p) x$mu(p)

if(!is.null(x$shape)) {

sh1 <- if(x$nps ==1)

function(p) p[(x$npl+1)]*rep(1,n)

else

function(p) x$shape(p[(x$npl+1):(x$npl+x$nps)])

}

## get the log-likelihood function used at estimation

loglik <- eval(parse(text=sub("sum\\(", "(",deparse(x$likefn))))

psi <- simpleJacobian(loglik,p,...)

psi

}


Simple Jacobian

## first derivatives using simple epsilon difference

simpleJacobian <- function(func,beta,eps=0.0001){

## evaluate 'func' at the coefficient vector 'beta'

f <- func(beta)

n <- length(beta)

h <- matrix(NA, length(f), n)

for(i in 1:n) {

p <- beta

p[i] <- p[i] + eps

h[,i] <- (func(p) - f)/eps

}

return(h)

}


Predict.gnlm and delta method for se.fit

predict.gnlm <- function(object,newdata=NULL,.name=NULL,se.fit=FALSE,empirical.cov=NULL,...){

require(gnlm)

if(missing(newdata)) return(object$fitted)

.ndata <- if(!is.null(.name)) .name else deparse(substitute(newdata))

newdata <- as.data.frame(newdata)

modl <- attr(object$mu, "model")

if(!is.expression(modl))

stop("attr(object$mu, 'model') must be an expression")

mu <- function(p) {}; body(mu) <- modl

mu <- fnenvir(mu, newdata, .name=.ndata)

pred <- mu(coef(object))

if(se.fit || !is.null(empirical.cov)) {

Sigma <- if(!is.null(empirical.cov)) empirical.cov else object$cov

Delta <- simpleJacobian(mu,coef(object),...)

## se <- sqrt(diag(Delta %*% Sigma %*% t(Delta)))

## to avoid huge NxN matrices, use this version instead

se <- drop((((Delta %*% Sigma) * Delta) %*% rep(1, ncol(Delta)))^0.5)

pred <- list(fit=pred,se.fit=se)

}

return(pred)

}


B References

G. E. P. Box and N. R. Draper. Empirical Model-Building and Response Surfaces. John Wiley, New York, 1987.

N. E. Breslow. Statistics in epidemiology: The case-control study. Journal of the American Statistical Association, 91(433):14–28, 1996. URL http://www.jstor.org/stable/2291379.

S. Greenland. Interval estimation by simulation as an alternative to and extension of confidence intervals. International Journal of Epidemiology, 33(6):1389–1397, 2004. URL http://ije.oxfordjournals.org/content/33/6/1389.full.pdf+html.

G. King and L. Zeng. Logistic regression in rare events data. Political Analysis, 9(2):137–163, 2001. URL http://gking.harvard.edu/files/abs/0s-abs.shtml.

J. K. Lindsey. Models for Repeated Measurements. Oxford University Press, 2nd edition, 1999.
