
Statistical Analysis with Big Data

Dr. Fred Oswald
Rice University

CARMA webcast
November 6, 2015

University of South Florida - Tampa, FL


Why should I-O and business researchers be interested in big data?

Witness stories about “big data” in HR and personnel selection, with little to no mention of the (DECADES) of personnel assessment expertise in HR and I-O psychology…


Adding insult to injury…


Overview

• Run LASSO regression and random forests, two predictive models that are relatively novel to organizational research and useful for big data (and small data).

• Demonstrate a philosophy: fit flexibly to improve prediction, but don't overfit.

• Discuss four ideas/implications related to predictive models.


Example 1: LASSO

Many R packages exist to perform 'big data' types of analyses, such as the LASSO (Least Absolute Shrinkage and Selection Operator).

LASSO not only estimates regression weights, as 'normal' (OLS) regression does; it also shrinks the weights, setting many of them exactly to zero.

Yeehaw!
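The talk doesn't name a specific package, but glmnet is one widely used R implementation of the LASSO; here is a minimal sketch on made-up data showing how many weights come out exactly zero:

```r
# Minimal LASSO sketch with the glmnet package (one of many R options);
# the data are simulated solely for illustration.
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 12), nrow = 100, ncol = 12)  # 12 predictors
y <- 0.5 * x[, 1] - 0.3 * x[, 2] + rnorm(100)        # criterion

fit <- glmnet(x, y, alpha = 1)  # alpha = 1 requests the LASSO penalty
coef(fit, s = 0.1)              # at lambda = .1, many weights are exactly 0
```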


LASSO

Check out the coefficients across the range of values of the "tuning parameter," lambda.

You can see where LASSO regression consistently selects predictive variables and excludes others.

Yeehaw!
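Assuming a glmnet fit like the sketch above, its built-in plot method draws exactly this kind of coefficient trace:

```r
# Coefficient trace across lambda, as in the slide's figure (simulated data).
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 12), 100, 12)
y <- 0.5 * x[, 1] - 0.3 * x[, 2] + rnorm(100)
fit <- glmnet(x, y, alpha = 1)

plot(fit, xvar = "lambda", label = TRUE)  # one curve per predictor's weight
```

Predictors whose curves stay at zero across most of the lambda range are the ones LASSO excludes.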


LASSO

Let's see what variables tend to predict the job-search behaviors of employed managers (Boudreau, 2001).

12 predictors of job search [you would use actual data – but I simulated 1,000 cases here]
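A hypothetical version of that simulation (the variable names, effect sizes, and coding below are invented for illustration, not taken from Boudreau, 2001):

```r
# Simulate 1,000 cases on 12 predictors of job-search behavior.
# Names, effect sizes, and coding are hypothetical stand-ins.
library(glmnet)

set.seed(2015)
n <- 1000
x <- matrix(rnorm(n * 12), n, 12)
colnames(x) <- c("job_sat", "compensation", "gender", paste0("pred", 4:12))
x[, "gender"] <- rbinom(n, 1, 0.5)  # hypothetical coding: 1 = female

# less satisfaction and less pay -> more search; female -> more search
search <- -0.4 * x[, "job_sat"] - 0.3 * x[, "compensation"] +
           0.2 * x[, "gender"] + rnorm(n)

fit <- glmnet(x, search, alpha = 1)
```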


[Figure: the trace of all LASSO solutions as lambda varies, running from "nothin'" (all weights shrunk to zero) to the full OLS solution. Considering all other predictors at a given value of lambda, three predictors are consistently selected: job satisfaction (less satisfied → more search), compensation (less pay → more search), and gender (female → more search).]

Least Angle Regression = LARS
(the graph above = the trace of all LASSO solutions)

I think of LARS as Tiptoe Regression™, as it is more cautious than stepwise regression…

• Start with all coefficients = 0.
• The predictor with the highest validity is the first one entered.
• But now don't step, tiptoe… Increase its regression weight from 0 toward its 'normal' regression value until one of the other predictors correlates with the residual (y − ŷ) just as highly.
• Enter that predictor next.
• Now move the weights of those two predictors toward their 'normal' regression solution, until a third predictor correlates equally with the residual. Enter that one.
• Keep goin' until all predictors are entered.

This method efficiently provides all LASSO solutions!
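The lars package implements this in R; a minimal sketch on simulated data (the talk doesn't show its code, so everything below is illustrative):

```r
# LARS via the lars package: the whole LASSO solution path in one pass.
library(lars)

set.seed(1)
x <- matrix(rnorm(100 * 12), 100, 12)
y <- 0.5 * x[, 1] - 0.3 * x[, 2] + rnorm(100)

path <- lars(x, y, type = "lasso")  # or type = "lar" for plain LARS steps
plot(path)                          # the trace of all LASSO solutions
```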

LASSO is swell, but also check out… elastic net!

When several predictors are correlated, LASSO will tend to select one of them rather than the group.

The elastic net will encourage selecting the group of predictors in this case:

Elastic net encourages parsimony (like LASSO) yet also tries to select those groups of related variables when they are predictive (like ridge regression).

[Diagram: the elastic net shown as a blend spanning OLS regression, ridge regression, and the LASSO.]
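In glmnet, the same function covers this whole spectrum through its alpha argument; a sketch with a deliberately correlated block of simulated predictors:

```r
# Elastic net sketch: alpha = 0 is ridge, alpha = 1 is the LASSO, and
# values in between blend the two. Data are simulated for illustration.
library(glmnet)
library(MASS)  # for mvrnorm()

set.seed(1)
Sigma <- matrix(0.8, 5, 5); diag(Sigma) <- 1        # 5 correlated predictors
x <- cbind(mvrnorm(200, rep(0, 5), Sigma),
           matrix(rnorm(200 * 7), 200, 7))          # plus 7 noise predictors
y <- 0.2 * rowSums(x[, 1:5]) + rnorm(200)

enet <- glmnet(x, y, alpha = 0.5)  # a 50/50 ridge-LASSO blend
coef(enet, s = 0.1)                # tends to keep the correlated group together
```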


In general: Cross-validation

As mentioned, other weights will work better across other samples (e.g., unit weights, but…something better?)

How to find out?
o Train the model on a given set of data (develop the weights)
o Test the model on a fresh set of data (apply the weights to new data; how good are predictions?)
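A minimal holdout sketch of that train/test logic, again on simulated data:

```r
# Develop the weights in a training half; test them on a fresh half.
library(glmnet)

set.seed(1)
x <- matrix(rnorm(1000 * 12), 1000, 12)
y <- 0.5 * x[, 1] - 0.3 * x[, 2] + rnorm(1000)

train <- sample(1000, 500)                         # random training half
fit <- glmnet(x[train, ], y[train], alpha = 1)     # develop the weights

yhat <- predict(fit, newx = x[-train, ], s = 0.1)  # apply to new data
cor(yhat, y[-train])                               # how good are predictions?
```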


LASSO: k-fold Cross-validation

o Train the model on a given set of data (develop the weights)

o Test the model on a fresh set of data (apply the weights to new data; how good are predictions?)

10-fold cross-validation:
o Divide the data into 10 equal parts or "folds" (randomly, or stratified random)
o Develop the model on one fold; test the model on the rest
o Do this for each fold, so that each fold participates in generating model predictions
o Average the 9 predicted results for each case
o [demonstrate w/ LASSO regression]
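For the LASSO, cv.glmnet automates k-fold cross-validation (in its standard form: develop on k − 1 folds, test on the held-out fold); a sketch, since the demo's actual code isn't in the transcript:

```r
# 10-fold cross-validated LASSO on simulated data.
library(glmnet)

set.seed(1)
x <- matrix(rnorm(1000 * 12), 1000, 12)
y <- 0.5 * x[, 1] - 0.3 * x[, 2] + rnorm(1000)

cvfit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)
plot(cvfit)                    # cross-validated error across lambda
cvfit$lambda.min               # lambda with the lowest CV error ("optimal")
cvfit$lambda.1se               # a simpler model within 1 SE of the best
coef(cvfit, s = "lambda.1se")  # the weights at that simpler solution
```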


LASSO k-fold Cross-validation

[Plot: cross-validated error across the tuning parameter, with the corresponding number of predictors along the top; one line marks the optimal solution and another a simpler one.]


Example 2: Random Forests

First look at trees, used to classify data:

[Figure: an example tree. Each node splits cases into high (> X) and low (≤ X) branches on a predictor (cognitive ability, conscientiousness, biodata, teamwork, openness), ending in leaves with predicted task performance scores: 3.2, 3.6, 3.8, 4.1, 4.3, and 5.6.]
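A sketch of such a tree with the rpart package; the variable names and data below are invented to mirror the slide's example:

```r
# Regression tree: recursive high/low splits ending in predicted scores.
library(rpart)

set.seed(1)
n <- 500
dat <- data.frame(cognitive = rnorm(n), consc = rnorm(n), biodata = rnorm(n),
                  teamwork = rnorm(n), openness = rnorm(n))
dat$task_perf <- 4 + 0.6 * dat$cognitive + 0.3 * dat$consc + rnorm(n)

tree <- rpart(task_perf ~ ., data = dat)  # splits chosen automatically
plot(tree); text(tree)                    # leaves show predicted scores
```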


Example 2: Random Forests

• Draw a large number of bootstrapped samples (e.g., k = 500 samples, each based on ~2/3 of the data, drawn with replacement).

• For each sample, build a tree similar to the one just illustrated… but with a catch: at each node, only consider a random subset of the predictors that can split the node (the square root of the number of predictors is a default).

• This yields diverse trees: different variables at each node, different cutpoints at each variable.
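The randomForest package follows this recipe; a sketch on simulated data (mtry is the size of the random predictor subset considered at each node):

```r
# Random forest sketch: 500 bootstrap trees, sqrt(p) predictors per node.
library(randomForest)

set.seed(1)
n <- 500
x <- data.frame(matrix(rnorm(n * 12), n, 12))
y <- 0.5 * x[, 1] - 0.3 * x[, 2] + rnorm(n)

rf <- randomForest(x, y, ntree = 500, mtry = floor(sqrt(12)))
rf  # prints out-of-bag error, an honest estimate of prediction accuracy
```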


Example 2 (cont’)Random Forests

• For each tree, look at the “out of bag” (oob) data – data that did NOT participate in building the tree. Take these data and run them down the tree to get their predicted scores.

• For each case, average the predicted scores across trees.

case   ŷ Tree 1    ŷ Tree 2    …    ŷ Tree k    Average predicted ŷ
1      (in tree)   3.4         …    2.1         3.554
2      4.5         (in tree)   …    (in tree)   4.312
…      …           …           …    …           …
N      2.6         (in tree)   …    2.4         2.561
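In randomForest, calling predict() with no new data returns exactly that last column of out-of-bag averages; a sketch:

```r
# Out-of-bag predictions: each case is predicted only by trees that never
# saw it, then those predictions are averaged (the table's last column).
library(randomForest)

set.seed(1)
n <- 500
x <- data.frame(matrix(rnorm(n * 12), n, 12))
y <- 0.5 * x[, 1] - 0.3 * x[, 2] + rnorm(n)

rf <- randomForest(x, y, ntree = 500)
oob <- predict(rf)       # no newdata -> out-of-bag predictions, one per case
head(cbind(oob, y))
```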


Some thoughts while under the hood…

1. Thinking about the increasing amount of data companies have on hand:

• Some of those data are directly relevant to selection (lots of online applications, screening tests).
• Other data might be relatively indirect, but an argument can be made for selection (resume text mining).
• Still other data are indirect but difficult to justify even if predictive (e.g., time to complete an application online).

…What do we do in this situation?

If Big Data only captures the 3Vs (volume, velocity, variety) on an ever-expanding hard drive, it is useless.

Taylor 2013, HBR Blog: “We can amass all the data in the world, but if it doesn’t help to save a life, allocate resources better, fund the organization, or avoid a crisis, what good is it?”


Some thoughts while under the hood…

2. Useful 'signals' in data discovered through predictive modeling could be amplified by developing measures that collect more data (given enough development time, testing time, $…).

(Fayyad et al., 1996) (knowledge → new measures → new knowledge)


Some thoughts while under the hood…

3. So if predictions become more accurate and robust than ever… will we understand (theorize about, legally defend) them any better?

(related idea: why do we have reliability, why not just validity?)

Big Data analytics is a form of engineering.

Generally, our substantive research focuses on correlations, mean differences, etc. at the factor or scale level, not at the single-item level as big data might.

History tells us that item-level analyses can be hard to interpret (e.g., DIF). Interpretable surprises are hard to find.


Some thoughts while under the hood…

4. Big Data analytics provides reasons/opportunities to collaborate – if there is a culture for that.

HR Assessment + Analytics + IT + Management + Teams/Employees +….

Ties to other org functions (perhaps served by Big Data)


Thank you!

Fred Oswald

foswald@rice.edu

