Statistical Analysis with Big Data Dr. Fred Oswald Rice University CARMA webcast November 6, 2015 University of South Florida - Tampa, FL 1

Statistical Analysis with Big Data

Dr. Fred Oswald
Rice University

CARMA webcast
November 6, 2015

University of South Florida - Tampa, FL


Why should I-O and business researchers be interested in big data?

Witness stories about “big data” in HR and personnel selection, with little to no mention of the DECADES of personnel assessment expertise in HR and I-O psychology…


Adding insult to injury…


Overview

• Run LASSO regression and random forests, two predictive models that are relatively novel to organizational research and useful for big data (and small data).

• Demonstrate a philosophy: fit flexibly to improve prediction, but don’t overfit.

• Then discuss four ideas/implications related to predictive models.


Example 1: LASSO

Many R packages exist to perform 'big data' types of analyses, such as the LASSO (Least Absolute Shrinkage and Selection Operator).

LASSO not only estimates regression weights, as 'normal' (OLS) regression does; it also shrinks those weights, and many of them shrink all the way to zero.
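For reference, the LASSO criterion is commonly written as below. The notation is an assumption on my part rather than something from the slides: y_i is the outcome, x_ij the predictors, beta_j the weights, and lambda the penalty weight. The 1/(2N) scaling follows the convention used by, e.g., the glmnet package in R; other software scales the first term differently.

```latex
\hat{\beta}^{\text{LASSO}}
  = \arg\min_{\beta_0,\,\beta}\;
    \frac{1}{2N}\sum_{i=1}^{N}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2}
    \;+\; \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert
```

With lambda = 0 this is just OLS; as lambda grows, the absolute-value penalty pushes more and more weights to exactly zero.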

Yeehaw!


LASSO

Check out the coefficients across the range of values of the “tuning parameter,” lambda.

You can see where LASSO regression consistently selects predictive variables and excludes others.
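To make the idea of tracing coefficients across lambda concrete, here is a minimal sketch in plain Python. It is not the R code behind the slides: it uses coordinate descent (a different algorithm from the one discussed later in the deck), the data are simulated, and every name in it (soft_threshold, lasso_cd, the 4-predictor setup) is hypothetical.

```python
import random

def soft_threshold(z, lam):
    """Shrink z toward zero by lam; anything within lam of zero becomes exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def standardize(col):
    """Center a column and scale it to (population) SD = 1."""
    n = len(col)
    m = sum(col) / n
    sd = (sum((v - m) ** 2 for v in col) / n) ** 0.5
    return [(v - m) / sd for v in col]

def lasso_cd(X, y, lam, n_sweeps=100):
    """Coordinate descent for (1/2N)*RSS + lam*sum|b|.
    X is a list of standardized predictor columns; y is centered."""
    n, p = len(y), len(X)
    b = [0.0] * p
    resid = list(y)
    for _ in range(n_sweeps):
        for j in range(p):
            xj = X[j]
            # Covariance of predictor j with the residual, with j's own part added back.
            rho = sum(xj[i] * resid[i] for i in range(n)) / n + b[j]
            new = soft_threshold(rho, lam)
            if new != b[j]:
                delta = new - b[j]
                for i in range(n):
                    resid[i] -= delta * xj[i]
                b[j] = new
    return b

# Simulated data (an assumption for illustration): only the first two predictors matter.
random.seed(1)
N = 200
X = [standardize([random.gauss(0, 1) for _ in range(N)]) for _ in range(4)]
y_raw = [1.0 * X[0][i] + 0.3 * X[1][i] + random.gauss(0, 1) for i in range(N)]
y_mean = sum(y_raw) / N
y = [v - y_mean for v in y_raw]

# Smallest lambda at which every weight is zero.
lam_max = max(abs(sum(xj[i] * y[i] for i in range(N)) / N) for xj in X)

# Walk lambda from lam_max down to 0 (the OLS end) and print the weights.
for frac in (1.0, 0.5, 0.25, 0.1, 0.0):
    b = lasso_cd(X, y, frac * lam_max)
    print(f"lambda = {frac * lam_max:.3f}  weights = {[round(v, 3) for v in b]}")
```

Reading the printed rows from top to bottom shows the predictors entering one at a time as lambda drops, which is what the coefficient plot displays.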

Yeehaw!


LASSO

Let's see what variables tend to predict the job-search behaviors of employed managers (Boudreau, 2001).

12 predictors of job search [you would use actual data – but I simulated 1,000 cases here]


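[Figure: trace of all LASSO solutions as lambda varies, running from “Nothin’” (every weight at zero) to the OLS solution. Each weight is estimated considering all other predictors at a given value of lambda. Labeled predictors: job satisfaction (less sat > search), compensation (less $ > search), gender (female > search).]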


Least Angle Regression = LARS (graph = the trace of all LASSO soln's)

I think of LARS as Tiptoe Regression™, as it is more cautious than stepwise regression…

• Start with all coefficients = 0.
• The predictor with the highest validity is the first one entered.
• But now don't step, tiptoe… Increase the regression weight from 0 toward its 'normal' regression value until one of the other predictors correlates with the residual (y - yhat) just as highly.
• Enter that predictor next.
• Now move the weights of those two predictors toward their 'normal' regression solution, until a third predictor correlates equally with the residual. Enter that one.
• Keep goin' until all predictors are entered.

This method efficiently provides all LASSO solutions (given one small modification: a predictor is dropped again if its weight hits zero along the way)!
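The first "tiptoe" can be sketched in a few lines of plain Python. This is only the first step of LARS, not the full algorithm, and it assumes standardized predictors with no perfectly collinear pair; the function names and the simulated data are hypothetical.

```python
import random

def standardize(col):
    """Center a column and scale it to (population) SD = 1."""
    n = len(col)
    m = sum(col) / n
    sd = (sum((v - m) ** 2 for v in col) / n) ** 0.5
    return [(v - m) / sd for v in col]

def corr_with(x, r):
    """Average cross-product; a correlation when both inputs are standardized."""
    return sum(a * b for a, b in zip(x, r)) / len(r)

def lars_first_step(X, y):
    """Return (first predictor, its weight after the first tiptoe, next predictor to enter)."""
    c = [corr_with(xj, y) for xj in X]
    j = max(range(len(X)), key=lambda k: abs(c[k]))   # highest validity enters first
    C = abs(c[j])
    s = 1.0 if c[j] > 0 else -1.0
    best_gamma, best_k = C, None                      # gamma = C would be the full 'normal' weight
    for k in range(len(X)):
        if k == j:
            continue
        a = s * corr_with(X[k], X[j])
        # Two ways predictor k can tie with j: same-signed or opposite-signed correlation.
        for gamma in ((C - c[k]) / (1 - a), (C + c[k]) / (1 + a)):
            if 0 < gamma < best_gamma:
                best_gamma, best_k = gamma, k
    return j, s * best_gamma, best_k

# Simulated data (an assumption for illustration).
rng = random.Random(7)
N = 300
X = [standardize([rng.gauss(0, 1) for _ in range(N)]) for _ in range(3)]
y_raw = [0.8 * X[0][i] + 0.4 * X[1][i] + rng.gauss(0, 1) for i in range(N)]
y_mean = sum(y_raw) / N
y = [v - y_mean for v in y_raw]

first, weight, nxt = lars_first_step(X, y)
resid = [y[i] - weight * X[first][i] for i in range(N)]
print("first predictor:", first, "weight so far:", round(weight, 3), "next to enter:", nxt)
print("|corr with residual|:", [round(abs(corr_with(xj, resid)), 4) for xj in X])
```

The second printed line shows the tie: the entered predictor and the next one correlate equally (in absolute value) with the residual, which is the cue to enter the next predictor.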


LASSO is swell, but also check out… elastic net!

When several predictors are correlated, LASSO will tend to select one of them rather than the group.

The elastic net will encourage selecting the group of predictors in this case:

Elastic net encourages parsimony (like LASSO) yet also tends to keep groups of related variables together when they are predictive (like ridge regression, which shrinks the weights of correlated predictors toward one another rather than picking one).
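One common way to write the elastic net criterion is below. The mixing weight alpha and the rest of the notation are assumptions on my part (this is the parameterization used by, e.g., the glmnet package in R), not something shown on the slides.

```latex
\hat{\beta}^{\text{EN}}
  = \arg\min_{\beta_0,\,\beta}\;
    \frac{1}{2N}\sum_{i=1}^{N}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2}
    \;+\; \lambda\Bigl[\alpha\sum_{j=1}^{p}\lvert\beta_j\rvert
    \;+\; \frac{1-\alpha}{2}\sum_{j=1}^{p}\beta_j^{2}\Bigr]
```

Setting alpha = 1 gives the LASSO penalty, alpha = 0 gives the ridge penalty, and values in between blend the two.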

[Figure panels: OLS regression, ridge regression, LASSO, elastic net]


In general: Cross-validation

As mentioned, other weights will work better across other samples (e.g., unit weights, but…something better?)

How to find out?
o Train the model on a given set of data (develop the weights)
o Test the model on a fresh set of data (apply the weights to new data; how good are predictions?)


LASSO: k-fold Cross-validation

o Train the model on a given set of data (develop the weights)
o Test the model on a fresh set of data (apply the weights to new data; how good are predictions?)

10-fold cross-validation:
o Divide the data into 10 equal parts or “folds” (randomly, or stratified random)
o Develop the model on 9 of the folds; test the model on the remaining fold
o Do this for each fold in turn, so that each fold participates in generating model predictions
o Average the prediction results across the 10 folds
o [demonstrate w/ LASSO regression]
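Here is a minimal sketch of 10-fold cross-validation for choosing lambda, in plain Python. To keep it short it uses a one-predictor LASSO, where the solution is just a soft-thresholded slope; the simulated data, the candidate lambdas, and all function names are assumptions for illustration, and each model is trained on 9 folds and tested on the held-out fold.

```python
import random

def soft_threshold(z, lam):
    """Shrink z toward zero by lam; anything within lam of zero becomes 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def fit(x, y, lam):
    """One-predictor LASSO for (1/2N)*RSS + lam*|slope|; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sxx = sum((a - mx) ** 2 for a in x) / n
    slope = soft_threshold(sxy, lam) / sxx
    return my - slope * mx, slope

def make_folds(n, k, rng):
    """Randomly split case indices 0..n-1 into k folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[f::k] for f in range(k)]

def cv_error(x, y, lam, folds):
    """Train on all folds but one, test on the held-out fold, average the test MSEs."""
    fold_mse = []
    for test in folds:
        held_out = set(test)
        train = [i for i in range(len(x)) if i not in held_out]
        b0, b1 = fit([x[i] for i in train], [y[i] for i in train], lam)
        errs = [(y[i] - (b0 + b1 * x[i])) ** 2 for i in test]
        fold_mse.append(sum(errs) / len(errs))
    return sum(fold_mse) / len(fold_mse)

# Simulated data and a hypothetical grid of lambda values.
rng = random.Random(2015)
N = 100
x = [rng.gauss(0, 1) for _ in range(N)]
y = [0.5 * xi + rng.gauss(0, 1) for xi in x]
folds = make_folds(N, 10, rng)
lambdas = [0.0, 0.05, 0.1, 0.2, 0.4, 0.8]

errors = {lam: cv_error(x, y, lam, folds) for lam in lambdas}
best_lam = min(errors, key=errors.get)
for lam in lambdas:
    print(f"lambda = {lam:<5} cross-validated MSE = {errors[lam]:.4f}")
print("lambda with the lowest cross-validated error:", best_lam)
```

The same folds are reused for every lambda so that the candidates are compared on identical splits.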


LASSO k-fold Cross-validation

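[Figure: LASSO k-fold cross-validation results across the tuning parameter, with the # of predictors also shown; the “optimal” and “simpler” solutions are marked.]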


Example 2: Random Forests

First look at trees, used to classify cases or (as here) to predict scores:

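[Figure: example tree. Each node splits on a predictor into high (> X) and low (≤ X) branches. Split variables: Cognitive Ability, Conscientiousness, Biodata, Teamwork, Openness. Leaves give Predicted Task Performance Scores: 3.6, 4.1, 4.3, 5.6, 3.8, 3.2.]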


Example 2: Random Forests

• Draw a large number of bootstrapped samples (e.g., k = 500 samples, each based on roughly 2/3 of the cases, since sampling is w/ replacement).
• For each sample, build a tree similar to the one just illustrated… but with a catch: At each node, only consider a random subset of the predictors that can split the node (square root of # all predictors is a common default).
• This yields diverse trees: different variables at each node, different cutpoints at each variable.
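The two sources of randomness described above can be sketched in plain Python. This is only the sampling part, not a tree-building routine, and the sizes (300 cases, 12 predictors, 500 trees) and function names are assumptions for illustration.

```python
import math
import random

rng = random.Random(2015)
N, P, K = 300, 12, 500   # cases, predictors, trees (hypothetical sizes)

def bootstrap_sample(n, rng):
    """Draw n case indices with replacement; some cases repeat, others are left out."""
    return [rng.randrange(n) for _ in range(n)]

def candidate_predictors(p, rng):
    """Random subset of predictors allowed to split a node (square-root rule)."""
    m = max(1, math.isqrt(p))
    return rng.sample(range(p), m)

# How much of the data does each bootstrapped sample actually touch?
unique_fracs = []
for _ in range(K):
    in_bag = bootstrap_sample(N, rng)
    unique_fracs.append(len(set(in_bag)) / N)
mean_unique = sum(unique_fracs) / K

print(f"average share of distinct cases per sample: {mean_unique:.3f}")
print("predictors considered at one node:", sorted(candidate_predictors(P, rng)))
```

The cases left out of a given sample are the ones that can later be used to check that tree's predictions.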


Example 2 (cont’d): Random Forests

• For each tree, look at the “out of bag” (oob) data – data that did NOT participate in building the tree. Take these data and run them down the tree to get their predicted scores.
• For each case, average the predicted scores across the trees where that case was out of bag.

case   ŷ Tree 1    ŷ Tree 2    …   ŷ Tree k    Average Predicted ŷ
1      (in tree)   3.4         …   2.1         3.554
2      4.5         (in tree)   …   (in tree)   4.312
…      …           …           …   …           …
N      2.6         (in tree)   …   2.4         2.561
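The averaging in the table can be sketched as follows. The numbers are made up for the example (the table above elides most of its columns, so its averages cannot be reproduced), and oob_average is a hypothetical helper, not a function from any package.

```python
def oob_average(preds, in_bag):
    """preds[t][i] is tree t's prediction for case i; in_bag[t] is the set of cases
    used to build tree t. Each case is averaged only over trees that did not see it."""
    n_cases = len(preds[0])
    averages = []
    for i in range(n_cases):
        scores = [preds[t][i] for t in range(len(preds)) if i not in in_bag[t]]
        # A case that was in every tree has no out-of-bag prediction.
        averages.append(sum(scores) / len(scores) if scores else None)
    return averages

# Made-up example: 3 trees, 4 cases.
preds = [
    [3.0, 4.5, 2.0, 2.6],   # tree 0
    [3.4, 4.0, 2.2, 2.0],   # tree 1
    [2.0, 4.1, 2.4, 2.4],   # tree 2
]
in_bag = [{0, 2}, {1, 2}, {1, 2, 3}]

result = oob_average(preds, in_bag)
print(result)
```

With hundreds of trees, nearly every case is out of bag for many of them, so the None case is rare in practice.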


Some thoughts while under the hood…

1. Thinking about the increasing amount of data companies have on hand:
• Some of those data are directly relevant to selection (lots of online applications, screening tests).
• Other data might be relatively indirect, but an argument can be made for selection (resume text mining).
• Still other data are indirect but difficult to justify even if predictive (e.g., time to complete an application online).
…What do we do in this situation?

If Big Data only captures the 3Vs (volume, velocity, variety) on an ever-expanding hard drive, it is useless.

Taylor 2013, HBR Blog: “We can amass all the data in the world, but if it doesn’t help to save a life, allocate resources better, fund the organization, or avoid a crisis, what good is it?”


Some thoughts while under the hood…

2. Useful ‘signals’ in data discovered through predictive modeling could be amplified by developing measures that collect more data (given enough development time, testing time, $...).

(Fayyad et al., 1996)
(knowledge → new measures → new knowledge)


Some thoughts while under the hood…

3. So if predictions become more accurate and robust than ever… will we understand (theorize about, legally defend) them any better?

(related idea: why do we have reliability, why not just validity?)

Big Data analytics is a form of engineering.

Generally, our substantive research focuses on correlations, mean differences, etc. at the factor or scale level, not at the single-item level as big data might.

History tells us that item-level analyses can be hard to interpret (e.g., DIF). Interpretable surprises are hard to find.


Some thoughts while under the hood…

4. Big Data analytics provides reasons/opportunities to collaborate – if there is a culture for that.

HR Assessment + Analytics + IT + Management + Teams/Employees +….

Ties to other org functions (perhaps served by Big Data)


Thank you!

Fred Oswald

[email protected]
