using data mining to model freshmen outcomes
TRANSCRIPT
Office of Institutional Research, Planning & Effectiveness
Using Data Mining to Model
Freshmen Outcomes
42nd NEAIR Annual Conference
Burlington, VT
Nora Galambos, PhD
Senior Data Scientist
Office of Institutional Research, Planning & Effectiveness
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
1 2 3 4 5 6 7 8 9 10
Pe
rcen
t G
rad
uat
ed
First Semester GPA Deciles
Four and Six Year Graduation Rates
Graduated in 4 Years Graduated in 6 Years
Graduation Rates of First-time Full-time Freshmenby First Semester GPA Deciles*
*The fall freshmen cohorts of 2006 through 2008 were combined.
Office of Institutional Research, Planning & Effectiveness
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Returned Yr. 2 Returned Yr. 3 Grad. in 4 Yrs. Grad. In 6 Yrs.
GPA >=2.5 GPA<2.5
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Returned Yr. 2 Returned Yr. 3 Grad. in 4 Yrs. Grad. In 6 Yrs.
GPA >=3.0 GPA<3.0
Graduation and Retention of First-time Full-time Freshmenby First Semester GPA*
Approximately 30% have a GPA below 2.5. 3.0 is approximately the median GPA.
*The fall freshmen cohorts of 2006 through 2008 were combined.
Office of Institutional Research, Planning & Effectiveness
Using Data Mining for the Early Predictionof Freshmen Outcomes
• Enables the extraction of information from large amounts of data.
• Incorporates analytic tools for data-driven decision making.
• Uses modeling techniques to apply results to future data.– The goal is to develop a model rather than finding factors
significantly associated with the outcomes.
• Incorporates statistics, pattern recognition, and mathematics.
• Few assumptions to satisfy relative to traditional hypothesis driven methods.
• A variety of different methods for different types of data and predictive needs.
• Able to handle a great volume of data with hundreds of predictors.
Office of Institutional Research, Planning & Effectiveness
Data Mining: Training and Validationto Develop an Accurate Model
• The purpose of the analysis is to predict the GPA of first-time
full-time fall 2014 freshmen using high school characteristics,
demographics, and academic information from the first half of
the first term—early enough to assign students to interventions.
• Need to find the correct level of model complexity.
• A model that is not complex enough may lack the flexibility
to represent the data, under-fitting.
• When the model is too complex it can be influenced by
random noise, over-fitting.
• For example, if there are outliers, an overly complex model
will be fit to them. Then when the model is run on new
data, it may be a poor fit. A poor fitting model will not do a
good job in making predictions using new data.
Data Partitioning • Partitioning is used to avoid over- or under-fitting. Divide the data
into three parts: training, validation, and testing.
• The training partition is used to build the model.
• The validation partition is set aside and is used to test the
accuracy and fine tune the model.
• The prediction error is calculated using the validation data.
• An increase in the error in the validation set may be caused by
over-fitting. The model may need modification.
• Often 60% is used for training the model and 40% is used for
validation--or 40% for training, 30% for validation, and 30% for
testing
• Problem: With only 2,852 fall freshmen and over 50 variables,
there is not enough data available to develop a model to
predict the 300 to 500 low GPA students using traditional
partitioning.
Office of Institutional Research, Planning & Effectiveness
Office of Institutional Research, Planning & Effectiveness
• Why use k-fold cross-validation?
• Works with limited data.
• The initial steps are the similar to traditional data analysis.
• The entire dataset is used to choose the predictors.
• Cross-validation is used to evaluate the model, not to develop the model.
• The error is estimated by averaging the error of the K test samples.
• In subsequent years, once more data has been collected, the easier to implement training-validation partitioning method can be used.
K-fold Cross-validationfor Evaluating Model Performance
Office of Institutional Research, Planning & Effectiveness
• The sample is divided into K equal groups, or folds.
• For this data with the focus on finding the lowest GPA students, only 5 folds were used.
• Next the model is run K times, however each time, one fold is left out.
• For five folds, four are for training and one is for validation
• The procedure is performed K times (in this case five times), each time leaving out a different validation sample
K-fold Cross-validationfor Evaluating Model Performance
Office of Institutional Research, Planning & Effectiveness
Train
TrainValidate Train TrainTrain
TrainValidate TrainTrain Train
TrainValidateTrainTrain Train
TrainValidateTrainTrain Train
Train ValidateTrainTrain TrainK = 1
K = 2
K = 3
K = 4
K = 5
5-Fold Cross Validation PlanFor each Ki the entire dataset was divided into 5 equal parts
Office of Institutional Research, Planning & Effectiveness
Data Mining Methods Compared
• Five data mining methods were compared*
• CHAID: Chi-Square Automatic Interaction Detection
• BFOS – CART (Breiman, Friedman, Olshen, Stone
Classification and Regression Trees)
• Decision Tree Combined Methods
• Gradient Boosting
• Linear Regression
• The validation and training average squared errors (ASE)
were averaged over each of the five-fold runs to
determine the method that has, on average, the smallest
error and the smallest difference between the validation
and training errors.
* Software settings were used to approximate the different decision tree methods.
Office of Institutional Research, Planning & Effectiveness
BFOS CART MethodBreiman, Friedman, Olshen, StoneClassification and Regression Trees
• The method does an exhaustive search for the best binary split.
• It splits categorical predictors into a smaller number of groups or
finds the optimal split in numerical measures.• Each successive split is again split in two until no further splits are
possible.
• The result is a tree of maximum possible size, which is then pruned
back.
• For interval targets the variance is use to assess the splits; For
nominal targets the Gini impurity measure is used.
• Pruning starts with the split that has the smallest contribution to the
model
• The missing data is assigned to the largest node of a split
• Creates a set of nested binary decision rules to predict an outcome.
Office of Institutional Research, Planning & Effectiveness
• Unlike CART with binary splits evaluated by misclassification
measures, the CHAID algorithm uses the chi-square test (or
the F test for interval targets) to determine significant splits
and find independent variables with the strongest association
with the outcome. A Bonferroni correction to the p-value is
applied prior to the split.
• It may find multiple splits in continuous variables, and allows
splitting of categorical data into more than two categories.
• As with CART, CHAID allows different predictors for different
sides of a split.
• The CHAID algorithm will halt when statistically significant
splits are no longer found in the data.
CHAID: Chi-squared Automatic Interaction Detection
Office of Institutional Research, Planning & Effectiveness
• Up to four-way splits (4 branches) were allowed as opposed
to the CART binary split.
• The F test was used to evaluate the variance of the nodes.
• The depth of the overall tree was restricted to 6 levels.
• Missing values were assigned to produce an optimal split.
• The Average Squared Error (ASE) was used to evaluate the
subtrees.
• The cross validation option was selected in order to perform
the cross validation procedure for each subtree.
• That results in a sequence of estimates that were used
to select the subtree.
Decision Tree Combined Methods
Office of Institutional Research, Planning & Effectiveness
Gradient Boosting
• Uses a partitioning algorithm developed by Jerome Friedman.
• Resamples the data set a number of times without replacement.• A random sample is drawn at each iteration from the training data set and the
sample is used to update the model.
• The successive resampling results in a weighted average of the re-sampled
data.
• The weights assigned at each iteration improve the accuracy of the
predictions.
• The result is a series of decision trees, each one adjusted with new
weights based on the accuracy of the estimates or classifications of the
previous tree.
• Because the results at each stage are weighted and combined into a final
model, there is no resulting tree diagram.• The scoring code that is generated allows the model to be used to score new
data for predicting outcomes.
Office of Institutional Research, Planning & Effectiveness
Linear Regression in a Data Mining Environment
• Imputation was used to replace missing values.• Decision tree methods are able to handle missing values by combining
them with another category or placing them in a category with other values.
They can also be replaced by using surrogate rules.
• Linear regression, on the other hand, will listwise delete the missing
values.
• In student data, many predictors do not have complete data—e.g.,
financial aid measures (not all students apply for financial aid), SAT
scores (a certain small percentage of the entering freshmen do not
have scores), some students may not have courses utilizing the LMS.
• These measures result in an excessive amount of listwise deletion.
• The distribution method was used to replace the missing data
• Values are calculated from random percentiles of the distributions of
the predictors.
• If the linear regression appears promising, other imputation methods can
be explored and studied in greater detail.
Office of Institutional Research, Planning & Effectiveness
Linear Regression in a Data Mining Environment
• Clustering to reduce multicollinearity
• The other difficulty in using linear regression with so many
variables is multicollinearity. With a large volume of
predictors, it would be difficult and time consuming to
evaluate all of the potential multicollinearity issues.
• Clustering was used to group highly correlated variables.
• In each cluster, the variable with the highest correlation
coefficient was retained and entered into the modeling
process, and the others were eliminated.
Office of Institutional Research, Planning & Effectiveness
Predictive Measures
• Demographics
• Gender, ethnicity, geographic residence when admitted.
• Pre-college academic characteristics
• SAT scores, high school GPA, average SAT scores of the high
school (to control for high school GPA).
• College academic characteristics
• Credits accepted when admitted, AP credits, number of STEM and
non-STEM courses enrolled in, enrollment in high DFW courses,
area of major.
• Transactions, service utilization, activities.
• Learning management system (LMS) logins, advising visits, tutoring
center utilization, intramural and fitness class participation.
• Financial aid
• Expected family contribution, AGI, types and amounts of disbursed
aid, Pell, Tuition Assistance Program (TAP).
Office of Institutional Research, Planning & Effectiveness
About the Data
• Much of the data pertaining to interactions with student services and
learning management system logins has not been stored long term.
As a result, part of the data mining process includes collecting, saving,
and storing the data.
• Programs are being developed to automate the formatting and
aggregation of the transactional data so it can be merged with student
records and utilized in the data mining process.
• For the modeling use of the learning management system logins,
the data were summarized as follows:
• Only one login per course per hour was counted, so a course
can have up to 24 logins per day. This eliminated multiple
logins that occurred just few minutes apart.
Office of Institutional Research, Planning & Effectiveness
About the Data Continued
• Total logins were summarized as:
• Each student’s courses were categorized as STEM or non-STEM
• The STEM and non-STEM logins were totaled for week 1 and separately
for weeks 2 through 6.
• STEM and non-STEM logins were divided by the respective STEM and
non-STEM course totals to obtain per-course login rates (by STEM and
non-STEM).
• High DFW rate courses
• Courses with an enrollment of 70 or more having high percentages of
D’s, F’s, and W’s were identified and categorized as STEM and non-
STEM
• The number of high DFW courses, and the highest DFW rate for each
student was included in the model.
• Research and evaluation of the methods for summarizing and using the
data in the model is ongoing. Additional data sources will be added.
Office of Institutional Research, Planning & Effectiveness
Average Squared Error (ASE) Results for the Five Data Mining Methods
• ASE = (Sum of Squared Errors)/N
• Gradient boosting had the smallest average ASE followed by CART
• Gradient boosting and BFOS-CART, on average, had the smallest differences
between the validation and training errors
• The CART method was chosen for the modeling process.
• Relatively low ASE.
• Gradient boosting, without an actual tree diagram, would make the results more
difficult to explain to administrators.
K Folds
Validation
ASE
Training
ASE
Validation
ASE
Training
ASE
Validation
ASE
Training
ASE
Validation
ASE
Training
ASE
Validation
ASE
Training
ASE
1 0.333 0.363 0.394 0.427 0.444 0.355 0.421 0.335 0.374 0.396
2 0.353 0.358 0.425 0.423 0.479 0.325 0.432 0.330 0.477 0.388
3 0.377 0.351 0.429 0.432 0.508 0.312 0.472 0.325 0.515 0.363
4 0.391 0.351 0.436 0.433 0.510 0.304 0.495 0.304 0.522 0.376
5 0.422 0.343 0.525 0.393 0.511 0.345 0.515 0.312 0.561 0.371
Average
ASE 0.375 0.353 0.442 0.422 0.490 0.328 0.467 0.321 0.490 0.379
Gradient Boosting BFOS-CART CHAID Decision Tree Linear Regression
Office of Institutional Research, Planning & Effectiveness
Variable* Relative ImportanceHigh School GPA 1.0000Scholarship Aid 0.9643Total AP non-STEM course credits 0.8980Total AP STEM course credits 0.8729LMS logins per STEM courses weeks 2 -6 0.8619Total LMS STEM course logins, weeks 2 -6 0.8542LMS logins per non-STEM courses, weeks 2 -6 0.8214Area of residence when admitted 0.7921Total LMS non-STEM logins, weeks 2 – 6 0.7888Declared major 0.6902Total non-STEM units 0.6859Total LMS non-STEM course logins, week 1 0.6712Total STEM units 0.5789Avg. high school SAT Math, CR, writing score 0.5577Student SAT Math + CR 0.5540Avg. high school SAT CR score 0.5357Total LMS STEM course logins, week 1 0.5307Avg. high school SAT Math, CR total score 0.5176Total STEM courses 0.5119Avg. high school SAT Math score 0.5080Total non-STEM courses 0.4808Type of math course in term 1 0.4636
Variable Relative ImportanceTotal STEM courses using LMS 0.4258Advising visits, week 1 pertaining to registration 0.3826Ethnic group 0.3609Highest DFW rate in non-STEM course 0.3425Student SAT Math score 0.3197Total non-STEM courses using LMS 0.3115Total Athletics Aid 0.2736Total high DFW STEM units 0.2714Intramural sports participation 0.2548Tutoring Center visits for STEM courses, wks 1 – 6 0.2533Fitness Class attendance 0.2378Student SAT CR score 0.2146Highest DFW STEM rate 0.1868Honors College or Women in Science & Eng. 0.1827Total high DFW STEM courses 0.1624Stony Brook Math Placement Exam score 0.1500Student SAT Writing Score 0.1495Total grant aid 0.1436% of freshmen in highest DFW rate STEM course 0.1191Total loans distributed (per Fin. Aid Off. Records) 0.1155Advising visit during week 1, not registration-related 0.1149% of 1st-years in highest DFW rate non-STEM crs 0.0721
Relative Importance of Variables as Evaluated by CART
* Measures selected by CART for the model appear in red.
The “importance” evaluates a measure’s contribution to the model, utilizing the relative reduction
in the sum of squared errors when a node is split..
Office of Institutional Research, Planning & Effectiveness
Assessment Score Distribution
Range for Avg. F14 Number of Model
Predicted GPA Observations Score
3.64 - 3.76 3.76 37 3.70
3.51 - 3.64 3.60 459 3.57
3.39 - 3.51 3.46 257 3.45
3.27 - 3.39 3.35 78 3.33
3.14 - 3.27 3.23 344 3.21
3.02 - 3.14 3.08 665 3.08
2.90 - 3.02 2.93 478 2.96
2.65 - 2.78 2.74 89 2.71
2.53 - 2.65 2.61 362 2.59
2.41 - 2.53 2.52 16 2.47
2.04 - 2.16 2.12 18 2.10
1.92 - 2.04 1.94 25 1.98 1.55 - 1.67 1.59 13 1.61
1.30 - 1.43 1.30 11 1.36
Office of Institutional Research, Planning & Effectiveness
HS GPA<=92.0LMS logins per non-STEM crs, wk 2-6 >=11.3 or
missing
Avg. HS SAT CR >570
SAT Math CR >1360
AP STEM
Crs>=1
Avg. GPA = 3.63
N = 46
AP Stem
Crs = 0
Avg. GPA = 3.20
N = 23
SAT Math CR <=1360
Highest DFW STEM Crs.
Rate>= 17%
Avg. GPA = 2.92
N= 34
Highest DFW STEM Crs. Rate <17%
Avg. GPA = 3.25
N=94
Avg. HS SAT CR<=570
Logins per STEM crs,
wk 2-6 >=32.2
SAT Math >=680
Avg. GPA = 3.35
N=78
SAT Math<680 or miss.
Avg. GPA = 3.09
N = 121
Logins per STEM crs,
wk 2-6 <32.2
Non-STEM
crs logs > = 3 or
miss.
Avg. GPA = 2.94
N = 371
Non-STEM
crs logins
<3
Avg. GPA = 2.53
N = 57
LMS logins per non-STEM crs, wks 2-6<11.3
Avg. HS SAT CR >=540
AP STEM Crs. >=1
STEM crs logs
Wk. 1>=5 or
miss.
Avg. GPA = 3.21
N = 64
STEM crs logs
Wk 1 < 5
Avg. GPA = 2.69
N=16
AP STEM Crs = 0
STEM logs
Wk. 1 >=5 or miss.
Avg. GPA = 2.75
N = 73
STEM crs logs Wk. 1
<5
Avg. GPA = 2.12
N= 18
Avg. HS SAT CR < 540
Logs per STEM crs, wk 2-6 >=5.3 or
miss
STEM crs logs
Wk 1 >=1 or miss.
Avg. GPA = 2.62
N = 305
STEM crs
kogs Wk 1 = 0
Avg. GPA = 1.94
N = 25
Logs per STEM
crs. wk 2-6 < 5.3
Avg. GPA = 1.59
N = 13
Decision Tree Model for F14 Freshmen GPA: Part 1—HS GPA <= 92.0
Office of Institutional Research, Planning & Effectiveness
HS GPA>92.0 or Missing
Scholarship = Yes
HS GPA >=96.5 or missing
Math Placement Exam >= 5
Logs per
STEM Crs.,
wks 2-6 >=15.6
Avg. GPA = 3.63 N = 285
Logs per
STEM Crs, wk
2-6 <15.6
Avg. GPA 3.40
N = 83
Math Placement Exam < 5
Ethnic Group
= White, Hisp.
Avg. GPA
=3.50
N= 73
Ethnic Group=Asian,
Afr. Amer., Unk.
Avg. GPA = 3.05
N=30
HS GPA < 96.5
Logs per non-STEM crs,wks
2-6 >=29.1
SAT Math>=700
Avg. GPA = 3.76 N=26
SAT Math<700 or miss.
Avg. GPA = 3.52
N = 74
Logs per non-STEM crs,wks
2-6 <29.1
Avg HS. CR, M Wrt
>=1830 miss
Avg. GPA = 3.59
N = 54
Avg. HS CR,
M, Wrt< 1830
Avg. GPA = 3.13
N = 54
Scholarship = No
LMS logins per non-STEM crs. Wk 2-6 >=10.4
AP STEM Crs. >=1
DFW STEM
Crs Total >=2
Avg. GPA = 3.23
N = 163
DFW STEM
Crs Total
<2
Avg. GPA = 3.49
N=101
AP STEM Crs = 0
SAT Math >=760
Avg. GPA = 3.76
N = 11
SAT Math <760
Avg. GPA = 3.03
N= 194
LMS logins per non-STEM crs. wk 2-6 < 10.4
Logs per STEM crs, wks 2-6 >=10.9 or
miss.
DFW non-STEM 1st yrs >=28%
Avg. GPA = 3.05
N = 72
DFW non-STEM 1st yrs <28%
Avg. GPA = 2.90
N = 73
Logs per STEM crs. wks 2 6 <
10.9
STEM Crs logs
Wk 1 >=8
Avg. GPA= 1.30 N=11
Avg. GPA = 2.52
N = 16
STEM Crs logs
Wk 1 <8 or miss
Decision Tree Model for F14 Freshmen GPA: Part 2—HS GPA > 92.0
Office of Institutional Research, Planning & Effectiveness
How Can the Results be Used?
• The model as presented can be used to assign students to
designed interventions, e.g., tutoring.
• Model results can be shared with departments to inform their
advising and intervention efforts.
• The goal is to find the students who need assistance to fulfill
their potential, and reduce the number who end up leaving due
to poor performance.