harness racing and sas

HARNESS RACING AND SAS: USING SAS TO MODEL HORSE RACES

Upload: jody-schechter

Post on 14-Dec-2014


Category:

Sports



DESCRIPTION

Harvard Stats 135 midterm project evaluating SAS techniques.

TRANSCRIPT

Page 1: Harness Racing and SAS

HARNESS RACING AND SAS: USING SAS TO MODEL HORSE RACES

Page 2: Harness Racing and SAS

DATA SET

• “Past Performance” from TrackMaster for the September 26, 2013 races at Yonkers Raceway

• Published in advance of the race

• Cost: $1.50

• Comes in XML format; parsed using Python

• Contains the 10 most recent past performances (PPs) for each horse racing that day

• 12 races x 8 horses x 10 past performances = 960 records

• Variables of use: lengths back at each quarter, final time, lead final time, gait, age (meta), track condition, track name, track length

• Created race-level, horse-race-level, and longitudinal data sets for different aspects of this analysis
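The XML-to-records step can be sketched in Python roughly as follows. This is only an illustration: the element and attribute names (`card`, `race`, `horse`, `pp`, `final_time`, and so on) are hypothetical stand-ins, not TrackMaster's actual schema, and the sample values are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout: element and attribute names are illustrative
# assumptions, not TrackMaster's real schema. Values are made up.
SAMPLE_XML = """
<card track="Yonkers Raceway" date="2013-09-26">
  <race number="1">
    <horse name="Horse A">
      <pp final_time="117.2" gait="P" condition="FT"/>
      <pp final_time="118.0" gait="P" condition="GD"/>
    </horse>
    <horse name="Horse B">
      <pp final_time="119.4" gait="P" condition="FT"/>
    </horse>
  </race>
</card>
"""


def flatten_past_performances(xml_text):
    """Return one dict per past-performance line (horse-race level)."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for race in root.iter("race"):
        for horse in race.iter("horse"):
            for pp in horse.iter("pp"):
                records.append({
                    "race": int(race.get("number")),
                    "horse": horse.get("name"),
                    "final_time": float(pp.get("final_time")),
                    "gait": pp.get("gait"),
                    "condition": pp.get("condition"),
                })
    return records


records = flatten_past_performances(SAMPLE_XML)

# Size of the full card as stated on the slide:
# 12 races x 8 horses x 10 past performances each.
expected_full_card = 12 * 8 * 10
```

A flat list of dicts like this can be written out as CSV and read into SAS, which is one plausible way to get from the XML file to the race-level and horse-race-level data sets.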

Page 3: Harness Racing and SAS

GAIT AND CONDITION

• Hypothesis: Gait and track condition influence race time

• Gait

  • Binary: Pacers and Trotters

  • Each race is one or the other

  • Each horse is one or the other

• Condition

  • Categorical: Fast, Good, or Sloppy

  • Each race is categorized into one

• Created and cleaned race-level data set

• A comparison of means showed that mean race times differ for both variables

• T-tests showed these differences are statistically significant
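For readers who want to see what the two-sample comparison computes, here is a minimal Python sketch of an unequal-variance (Welch) t statistic with Satterthwaite degrees of freedom. The race times are invented for illustration, and the slides do not say which variant of the test was used; SAS's PROC TTEST reports both the pooled and the Satterthwaite versions.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, Satterthwaite degrees of freedom)."""
    mean_a, mean_b = mean(sample_a), mean(sample_b)
    var_a, var_b = variance(sample_a), variance(sample_b)  # sample variances
    n_a, n_b = len(sample_a), len(sample_b)

    # Squared standard error of the difference in means.
    se2 = var_a / n_a + var_b / n_b
    t_stat = (mean_a - mean_b) / math.sqrt(se2)

    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = se2 ** 2 / (
        (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
    )
    return t_stat, df


# Made-up final times in seconds, NOT the project's data.
pacer_times = [116.0, 117.0, 118.0, 117.0]
trotter_times = [120.0, 121.0, 122.0, 121.0]

t_stat, df = welch_t(pacer_times, trotter_times)
```

A negative t statistic here means the first group (pacers) has the lower mean time in the toy data.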

Page 4: Harness Racing and SAS

REMOVING OUTLIERS

Page 5: Harness Racing and SAS

REMOVING OUTLIERS

Page 6: Harness Racing and SAS

GAIT T-TEST

Page 7: Harness Racing and SAS

CONDITION T-TEST

Page 8: Harness Racing and SAS

CORRELATION: LENGTHS BACK AT CALLS

• Some horses pull away early; others seem to wait for the last quarter to go to the front

• TrackMaster reports lengths back from the lead at the call at each quarter

• Lengths are recorded as fractional numbers (to the quarter) and as parts of a horse

  • Nose

  • Head

  • Neck

• Additional complication: “costly breaks” of pace and disqualification

• Still not satisfied: winners show strange lengths-back values at the final call
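Because lengths back mix quarter fractions with body parts, they need to be converted to numbers before any correlation can be computed. A minimal Python sketch of that cleaning step is below. Both the input string format and the numeric values assigned to nose, head, and neck are assumptions chosen for illustration; the slides do not say how these were coded.

```python
# Assumed numeric equivalents, in lengths. These values are illustrative
# guesses, not TrackMaster definitions or the author's coding.
BODY_PART_LENGTHS = {"nose": 0.05, "head": 0.10, "neck": 0.25}


def lengths_back_to_float(text):
    """Convert strings such as '2 1/4', '3/4', '5' or 'neck' to a float."""
    token = text.strip().lower()
    if token in BODY_PART_LENGTHS:
        return BODY_PART_LENGTHS[token]

    total = 0.0
    for part in token.split():
        if "/" in part:
            numerator, denominator = part.split("/")
            total += int(numerator) / int(denominator)
        else:
            total += float(part)
    return total


examples = ["2 1/4", "3/4", "5", "Neck", "nose"]
converted = [lengths_back_to_float(value) for value in examples]
```

Records flagged with a break of pace or a disqualification would still need separate handling, since their lengths back do not reflect a normal trip.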

Page 9: Harness Racing and SAS

CORRELATION OF LENGTHS BACK BY QUARTER

Page 10: Harness Racing and SAS

CORRELATION OF LENGTHS BACK BY QUARTER
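The correlation between lengths back at different calls can be illustrated with a small Pearson correlation sketch in Python. The numbers are invented, and the call names are placeholders; this is not the output shown on the original slides.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up lengths back for five horses at three calls.
calls = {
    "first_quarter": [0.0, 1.5, 3.0, 4.5, 6.0],
    "half": [0.0, 1.0, 3.5, 4.0, 7.0],
    "final": [1.0, 0.0, 2.0, 5.0, 8.0],
}

# Pairwise correlation matrix keyed by (call, call).
names = list(calls)
matrix = {(a, b): pearson_r(calls[a], calls[b]) for a in names for b in names}
```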

Page 11: Harness Racing and SAS

AGE AND SPEED

• Goal: Quantify how much horses slow down with age

• Merged metadata for each horse with past performance data

• Single-variable regression analysis of the mean data set

• Found that age is not a strong predictor of speed

• Age: Discrete, yet not categorical
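A single-variable regression of time on age reduces to a closed-form slope and intercept. The Python sketch below shows the calculation on made-up ages and times; it illustrates the technique only and does not reproduce the project's fit.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Made-up data: horse age in years and mean final time in seconds.
ages = [3, 4, 5, 6, 7]
mean_times = [118.0, 117.5, 118.5, 117.8, 118.4]

slope, intercept, r_squared = fit_line(ages, mean_times)
```

A slope near zero together with a low R-squared is the pattern one would expect when age explains little of the variation in time.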

Page 12: Harness Racing and SAS

MULTIVARIATE REGRESSION

• Longitudinal data set

• Created dummy variables for past and present track conditions, gaits, and track sizes

• Used SAS’s LAG and LAST features

• Removed disqualified races

• Modeled race time based on current race conditions and those of the two prior races
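The lagging step can be pictured with a Python analogue; this is not SAS code and not the author's DATA step. In SAS, the LAG function does not reset between BY groups on its own, so lagged values are usually blanked at the start of each horse's records. The sketch gets the same effect by keeping a separate history per horse. The field names (`horse`, `date`, `final_time`) are hypothetical.

```python
from collections import defaultdict


def add_lags(rows, n_lags=2):
    """Attach each horse's previous final times to its later races.

    rows: list of dicts with hypothetical keys 'horse', 'date', 'final_time'.
    Returns new dicts sorted by horse and date, with lag1_final_time,
    lag2_final_time, ... set to None where no earlier race exists.
    """
    history = defaultdict(list)  # per-horse list of earlier final times
    lagged = []
    for row in sorted(rows, key=lambda r: (r["horse"], r["date"])):
        past = history[row["horse"]]
        new_row = dict(row)
        for k in range(1, n_lags + 1):
            key = f"lag{k}_final_time"
            new_row[key] = past[-k] if len(past) >= k else None
        past.append(row["final_time"])
        lagged.append(new_row)
    return lagged


# Made-up races for two horses.
races = [
    {"horse": "A", "date": "2013-09-12", "final_time": 118.0},
    {"horse": "A", "date": "2013-09-05", "final_time": 119.0},
    {"horse": "B", "date": "2013-09-12", "final_time": 121.0},
    {"horse": "A", "date": "2013-09-19", "final_time": 117.5},
]

lagged = add_lags(races)
```

Rows whose lags are missing would be dropped before fitting a regression that uses both lagged times.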

Page 13: Harness Racing and SAS

MULTIVARIATE REGRESSION

Variables of Interest

Label             Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept                  104.67788          4.81142     21.76     <.0001
Lag final time               0.01412          0.03120      0.45     0.6510
Lag2 final time              0.11361          0.02975      3.82     0.0001
Pacer                       -3.68185          0.21247    -17.33     <.0001
Fast                        -0.77005          0.38954     -1.98     0.0484
Sloppy                       0.86942          0.43605      1.99     0.0465
Age                          0.05312          0.04023      1.32     0.1871
5/8 Track                   -2.74052          0.20313    -13.49     <.0001
1 Track                     -3.18411          0.47824     -6.66     <.0001

Control Variables

Label             Parameter Estimate   Standard Error   t Value   Pr > |t|
Fast lag                     0.35883          0.38598      0.93     0.3528
Sloppy lag                   0.48532          0.43151      1.12     0.2610
Fast lag2                    0.09472          0.37245      0.25     0.7993
Sloppy lag2                 -0.39904          0.42068     -0.95     0.3431
5/8 Track lag                0.14639          0.23680      0.62     0.5366
1 Track lag                  0.40192          0.51792      0.78     0.4379
5/8 Track lag2               0.58564          0.21764      2.69     0.0073
1 Track lag2                 0.67260          0.49172      1.37     0.1717

Final race times from previous races are not great determinants of final race time this race!
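Each t value in the regression output is the parameter estimate divided by its standard error. The short Python check below recomputes that ratio for the first nine rows, using the estimates and standard errors exactly as printed, and flags any row that disagrees with the reported t value beyond rounding.

```python
# (label, estimate, standard error, reported t value), copied from the output.
rows = [
    ("Intercept", 104.67788, 4.81142, 21.76),
    ("Lag final time", 0.01412, 0.03120, 0.45),
    ("Lag2 final time", 0.11361, 0.02975, 3.82),
    ("Pacer", -3.68185, 0.21247, -17.33),
    ("Fast", -0.77005, 0.38954, -1.98),
    ("Sloppy", 0.86942, 0.43605, 1.99),
    ("Age", 0.05312, 0.04023, 1.32),
    ("5/8 Track", -2.74052, 0.20313, -13.49),
    ("1 Track", -3.18411, 0.47824, -6.66),
]

# t = estimate / standard error; the reported values are rounded to two
# decimals, so allow a difference of up to half a unit in the last place.
recomputed = {label: est / se for label, est, se, _ in rows}
mismatches = [
    label
    for label, est, se, reported in rows
    if abs(est / se - reported) > 0.005
]
```

The lagged final times carry small coefficients (about 0.01 and 0.11 seconds per second of earlier time), while gait and track size shift the prediction by several seconds.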

Page 14: Harness Racing and SAS

PREDICTION OF SEPTEMBER 26 RACES

[Chart: Predicting the Winner (Right vs. Wrong)]

• Used the coefficients from my multivariate regression and the two most recent races for each horse

• Ranked horses by predicted race times

• My bets weren’t great, but they were better than choosing at random!

• Reason: Very low variance in race times among horses. The model does not have enough predictive power, even with R^2 > 0.5
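The prediction step can be sketched in Python by plugging each horse's values into the fitted equation and sorting by predicted time. The coefficients are the ones reported in the regression output; the feature names, the three horses, and their values are hypothetical. The sketch also assumes the omitted baseline categories are trotter, Good condition, and a half-mile track, which the slides do not state explicitly.

```python
# Coefficients as reported in the regression output. Dictionary keys are
# hypothetical feature names chosen for this sketch.
INTERCEPT = 104.67788
COEFFICIENTS = {
    "lag_final_time": 0.01412, "lag2_final_time": 0.11361,
    "pacer": -3.68185, "fast": -0.77005, "sloppy": 0.86942,
    "age": 0.05312, "track_5_8": -2.74052, "track_1": -3.18411,
    "fast_lag": 0.35883, "sloppy_lag": 0.48532,
    "fast_lag2": 0.09472, "sloppy_lag2": -0.39904,
    "track_5_8_lag": 0.14639, "track_1_lag": 0.40192,
    "track_5_8_lag2": 0.58564, "track_1_lag2": 0.67260,
}


def predict_time(features):
    """Predicted final time in seconds; missing features count as 0."""
    return INTERCEPT + sum(
        coef * features.get(name, 0.0) for name, coef in COEFFICIENTS.items()
    )


# Made-up pacers in one race on a Good half-mile track (assumed baseline).
horses = {
    "Horse A": {"lag_final_time": 118.0, "lag2_final_time": 119.0,
                "pacer": 1, "age": 5},
    "Horse B": {"lag_final_time": 117.0, "lag2_final_time": 117.0,
                "pacer": 1, "age": 4},
    "Horse C": {"lag_final_time": 119.0, "lag2_final_time": 120.0,
                "pacer": 1, "age": 7},
}

predictions = {name: predict_time(f) for name, f in horses.items()}
ranking = sorted(predictions, key=predictions.get)  # fastest predicted first
```

In this toy race the three predictions fall within about half a second of each other, which illustrates why small modeling errors are enough to scramble the predicted order.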

Page 15: Harness Racing and SAS

FINAL THOUGHTS

• SAS’s LAG and LAST features are great for dealing with longitudinal data

• Most of the work was in the DATA steps, not the PROC steps

• My model was based on only 960 records from 96 horses

• With more data, I might model Pacers and Trotters separately, and each track condition separately

• Still want to investigate lengths back for winning horses

• Learned much about SAS and about harness racing