![Page 1: 2016 Democrat Primary: Prediction of results For New York Counties](https://reader035.vdocuments.us/reader035/viewer/2022062522/589e59751a28ab16348b5085/html5/thumbnails/1.jpg)
2016 Democrat Primary: Prediction of results For New York Counties
Lavneet Sidhu | Nikita Bali | Sanjita Jain | Subhasree Chatterjee
OBJECTIVE
• Predict how many New York counties each Democratic candidate wins in the 2016 US presidential primary, based on demographic data from counties in other states.
• Find out whether there is a pattern in how people vote based on their demographic information.
DATA
Sources: Primary Results, County Facts
• 29 states
• 1928 counties
• 33 demographic variables
• Training set: 1542 counties (80 percent)
• Testing set: 386 counties (20 percent)
Variables include: population %, female %, ethnicity %, educational background, income, number of votes
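The 80/20 split above can be sketched as follows. This is a minimal illustration that assumes counties were shuffled uniformly at random before splitting; the seed, the integer county IDs, and the function name `split_counties` are hypothetical and not taken from the slides.

```python
import random


def split_counties(county_ids, train_fraction=0.8, seed=0):
    """Shuffle county IDs and split them into training and testing lists."""
    ids = list(county_ids)
    # Fixed seed (an assumption) so the split is reproducible
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


# Placeholder IDs standing in for the 1928 counties in the data set
train, test = split_counties(range(1928))
print(len(train), len(test))
```

With 1928 counties, truncating 80 percent to a whole number gives the 1542 / 386 split reported on the slide.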
Exploratory data analysis
[Figure: Distribution of votes between the Democratic candidates as per demographic information]
The exploratory data analysis was done using Python.
Correlation Matrix:
1. The highest correlation is between the percentage of the population that speaks a language other than English at home and the percentage that is Hispanic or Latino.
2. The percentage of the population not born in the US is also very highly correlated with both the percentage speaking a language other than English at home and the percentage that is Hispanic or Latino.
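A correlation matrix of this kind boils down to computing a Pearson coefficient for every pair of variables and looking for the largest ones. The sketch below shows the idea in plain Python; the column names and the five made-up county values are hypothetical and only illustrate the mechanics, not the actual data.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def most_correlated_pair(columns):
    """Return the pair of column names with the largest absolute correlation."""
    return max(
        combinations(columns, 2),
        key=lambda pair: abs(pearson(columns[pair[0]], columns[pair[1]])),
    )


# Made-up values for five counties (hypothetical, for illustration only)
toy = {
    "pct_hispanic": [2.0, 10.0, 25.0, 40.0, 5.0],
    "pct_non_english_home": [4.0, 12.0, 27.0, 45.0, 6.0],
    "pct_female": [50.1, 49.5, 51.0, 50.3, 49.8],
}
pair = most_correlated_pair(toy)
print(pair)
```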
Exploratory Data Analysis
[Figure panels: % African American vs. fraction of votes; % non-English speaking vs. fraction of votes; % of females vs. fraction of votes; % over 65 vs. fraction of votes]
Logistic Regression Model
Response Variable: Winner (Clinton = 0, Sanders = 1)
Variable Selection: Age, % females, % whites, % Afro-American, % native Indians, % Hispanic Latino, % foreign born, % education, % veterans, home ownership rate, median value of house, person per household, per capita income, % of Asian owned firms etc.
Model Selection:

| Metric | Full Model | Step AIC | Step BIC |
|---|---|---|---|
| AIC | 1086.87 | 1068.88 | 1076.96 |
| BIC | 1268.46 | 1181.05 | 1157.08 |
| AUC (training) | 0.918 | 0.917 | 0.913 |
| AUC (testing) | 0.794 | 0.760 | 0.771 |
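As a consistency check on the model selection figures, the gap between BIC and AIC depends only on the number of fitted parameters k and the sample size n. The sketch below assumes the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, and assumes the models were fit on the 1542 training counties; neither assumption is stated on the slides, and `implied_num_params` is a hypothetical helper name.

```python
import math


def implied_num_params(aic, bic, n):
    """Solve BIC - AIC = k * (ln(n) - 2) for k, the number of parameters."""
    return (bic - aic) / (math.log(n) - 2)


N_TRAIN = 1542  # assumed sample size: the training set

# (AIC, BIC) pairs as reported for each model
reported = {
    "Full Model": (1086.87, 1268.46),
    "Step AIC": (1068.88, 1181.05),
    "Step BIC": (1076.96, 1157.08),
}

implied = {
    name: implied_num_params(aic, bic, N_TRAIN)
    for name, (aic, bic) in reported.items()
}
for name, k in implied.items():
    print(f"{name}: about {k:.1f} parameters")
```

Under these assumptions the full model comes out at roughly 34 parameters, which would match 33 demographic variables plus an intercept, and the two stepwise models at roughly 21 and 15 parameters.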
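ROC and Misclassification Rate
[Figures: Training ROC; Testing ROC]

Training confusion matrix (Misclassification Rate: 0.165)

| | Clinton | Sanders |
|---|---|---|
| Clinton | 920 | 131 |
| Sanders | 124 | 367 |

Testing confusion matrix (Misclassification Rate: 0.184)

| | Clinton | Sanders |
|---|---|---|
| Clinton | 226 | 36 |
| Sanders | 35 | 89 |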
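The two summary measures on this slide can be reproduced from first principles. The sketch below computes the misclassification rate from the reported confusion matrices and shows one common way to compute AUC, as the fraction of (Sanders, Clinton) county pairs that the model scores in the right order. The function names and the four toy scores are hypothetical; the slides do not give per-county scores.

```python
def misclassification_rate(matrix):
    """Share of off-diagonal counts in a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total


def auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count half)."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Confusion matrices as reported on the slide
training = [[920, 131], [124, 367]]
testing = [[226, 36], [35, 89]]
train_rate = misclassification_rate(training)
test_rate = misclassification_rate(testing)
print(round(train_rate, 3), round(test_rate, 3))

# Toy labels (Sanders = 1) and predicted probabilities, made up for illustration
toy_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)
```

The misclassification rate does not depend on which axis of the matrix holds the actual winners, so it can be checked without knowing the slide's layout.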
Random Forest
• Type: Classification
• 1000 trees
• 5 variables tried at each split
• OOB estimate of error rate: 17.3%

Confusion Matrix

| | Clinton | Sanders | Class.Error |
|---|---|---|---|
| Clinton | 1169 | 136 | 0.104 |
| Sanders | 197 | 426 | 0.316 |

[Figure: Importance]
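The per-class errors in the confusion matrix follow directly from the counts. The sketch below assumes each row holds the counties actually won by that candidate and each column the predicted winner, which is the layout that reproduces the reported Class.Error values; `class_errors` and `overall_error` are hypothetical helper names.

```python
def class_errors(matrix, labels):
    """Per-class error: misclassified share of each row (rows assumed to be actual classes)."""
    errors = {}
    for i, (label, row) in enumerate(zip(labels, matrix)):
        errors[label] = (sum(row) - row[i]) / sum(row)
    return errors


def overall_error(matrix):
    """Overall error: off-diagonal counts divided by all counts."""
    total = sum(sum(row) for row in matrix)
    wrong = total - sum(matrix[i][i] for i in range(len(matrix)))
    return wrong / total


# Counts as reported on the slide
rf_matrix = [[1169, 136], [197, 426]]
errors = class_errors(rf_matrix, ["Clinton", "Sanders"])
oob_error = overall_error(rf_matrix)
print(errors, round(oob_error, 3))
```

The Sanders row has roughly three times the error of the Clinton row, so the forest misses Sanders counties far more often than Clinton counties even though the overall error is about 17%.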
Principal Component Analysis Regression
The demographic data is standardized before principal component analysis, which yields 33 uncorrelated components. We retain 8 of the 33 components for further analysis, as they explain 80% of the variance in the data.
[Figures: Importance of components; ROC (Testing)]
Testing AUC = 86%
Testing Error = 22%

Confusion Matrix (Testing)

| | Clinton | Sanders |
|---|---|---|
| Clinton | 1129 | 176 |
| Sanders | 256 | 367 |
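Choosing how many principal components to keep from a variance threshold can be sketched as below. The function takes the component variances (eigenvalues of the correlation matrix when the data is standardized) and returns the smallest number of leading components whose cumulative share reaches the target. The function name and the toy eigenvalues are hypothetical; the actual eigenvalues are not given on the slides.

```python
def components_for_variance(eigenvalues, target=0.80):
    """Smallest number of leading components whose variance share reaches target."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return count
    return len(ordered)


# Toy eigenvalues (made up): cumulative shares are 0.4, 0.7, 0.9, 1.0
toy_eigenvalues = [4.0, 3.0, 2.0, 1.0]
needed = components_for_variance(toy_eigenvalues)
print(needed)
```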
Model Validation
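Counties won by Clinton (C) and Sanders (S): actual results and each model's predictions

| State | Counties | Actual (C / S) | Logistic Model (C / S) | Random Forest (C / S) | PCA Regression (C / S) |
|---|---|---|---|---|---|
| Washington | 39 | 0 / 39 | 5 / 34 | 4 / 35 | 2 / 37 |
| Hawaii | 5 | 0 / 5 | 2 / 3 | 0 / 5 | 0 / 5 |
| Alaska | 29 | 0 / 29 | 2 / 27 | 1 / 28 | 0 / 29 |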
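Because the slide shows Sanders winning every county in all three validation states, each model's share of correctly called counties is simply the share it assigned to Sanders. The sketch below tallies this from the predicted counts as read from the slide; the assignment of each number pair to a model and state is my reading of the slide layout, and the names used here are hypothetical.

```python
# Predicted (Clinton, Sanders) county counts per state, as read from the slide
predicted = {
    "Logistic Model": {"Washington": (5, 34), "Hawaii": (2, 3), "Alaska": (2, 27)},
    "Random Forest": {"Washington": (4, 35), "Hawaii": (0, 5), "Alaska": (1, 28)},
    "PCA Regression": {"Washington": (2, 37), "Hawaii": (0, 5), "Alaska": (0, 29)},
}


def share_called_for_sanders(by_state):
    """Share of all validation counties that a model assigned to Sanders."""
    sanders = sum(s for _, s in by_state.values())
    total = sum(c + s for c, s in by_state.values())
    return sanders / total


shares = {model: share_called_for_sanders(states) for model, states in predicted.items()}
for model, share in shares.items():
    print(f"{model}: {share:.1%} of 73 counties called for Sanders")
```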
Factor Analysis
The purpose of factor analysis is to find unobserved variables (factors) that are fewer in number than the observed variables and uncorrelated with one another.
Using those factors, we should be able to differentiate the voting pattern for the Democratic candidates based on the demographic data of the county.
We tried factor analysis at the following levels:
1. County demographic data
2. State demographic data
3. Winner-wise demographic data
Factor Analysis (Cont’d)
We got 2 factors for the state and county demographic data:
• The 1st factor describes ethnicity information.
• The 2nd factor is based on population and industrial exposure.
State: All states seem to exhibit similar behavior except Hawaii, Alaska & District of Columbia.
County: All counties seem to exhibit similar behavior.
Factor Analysis (Cont’d)
We got 3 factors for the winner-based demographic data.
• Factor 1 concentrates on the population and the median income of the county.
• Factor 2 can be interpreted as the Hispanic and non-native American population.
• Factor 3 can be interpreted as economic prosperity and the white/black population of the county.
Clinton gets the majority of the votes in counties where median income is higher and where the non-native and Hispanic American populations are larger.
NEW YORK RESULTS
New York: 62 counties

| | Clinton (C) | Sanders (S) |
|---|---|---|
| Actual | 13 | 49 |
| Logistic Model (predicted) | 25 | 37 |
| Random Forest (predicted) | 27 | 35 |
| PCA Regression (predicted) | 6 | 56 |
CONCLUSION
Hillary Clinton seems to be favored in counties where:
• Median income is higher
• The percentage of Hispanic and African American population is higher
People who vote for Sanders are mostly White.
Similar results were obtained from the different modeling techniques.
Thank You