



Polling Data Empirical Error Modeling

Acknowledgments

We would like to express our gratitude to the Columbia Data Science Institute for giving us the opportunity to work on this capstone project. The project was supported by Dr. David Rothschild from Microsoft Research and Professor Andrew Gelman from Columbia University. We thank Tobias Konitzer, C.S.O. and Co-founder of PredictWise, who provided insight and expertise that greatly assisted the project.

Yingxin Zhang, Jiaying Zhang, Jingyu Ren, Qinwei Zhao

David Rothschild (Industry Mentor)

Andrew Gelman (Faculty Mentor)

Project Description

Public opinion polls have long been used to predict election results. Traditional probability-based sampling methods, such as random-digit dialing, are slow and expensive. By comparison, non-probability-based sampling methods have proven to be a faster, cheaper, and perhaps more accurate way to collect public opinion data. Building on the finding that total survey error in election polls has long been understated, the goal of this project is to find the smallest sample size that minimizes total survey error, thereby yielding a cheaper, more accurate, and more efficient polling method.

References

Houshmand Shirani-Mehr, David Rothschild, Sharad Goel & Andrew Gelman (2018). Disentangling Bias and Variance in Election Polls. Journal of the American Statistical Association, 113(552), 607-614. DOI: 10.1080/01621459.2018.1448823

Data and Methodology

The data we used was a population-level dataset drawn from a voter file and census records, provided by Microsoft Research. It had a total of 9,000+ rows and contained demographic information such as party preference, gender, age, race, education, etc. Our methodology is to fit different classification models on samples of different sizes drawn by simple random sampling. The desired outcome is the smallest sample size at which the sample coefficients converge to the population coefficients.
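The sampling-and-convergence procedure can be sketched as follows. Everything here is an illustrative stand-in, not the project's actual voter-file data or models: the two-feature simulated population, the assumed coefficients, and the plain gradient-descent logistic fit are all assumptions made for the sketch.

```python
import math
import random

random.seed(0)

# Hypothetical stand-in for the voter-file population: each voter has two
# standardized demographic features (say, age and education) and a binary
# party preference simulated from assumed coefficients.
N = 9000
TRUE_W = [1.2, -0.8]
population = []
for _ in range(N):
    x = [random.gauss(0, 1), random.gauss(0, 1)]
    p = 1 / (1 + math.exp(-(TRUE_W[0] * x[0] + TRUE_W[1] * x[1])))
    population.append((x, 1 if random.random() < p else 0))

def fit_logistic(data, lr=0.5, epochs=300):
    """Logistic regression (no intercept) via batch gradient descent."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        grad = [0.0, 0.0]
        for x, y in data:
            p = 1 / (1 + math.exp(-(w[0] * x[0] + w[1] * x[1])))
            grad[0] += (p - y) * x[0]
            grad[1] += (p - y) * x[1]
        w[0] -= lr * grad[0] / len(data)
        w[1] -= lr * grad[1] / len(data)
    return w

w_pop = fit_logistic(population)  # population-level coefficients

# Draw simple random samples of increasing size; the sample coefficients
# should approach the population-level fit as the sample grows.
gaps = {}
for n in (100, 300, 600):
    w_n = fit_logistic(random.sample(population, n))
    gaps[n] = max(abs(a - b) for a, b in zip(w_n, w_pop))
    print(f"n={n:4d}  w={[round(v, 2) for v in w_n]}  max gap={gaps[n]:.3f}")
```

The same loop structure generalizes to any classifier that exposes coefficients: refit on progressively larger simple random samples and stop once the gap to the population-level fit is acceptably small.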

Figure 1. Relationships between parties' voting counts and voters' age and education background (left); individual relationships between total voting counts and gender, party, race, education, age, and urbanicity (right).

Conclusions

This experiment shows that with a much smaller sample size, we can still achieve the same prediction accuracy as a population-level prediction. In this way, we can reduce the data collected by 92% and significantly lower both cost and time. To further improve prediction accuracy and lower cost, future work can be done on designing the polling questions themselves.

Results

For each sample size, we ran the following models 100 times: Random Forest, SGD Classifier, KNN, SVM Classifier, and XGBoost. The graphs below show the average output coefficients for the SVM Classifier, which had the highest prediction accuracy on the hold-out set, along with the corresponding 95% confidence intervals. The horizontal line near 0 represents the coefficient value for the entire dataset. As the graphs show, the sample coefficients begin to converge to the population coefficients at sample sizes of roughly 500-600.
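The mean-and-interval summaries plotted in the figure can be computed with a simple normal-approximation confidence interval over the repeated fits. In this sketch, random draws stand in for one coefficient refit 100 times on independent samples, so the specific numbers are illustrative assumptions only:

```python
import random
import statistics

random.seed(1)

def mean_ci(values, z=1.96):
    """Mean and normal-approximation 95% confidence interval."""
    m = statistics.mean(values)
    se = statistics.stdev(values) / len(values) ** 0.5
    return m, (m - z * se, m + z * se)

# Stand-in for one coefficient refit on 100 independent simple random
# samples of the same size (here: noisy draws around a small true value).
coef_draws = [random.gauss(0.05, 0.3) for _ in range(100)]
mean, (lo, hi) = mean_ci(coef_draws)
print(f"mean={mean:.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```

Repeating this per sample size and per feature yields the points and error bars in the coefficient plots.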

Figure 2. Mean coefficient plots for the SVM Classifier with 95% confidence intervals.

Data Science Capstone Project with Microsoft Research