IR 514: Multivariate Analysis, Lecture 3
Benjamin Graham
IR 514: Towards Regression Benjamin Graham
Friday, February 1, 13
Nuts and Bolts
• For more on logged variables, see pp. 159-166 in Gujarati and Porter.
• See especially the last paragraph on p. 166.
• For next week, the readings are:
• Review KKV Chapter 1; Gujarati and Porter, Introduction and Chapter 1
• I will e-mail out a .pdf of KKV.
• The big picture:
• Data management is frontloaded in the course
• Practice analysis using data you’ll use later
Today’s Plan
• Talk about the homework
• Weak Law of Large Numbers and the CLT
• Reviewing some terminology
• Hypothesis testing
• Some intuition on linear regression
Grading for your Homework
• The homework assignments are cumulatively pass/fail
• I require a good-faith effort on all homework
• Gloria will post an example .do file
• There is more than one correct way to write the code
• There is more than one correct answer for the substantive questions
• I won’t grade unless I have to
• So it’s your responsibility to get the feedback you need
• I will always give you an opportunity in class
• Also office hours
Homework 1
• How’d it go? Any questions?
• Which countries ended up being dropped from the WDI data? Is that a problem?
• How rich are countries in 2010?
• Content validity vs. reliability
• Proxy measures
Homework 1: Part 3
• What other datasets did you download? Problems, challenges?
• Let’s say you wanted to use this data for a paper. What are the next steps?
• Merging with other data, cleaning data
• Read Vreeland’s Goldilocks story (and skim his JCR article)
• The link is on my website
Weak Law of Large Numbers
• I’m ducking the math on this, but...
• If you take a random sample of observations from a population...
• The mean of the sample approaches the mean of the population as the sample size approaches infinity
• This is great, but we have finite samples.
• So how close is the mean of our sample likely to be to the mean of our population?
• THIS is why we need the Central Limit Theorem.
• The CLT tells us that our sample mean is part of a normal distribution (even if the population distribution is not normal).
• We also know the size of our sample (n) and the variance of our sample.
• So we can use t-tests to estimate confidence intervals around our estimate of the population mean.
• One-sample t-test
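The course does this in Stata, but the logic can be sketched in Python as an illustrative analogue, using hypothetical data drawn from a deliberately non-normal (exponential) population: even so, the CLT lets us build a t-based confidence interval around the sample mean.

```python
import numpy as np
from scipy import stats

# Hypothetical sample: 25 draws from a skewed, non-normal population
rng = np.random.default_rng(514)
sample = rng.exponential(scale=2.0, size=25)

# One-sample t-test of H0: the population mean equals 2.0
t_stat, p_value = stats.ttest_1samp(sample, popmean=2.0)

# 95% confidence interval around the sample mean, using the t distribution
n = len(sample)
se = sample.std(ddof=1) / np.sqrt(n)
ci = stats.t.interval(0.95, df=n - 1, loc=sample.mean(), scale=se)

print(t_stat, p_value, ci)
```

The interval here relies on the CLT, not on the shape of the underlying population.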
Why Is the Central Limit Theorem Important? (2)
• When we run a difference-of-means test, we’re comparing the means of two samples drawn from two populations.
• We know that even if the population distribution is wacky, the distribution of sample means is normal
• For each sample we can calculate n and the sd.
• Once we can assume that the distribution of sample means is normal, we can use t-tests and other techniques to test hypotheses that specify which of two means we expect to be larger.
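A difference-of-means test can be sketched in Python (again as an analogue to the Stata workflow) with two hypothetical samples; Welch's version of the two-sample t-test is used here because it does not assume the two populations have equal variances.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Two hypothetical samples drawn from skewed, non-normal populations
group_a = rng.exponential(scale=2.0, size=40)
group_b = rng.exponential(scale=3.0, size=40)

# Welch's two-sample t-test of H0: the two population means are equal
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(t_stat, p_value)
```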
Quick Note Before Moving to Stata
• To estimate the probability density function (pdf) of our data, we usually use a histogram or a kernel density plot.
• A kernel density plot is just a bunch of “kernel” distributions added up.
• We usually use normal distributions as the kernels
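The "added up" idea can be made literal in a short Python sketch (hypothetical data, arbitrary bandwidth): place one normal pdf on each observation and average them, which is exactly a kernel density estimate with normal kernels.

```python
import numpy as np
from scipy import stats

data = np.array([1.0, 2.0, 2.5, 4.0, 7.0])  # hypothetical sample
bandwidth = 0.8  # smoothing parameter (chosen for illustration, not optimized)

grid = np.linspace(-1, 10, 200)

# One normal pdf centered on each observation, averaged together:
# this IS the kernel density estimate
manual_kde = np.mean(
    [stats.norm.pdf(grid, loc=x, scale=bandwidth) for x in data], axis=0
)

# Like any pdf, it integrates to (approximately) 1 over the grid
area = np.sum(manual_kde) * (grid[1] - grid[0])
print(round(area, 3))
```

A wider bandwidth gives a smoother curve; a narrower one hugs the individual data points.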
More Data Management
• Cleaning data
• Starts with the codebook
• -999 codes for missing data in DPI
• -66, -77, and -88 all have meaning in Polity IV
• Catching typos/coding errors
• Tabulate ordinal measures
• Look for outliers in continuous variables
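As a pandas analogue to the Stata cleaning step (the column names and values below are hypothetical, though the missing-data codes are the real DPI and Polity IV conventions): recode the numeric missing codes to true missing values before any analysis, then tabulate and describe to catch coding errors.

```python
import numpy as np
import pandas as pd

# Hypothetical country-year rows using DPI- and Polity-style codes
df = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2000, 2001, 2000, 2001],
    "checks": [3, -999, 2, 4],      # DPI: -999 means missing
    "polity2": [8, 7, -77, 6],      # Polity IV: -66/-77/-88 are special codes
})

# Recode numeric missing codes to real missing values; left as-is,
# they silently distort means, regressions, etc.
df["checks"] = df["checks"].replace(-999, np.nan)
df["polity2"] = df["polity2"].replace([-66, -77, -88], np.nan)

# Tabulate ordinal measures (like Stata's -tabulate-), eyeball continuous ones
print(df["polity2"].value_counts(dropna=False))
print(df["checks"].describe())
```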
Merging Datasets Together
• Data first needs to be in the same format
• Usually we merge country-year data
• dyads as units
• quarterly data
• MIRPS data
• cross-sectional data
• Sometimes we merge cross-sectional and TSCS data together
• merge by gwno vs. merge by gwno year
• Example: migrant stocks
• Observations need shared unique identifiers
• duplicates tag gwno year, gen(dup)
• Data must be sorted
• sort gwno year
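The same merge workflow can be sketched with pandas (hypothetical panels keyed on `gwno` and `year`); the duplicate check mirrors Stata's `duplicates tag`, and sorting mirrors Stata's `sort`, though pandas itself does not require sorted keys to merge.

```python
import pandas as pd

# Two hypothetical country-year panels keyed on (gwno, year)
gdp = pd.DataFrame({"gwno": [2, 2, 20], "year": [2000, 2001, 2000],
                    "gdp_pc": [45000, 46000, 38000]})
dem = pd.DataFrame({"gwno": [2, 2, 20], "year": [2000, 2001, 2001],
                    "polity2": [10, 10, 9]})

# Analogue of `duplicates tag gwno year, gen(dup)`: flag non-unique keys
gdp["dup"] = gdp.duplicated(subset=["gwno", "year"], keep=False).astype(int)
assert gdp["dup"].sum() == 0  # identifiers are unique, safe to merge

# Merge on the shared identifiers, then sort (analogue of `sort gwno year`)
merged = (gdp.merge(dem, on=["gwno", "year"], how="outer")
             .sort_values(["gwno", "year"]))
print(merged)
```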
Merging Datasets Together (2)
• We rarely have perfect overlap
• The _merge variable tells us what our overlap looks like
• Always tab _merge and check out which cases don’t overlap
• Don’t forget to drop _merge after you look at it
• When datasets don’t overlap, observations are dropped from the analysis
• This becomes missing data
• It may cause bias
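pandas has a direct analogue of Stata's `_merge`: passing `indicator=True` to `merge` adds a `_merge` column coded `both`, `left_only`, or `right_only`. A sketch with hypothetical data:

```python
import pandas as pd

left = pd.DataFrame({"gwno": [2, 20, 40], "gdp_pc": [45000, 38000, 9000]})
right = pd.DataFrame({"gwno": [2, 20, 70], "polity2": [10, 9, -3]})

# indicator=True creates a _merge column, as in Stata
merged = left.merge(right, on="gwno", how="outer", indicator=True)

# Always tab _merge and inspect the cases that don't overlap
print(merged["_merge"].value_counts())
print(merged[merged["_merge"] != "both"])

# Drop the indicator once inspected, as with Stata's _merge
merged = merged.drop(columns="_merge")
```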
From t-tests to regression
• t-tests are great when...
• We have a binary treatment variable
• And only one treatment variable
• (move to whiteboard)
• Regression allows us to handle continuous treatment variables
• It also allows us to handle multiple treatment variables
SOCI 60: Lecture 24 Benjamin Graham
Regression analysis
Ordinary Least Squares Regression (OLS)
• OLS is a form of linear regression, so we’re assuming that the relationship between the two variables is a straight line.
• Usually we want to know: is the slope positive or negative?
• Is the relationship (correlation) positive or negative?
• Least squares: we want to draw the line that minimizes the squared distances between the data points in our sample and the line.
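The least-squares line has a simple closed form, sketched below in Python with hypothetical data: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means.

```python
import numpy as np

# Hypothetical data: treatment x and outcome y, roughly linear
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# OLS closed form: slope = cov(x, y) / var(x); intercept = ybar - slope * xbar
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# These coefficients minimize the sum of squared vertical distances
residuals = y - (intercept + slope * x)
print(slope, intercept, np.sum(residuals ** 2))
```

Any other slope/intercept pair would yield a larger sum of squared residuals on this sample.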
Regression Analysis
• What does this look like as a regression expression?
• Remember y = mx + b as the equation for a line?

GDP per capita = α + β*Democracy + ε

• GDP per capita is the “y”, or dependent variable
• α (alpha) is the constant term (the b), which is the y-intercept
• This is the value of GDP per capita we expect if democracy = 0
• β (beta) is the slope of the line (the m)
• This is the change in GDP per capita associated with a 1-unit change in democracy
• Democracy is the “x” term, or our explanatory variable
Regression Analysis
• The best possible estimates:
• Slope: 487.35
• Y-intercept: 1860.98
Regression Analysis
• y-hat is our “predicted value”
• This is the point on the regression line for any given value of x.
• This is the expected value of Y, given X.
• (see whiteboard for expected values)
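Plugging in the estimates from the earlier slide (intercept 1860.98, slope 487.35), y-hat is just the line evaluated at a given democracy score; a minimal Python sketch:

```python
# Estimates from the slides: alpha (intercept) and beta (slope)
alpha = 1860.98
beta = 487.35

def y_hat(democracy):
    """Predicted GDP per capita: the expected value of Y given X."""
    return alpha + beta * democracy

# At democracy = 0 the prediction is just the intercept;
# each 1-unit increase in democracy adds beta to the prediction.
print(y_hat(0))
print(y_hat(10))
```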
Important Misc. Concepts
• What’s the difference between a one-tailed test and a two-tailed test?
One-tailed vs. Two-tailed Tests
• Two-tailed hypotheses:
• The mean of our sample is different from zero
• The slope of our regression line is different from zero
• One-tailed hypotheses:
• The mean of our sample is greater than zero
• The slope of our regression line is greater than zero
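The mechanical difference shows up in the p-value, sketched here in Python with a hypothetical t statistic: a two-tailed test counts the probability mass in both tails of the t distribution, a one-tailed test only in the predicted direction.

```python
from scipy import stats

t_stat = 2.10   # hypothetical t statistic
df = 30         # hypothetical degrees of freedom

# Two-tailed: H1 says the quantity is *different from* zero (either direction)
p_two = 2 * stats.t.sf(abs(t_stat), df)

# One-tailed: H1 says the quantity is *greater than* zero (one direction only)
p_one = stats.t.sf(t_stat, df)

# For a t statistic in the predicted direction, the one-tailed p-value
# is exactly half the two-tailed p-value
print(p_two, p_one)
```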
Basic Notation
• Y: the dependent variable
• X: the explanatory or independent variables
• u: the error of the regression
• Xk: the kth explanatory variable
• i or t: the observation or value
• Xki is the ith observation on variable Xk
• N (or T): total observations in a population
• n: total observations in the sample
• i: the cross-sectional aspect of the data
• t: the time-series aspect of the data