
Page 1:

IR 514: Multivariate Analysis
Lecture 3

Benjamin Graham

IR 514: Towards Regression Benjamin Graham

Friday, February 1, 13

Page 2:

Nuts and Bolts

• For more on logged variables, see pp. 159-166 in Gujarati and Porter.
  • See especially the last paragraph on p. 166.
• For next week, the readings are:
  • Review KKV Chapter 1, and the Gujarati and Porter Introduction and Chapter 1
  • I will e-mail out a .pdf of KKV
• The big picture:
  • Data management is frontloaded in the course
  • Practice analysis using data you’ll use later

Page 3:

Today’s Plan

• Talk about the homework
• Weak Law of Large Numbers and the CLT
• Reviewing some terminology
• Hypothesis testing
• Some intuition on linear regression

Page 4:

Grading for your Homework

• The homework assignments are cumulatively pass/fail
• I require a good-faith effort on all homework
• Gloria will post an example .do file
  • There is more than one correct way to write the code
  • There is more than one correct answer for substantive questions
• I won’t grade unless I have to
  • So it’s your responsibility to get the feedback you need
  • I will always give you an opportunity in class
  • Also office hours

Page 5:

Homework 1

• How’d it go? Any questions?
• Which countries ended up being dropped from the WDI data? Is that a problem?
• How rich are countries in 2010?
• Content validity vs. reliability
• Proxy measures

Page 6:

Homework 1: Part 3

• What other datasets did you download? Problems, challenges?
• Let’s say you wanted to use this data for a paper. What are the next steps?
  • Merging with other data, cleaning data
• Read Vreeland’s Goldilocks story (and skim his JCR article)
  • Link is on my website

Page 7:

Weak Law of Large Numbers

• I’m ducking the math on this, but...
• If you take a random sample of observations from a population...
  • The mean of the sample approaches the mean of the population as the sample size approaches infinity.
• This is great, but we have finite samples.
  • So how close is the mean of our sample likely to be to the mean of our population?
  • THIS is why we need the Central Limit Theorem.
• The CLT tells us that our sample mean is a draw from an approximately normal distribution of sample means (even if the population distribution is not normal).
• We also know the size of our sample (n) and the variance of our sample.
  • So we can use t-tests to estimate confidence intervals around our estimate of the population mean.
  • One-sample t-test
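The course does this in Stata, but the WLLN and CLT are easy to see in a quick simulation. A Python sketch with an invented, deliberately skewed population: repeated sample means cluster around the population mean, with spread close to the standard error sd/sqrt(n).

```python
import math
import random
import statistics

random.seed(42)

# A deliberately non-normal (right-skewed) population with mean near 1.0.
population = [random.expovariate(1.0) for _ in range(100_000)]

# Draw many samples of size n and record each sample mean.
n = 50
sample_means = [statistics.mean(random.sample(population, n))
                for _ in range(2_000)]

# WLLN: the sample means cluster around the population mean.
print(statistics.mean(sample_means))

# CLT: their spread is close to the standard error, sd/sqrt(n),
# even though the population itself is far from normal.
standard_error = statistics.stdev(population) / math.sqrt(n)
print(statistics.stdev(sample_means), standard_error)
```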

Page 8:

Why is the Central Limit Theorem Important? (2)

• When we run a difference-of-means test, we’re comparing the means of two samples drawn from two populations.
• We know that even if the population distribution is wacky, the distribution of sample means is normal.
• For each sample we can calculate n and the sd.
• Once we can assume that the distribution of sample means is normal, we can use t-tests and other techniques to test hypotheses that specify which of two means we expect to be larger.
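A sketch of the mechanics (Python rather than Stata, with made-up samples): a Welch difference-of-means t statistic, plus a normal approximation to the two-tailed p-value, which the CLT justifies for samples this large.

```python
import math
import random
import statistics

random.seed(7)

# Made-up samples from two populations with genuinely different means.
group_a = [random.gauss(50, 10) for _ in range(200)]
group_b = [random.gauss(60, 12) for _ in range(200)]

def welch_t(x, y):
    """Difference-of-means t statistic allowing unequal variances (Welch)."""
    se = math.sqrt(statistics.variance(x) / len(x) +
                   statistics.variance(y) / len(y))
    return (statistics.mean(x) - statistics.mean(y)) / se

t = welch_t(group_a, group_b)

# With n = 200 per group, the t distribution is close to normal, so a
# normal approximation gives a serviceable two-tailed p-value.
p_two_tailed = 2 * statistics.NormalDist().cdf(-abs(t))
print(t, p_two_tailed)
```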

Page 9:

Quick note before moving to Stata

• To estimate the probability density function (pdf) of our data, we usually use a histogram or a kernel density plot.
• A kernel density plot is just a bunch of “kernel” distributions added up.
  • We usually use normal distributions as the kernels.
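"A bunch of kernel distributions added up" can be written down directly. A minimal Python sketch with toy data, normal kernels, and a hand-picked bandwidth (Stata's kdensity does the same thing with a smarter automatic bandwidth):

```python
import statistics

# Toy sample with a cluster near 2 and another near 4 (invented numbers).
data = [1.0, 1.5, 2.0, 2.2, 4.0, 4.1, 4.5]
bandwidth = 0.5  # assumed; real software chooses this automatically

def kde(x):
    """Kernel density estimate at x: the average of normal 'bumps',
    one centered on each observation."""
    kernels = [statistics.NormalDist(mu=obs, sigma=bandwidth) for obs in data]
    return sum(k.pdf(x) for k in kernels) / len(kernels)

# Density is higher inside the cluster near 2 than in the gap near 3.
print(kde(2.0), kde(3.0))
```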

Page 10:

More Data Management

• Cleaning data
  • Starts with the codebook
  • -999 codes for missing data in DPI
  • -66, -77, -88 all have meaning in Polity IV
• Catching typos/coding errors
  • tabulate ordinal measures
  • look for outliers in continuous variables
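If sentinel codes like these are left in as numbers, they poison every mean and regression. In Stata the recode is roughly `mvdecode polity, mv(-66 -77 -88)`; here is the same idea as a plain-Python sketch with invented rows:

```python
# Polity IV special codes: -66 interruption, -77 interregnum, -88 transition;
# DPI uses -999 for missing. None of these are real scores.
SENTINELS = {-66, -77, -88, -999}

rows = [
    {"gwno": 2, "year": 2010, "polity": 10},
    {"gwno": 700, "year": 2010, "polity": -66},  # code, not a score
    {"gwno": 625, "year": 2010, "polity": -88},  # code, not a score
]

for row in rows:
    if row["polity"] in SENTINELS:
        row["polity"] = None  # missing, so it can't sneak into an average

valid = [r["polity"] for r in rows if r["polity"] is not None]
print(valid)  # → [10]
```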

Page 11:

Merging Datasets Together

• Data first needs to be in the same format
  • Usually we merge country-year data
  • dyads as units
  • quarterly data
  • MIRPS data
  • cross-sectional data
• Sometimes we merge cross-sectional and TSCS data together
  • merge by gwno vs. merge by gwno year
  • Example: Migrant stocks
• Shared unique identifiers for observations
  • duplicates tag gwno year, gen(dup)
• Data must be sorted
  • sort gwno year
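What `duplicates tag gwno year, gen(dup)` computes is: for each observation, the number of other rows sharing its gwno-year key (0 means the key is unique). A pure-Python sketch of that logic with invented rows:

```python
from collections import Counter

rows = [
    {"gwno": 2, "year": 2010},
    {"gwno": 2, "year": 2011},
    {"gwno": 2, "year": 2011},  # accidental duplicate country-year
]

# Count how many rows share each (gwno, year) key.
counts = Counter((r["gwno"], r["year"]) for r in rows)

for r in rows:
    # dup = number of OTHER rows with this key, matching Stata's convention.
    r["dup"] = counts[(r["gwno"], r["year"])] - 1

print([r["dup"] for r in rows])  # → [0, 1, 1]
```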

Page 12:

Merging Datasets Together (2)

• We rarely have perfect overlap
  • The _merge variable tells us what our overlap looks like
  • Always tab _merge and check out which cases don’t overlap
  • Don’t forget to drop _merge after you look at it
• When datasets don’t overlap, observations are dropped from the analysis
  • Becomes missing data
  • May cause bias
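A miniature of what merge does under the hood, in plain Python with invented rows: combine two datasets keyed on (gwno, year) and build the _merge flag (1 = master only, 2 = using only, 3 = matched) that you would then tab.

```python
# Two tiny datasets keyed on (gwno, year); the values are made up.
wdi = {
    (2, 2010): {"gdppc": 48_466},
    (20, 2010): {"gdppc": 47_465},
}
polity = {
    (2, 2010): {"polity2": 10},
    (710, 2010): {"polity2": -7},
}

merged = {}
for key in wdi.keys() | polity.keys():
    # Stata's convention: 1 = master only, 2 = using only, 3 = matched.
    flag = 3 if key in wdi and key in polity else (1 if key in wdi else 2)
    row = {"_merge": flag}
    row.update(wdi.get(key, {}))
    row.update(polity.get(key, {}))
    merged[key] = row

# Like `tab _merge`: country-years that appear in only one dataset stand out.
print(sorted((k, v["_merge"]) for k, v in merged.items()))
```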

Page 13:

From t-tests to regression

• t-tests are great when...
  • We have a binary treatment variable
  • And only one treatment variable
• (move to whiteboard)
• Regression allows us to handle continuous treatment variables
• Regression also allows us to handle multiple treatment variables

SOCI 60: Lecture 24 Benjamin Graham

Page 14:

Regression analysis

Page 15:

Ordinary Least Squares Regression (OLS)

• OLS is a form of linear regression, so we’re assuming that the relationship between the two variables is a straight line.
• Usually we want to know: is the slope positive or negative?
  • Is the relationship (correlation) positive or negative?
• Least squares: we want to draw a line that minimizes the sum of squared distances between the data points in our sample and the line.
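The least-squares line has a closed form: the slope is cov(x, y)/var(x), and the intercept forces the line through the point of means. A Python sketch with invented democracy scores and GDP figures (not the course data):

```python
# Invented data: democracy scores (x) and GDP per capita (y).
x = [0, 2, 4, 6, 8, 10]
y = [1900, 2700, 4100, 4600, 5900, 6800]

xbar = sum(x) / len(x)
ybar = sum(y) / len(y)

# Slope: sample covariance of x and y divided by the variance of x.
beta = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
        / sum((xi - xbar) ** 2 for xi in x))

# Intercept: puts the line through the point of means (xbar, ybar).
alpha = ybar - beta * xbar

print(round(beta, 2), round(alpha, 2))  # → 494.29 1861.9
```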

Page 16:

Regression Analysis

• What does this look like as a regression expression?
• Remember y = mx + b as an equation for a line?
• GDP per capita is the “y”, or dependent variable
• alpha is the constant term (the b), which is the y-intercept
  • This is the value of GDP per capita we expect if democracy = 0
• beta is the slope of the line (the m)
  • This is the change in GDP per capita associated with a one-unit change in democracy
• Democracy is the “x” term, or our explanatory variable


GDP per capita = α + β·Democracy + ε

Page 17:

Regression analysis

Page 18:

Regression analysis

Page 19:

Regression analysis

Page 20:

Regression Analysis

• The best possible estimates:
  • Slope: 487.35
  • Y-intercept: 1860.98

Page 21:

Regression analysis

Page 22:

Regression Analysis

• y-hat is our “predicted value”
  • This is the point on the regression line for any given value of x.
  • This is the expected value of Y, given X.
• (see whiteboard for expected values)
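Using the slope (487.35) and intercept (1860.98) from the earlier slide, computing y-hat is just plugging an x value into the fitted line; a Python sketch (Stata's `predict` does this for every observation):

```python
# The slide's OLS estimates: y-hat = alpha + beta * x.
alpha, beta = 1860.98, 487.35

def y_hat(democracy_score):
    """Expected GDP per capita for a given democracy score."""
    return alpha + beta * democracy_score

# Fitted values at democracy = 0 (just the intercept) and democracy = 10.
print(y_hat(0), y_hat(10))
```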

Page 23:

Important misc concepts

• What’s the difference between a one-tailed test and a two-tailed test?

Page 24:

One-tailed vs. two-tailed tests

• Two-tailed hypotheses:
  • The mean of our sample is different from zero
  • The slope of our regression line is different from zero
• One-tailed hypotheses:
  • The mean of our sample is greater than zero
  • The slope of our regression line is greater than zero
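The only mechanical difference is whether the rejection probability sits in one tail or is split across both. A Python sketch with an invented sample, using a normal approximation to the t distribution (reasonable at n = 100):

```python
import math
import statistics

# Invented sample with mean 1.0, repeated to give n = 100.
sample = [0.8, 1.2, 0.4, 1.9, 0.7, 1.1, 1.5, 0.2, 1.3, 0.9] * 10
n = len(sample)

# One-sample t statistic for H0: population mean = 0.
t = statistics.mean(sample) / (statistics.stdev(sample) / math.sqrt(n))

z = statistics.NormalDist()
p_one_tailed = z.cdf(-t)           # H1: mean > 0 (one tail)
p_two_tailed = 2 * z.cdf(-abs(t))  # H1: mean != 0 (both tails)
print(p_one_tailed, p_two_tailed)
```

For a positive t, the two-tailed p-value is exactly twice the one-tailed one, which is why a one-tailed test is easier to "pass" when the direction is specified in advance.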

Page 25:

Basic Notation

• Y: the dependent variable
• X: explanatory or independent variables
• u: the error term of the regression
• Xk: the kth explanatory variable
• i or t: indexes an observation or value
  • Xki is the ith observation on variable Xk
• N (or T): total observations in a population
• n: total observations in a sample
• i: the cross-sectional aspect of the data
• t: the time-series aspect of the data
