Sociology 690 – Data Analysis Simple Quantitative Data Analysis

Upload: mayda

Post on 10-Jan-2016


Page 1: Sociology 690 – Data Analysis

Sociology 690 – Data Analysis

Simple Quantitative Data Analysis

Page 2: Sociology 690 – Data Analysis

Four Issues in Describing Quantity

1. Grouping/Graphing Quantitative Data

2. Describing Central Tendency

3. Describing Variation

4. Describing Co-variation

Page 3: Sociology 690 – Data Analysis

1. Grouping Quantitative Data

Intervals and Real Limits

Widths and midpoints

Graphing grouped data

If there are a large number of quantitative scores, one would not simply create a raw score frequency distribution, as that would contain too many unique scores and, therefore, not fulfill the data reduction goal.

Page 4: Sociology 690 – Data Analysis

Grouping Data - Intervals

To group quantitative data, three rules are followed:

– 1. Make the intervals no wider than the maximum amount of information you are willing to lose.

– 2. Make the intervals in multiples of five.

– 3. Make the distribution intervals few enough to be internalized at a glance.

Page 5: Sociology 690 – Data Analysis

Grouping Data – Intervals Example

If these are the scores on a midterm:

{9,13,18,19,22,25,31,34,35,36,36,38,41,43,44,45}

The corresponding grouped frequency distribution would look like:

i        f_i
01-10    1
11-20    3
21-30    2
31-40    6
41-50    4
Total    16
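As a rough illustration of the grouping step, the sketch below bins the midterm scores into stated intervals of width 10 and counts them. The helper name `interval_label` is hypothetical, and the code assumes integer scores of 1 or more.

```python
from collections import Counter

scores = [9, 13, 18, 19, 22, 25, 31, 34, 35, 36, 36, 38, 41, 43, 44, 45]


def interval_label(score, width=10):
    """Return the stated interval (e.g. '11-20') that contains an integer score >= 1."""
    lower = ((score - 1) // width) * width + 1
    upper = lower + width - 1
    return f"{lower:02d}-{upper}"


# Count how many scores fall in each stated interval.
freq = Counter(interval_label(s) for s in scores)

for label, f in sorted(freq.items()):
    print(label, f)
print("Total", sum(freq.values()))
```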

Page 6: Sociology 690 – Data Analysis

Grouping Data - Real Limits

This implies the need for real limits as there are “gaps” in these intervals. The real limits of an interval are characterized by numbers that are plus and minus one-half unit on each side of stated limits:

For example:

– the interval 11-20 becomes 10.5 – 20.5

– the interval 3.5 – 4.5 becomes 3.45 – 4.55

Page 7: Sociology 690 – Data Analysis

Grouped Data – Width and Midpoint

The width of an interval is simply the difference between the upper and lower real limits.

e.g. for the interval 11-20: 20.5 - 10.5 = 10

The midpoint is determined by calculating the interval width, dividing it by 2, and adding that number to the lower real limit.

e.g. 10/2 + 10.5 = 15.5
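The real limit, width, and midpoint rules can be sketched in a few lines. The function names and the `unit` parameter (the measurement unit, 1 for whole scores, 0.1 for one decimal place) are assumptions made for illustration.

```python
def real_limits(lower, upper, unit=1.0):
    """Extend the stated limits by half a measurement unit on each side."""
    half = unit / 2
    return lower - half, upper + half


def interval_width(lower, upper, unit=1.0):
    """Width = upper real limit minus lower real limit."""
    lo, hi = real_limits(lower, upper, unit)
    return hi - lo


def interval_midpoint(lower, upper, unit=1.0):
    """Midpoint = lower real limit plus half the width."""
    lo, _ = real_limits(lower, upper, unit)
    return lo + interval_width(lower, upper, unit) / 2


print(real_limits(11, 20))        # real limits of the interval 11-20
print(interval_width(11, 20))     # its width
print(interval_midpoint(11, 20))  # its midpoint
```

For intervals stated to one decimal place, such as 3.5 to 4.5, pass `unit=0.1`.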

Page 8: Sociology 690 – Data Analysis

Graphing Grouped Data

The quantitative version of a bar graph is called a histogram.

When the frequencies are connected by a line, the result is called a frequency polygon.

[Figures: histogram and frequency polygon of the grouped midterm scores; horizontal axis shows the intervals 01-10, 11-20, 21-30, 31-40, 41-50; vertical axis shows frequency from 0 to 7]

Page 9: Sociology 690 – Data Analysis

2. Describing Central Tendency

Modes

Medians

Means

Skew

But we can do more than simply create a frequency distribution. We can also describe how these observations “bunch up” and how they “distribute”. Describing how they bunch up involves measures of central tendency.

Page 10: Sociology 690 – Data Analysis

Central Tendency - Modes

The mode for raw data is simply the most frequent score: e.g. {2,3,5,6,6,8}. The mode is 6.

The mode for grouped data is the midpoint of the interval containing the highest frequency (35.5 here):

i        f_i
01-10    1
11-20    3
21-30    2
31-40    6
41-50    4
Total    16
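A minimal sketch of both mode rules follows. The function names and the tuple layout (stated lower limit, stated upper limit, frequency) are assumptions; ties are resolved in favour of the first value or interval encountered.

```python
from collections import Counter


def mode_raw(values):
    """Most frequent score; on a tie, the value seen first wins."""
    return Counter(values).most_common(1)[0][0]


def mode_grouped(intervals):
    """Midpoint of the interval with the highest frequency."""
    lower, upper, _ = max(intervals, key=lambda row: row[2])
    return (lower + upper) / 2


midterm = [(1, 10, 1), (11, 20, 3), (21, 30, 2), (31, 40, 6), (41, 50, 4)]

print(mode_raw([2, 3, 5, 6, 6, 8]))
print(mode_grouped(midterm))
```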

Page 11: Sociology 690 – Data Analysis

Central Tendency - Medians

The median for raw data is simply the score at the middle position. This involves taking the (N+1)/2 position and stating the associated value attached to it:

e.g. {2,3,5,6,8}: (5+1)/2 = 3, the third position score

The third position score is 5.

e.g. {2,3,5,8}: (4+1)/2 = 2.5, the 2.5 position score

The 2.5 position score lies halfway between the second and third scores: (3+5)/2 = 4
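The (N+1)/2 position rule can be sketched as follows. The function name is hypothetical; when the position is not a whole number, the two neighbouring scores are averaged, as in the second example.

```python
import math


def median_raw(values):
    """Score at the (N+1)/2 position of the sorted data."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)
    position = (len(ordered) + 1) / 2           # 1-based position
    below = ordered[math.floor(position) - 1]   # score at or just below the position
    above = ordered[math.ceil(position) - 1]    # score at or just above the position
    return (below + above) / 2


print(median_raw([2, 3, 5, 6, 8]))
print(median_raw([2, 3, 5, 8]))
```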

Page 12: Sociology 690 – Data Analysis

Medians for Grouped Data

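The median for grouped data is:

\[ \text{Median} = ll + \left( \frac{N/2 - \mathrm{Cum}\,f}{f_i} \right) \cdot i \]

where ll is the lower real limit of the interval containing the median, Cum f is the cumulative frequency below that interval, f_i is the frequency of that interval, and i is the interval width.

For our previous distribution of scores, the answer would be:

30.5 + ((16/2 - 6)/6)*10 = 30.5 + 3.33 = 33.83

i        f_i
01-10    1
11-20    3
21-30    2
31-40    6
41-50    4
Total    16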
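The grouped median calculation can be sketched as below. The function name and the tuple layout (stated lower limit, stated upper limit, frequency) are assumptions; the intervals are expected in ascending order.

```python
def grouped_median(intervals, unit=1):
    """Interpolate the median within the interval that contains the N/2-th case."""
    n = sum(f for _, _, f in intervals)
    cum_below = 0  # cumulative frequency below the current interval
    for lower, upper, f in intervals:
        if f > 0 and cum_below + f >= n / 2:
            ll = lower - unit / 2                # lower real limit
            width = (upper + unit / 2) - ll      # interval width
            return ll + ((n / 2 - cum_below) / f) * width
        cum_below += f
    raise ValueError("no observations")


midterm = [(1, 10, 1), (11, 20, 3), (21, 30, 2), (31, 40, 6), (41, 50, 4)]
print(round(grouped_median(midterm), 2))
```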

Page 13: Sociology 690 – Data Analysis

Central Tendency - Mean

For raw data, the mean is simply the sum of the values divided by N:

\[ \bar{X} = \frac{\sum X_i}{N} \]

Suppose Xi = {2,3,5,6}. The mean would be 16/4 = 4.

Page 14: Sociology 690 – Data Analysis

Means for Grouped Data

For grouped data, the mean is the sum of each interval's frequency times its midpoint, divided by N:

\[ \bar{X} = \frac{\sum f_i m_i}{N} \]

For our previous distribution, the answer would be:

i        f_i
01-10    1
11-20    3
21-30    2
31-40    6
41-50    4
Total    16

(1(5.5) + 3(15.5) + 2(25.5) + 6(35.5) + 4(45.5)) / 16 = 498 / 16 = 31.125
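A short sketch of the grouped mean, assuming each interval is given as a (stated lower limit, stated upper limit, frequency) tuple. The midpoint of the stated limits equals the midpoint of the real limits, so the stated limits are averaged directly.

```python
def grouped_mean(intervals):
    """Sum of frequency times midpoint, divided by N."""
    n = sum(f for _, _, f in intervals)
    weighted_sum = sum(f * (lower + upper) / 2 for lower, upper, f in intervals)
    return weighted_sum / n


midterm = [(1, 10, 1), (11, 20, 3), (21, 30, 2), (31, 40, 6), (41, 50, 4)]
print(grouped_mean(midterm))
```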

Page 15: Sociology 690 – Data Analysis

3. Describing Variation

Range

Mean Deviation

Variance

Standard Scores (Z score)

Page 16: Sociology 690 – Data Analysis

Describing Variation - Range

The Range for raw scores is the highest minus the lowest score, plus one (i.e. inclusive)

The Range for grouped scores is the upper real limit of the highest interval minus the lower real limit of the lowest interval. In the case of our previous distribution this would be 50.5 - 0.5 = 50.

i        f_i
01-10    1
11-20    3
21-30    2
31-40    6
41-50    4
Total    16
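Both range rules can be sketched briefly. The function names and the `unit` parameter (the measurement unit, 1 for whole scores) are assumptions made for illustration.

```python
def range_raw(values, unit=1):
    """Inclusive range: highest minus lowest score, plus one measurement unit."""
    return max(values) - min(values) + unit


def range_grouped(intervals, unit=1):
    """Upper real limit of the highest interval minus lower real limit of the lowest."""
    lowest = min(lower for lower, _, _ in intervals)
    highest = max(upper for _, upper, _ in intervals)
    return (highest + unit / 2) - (lowest - unit / 2)


scores = [9, 13, 18, 19, 22, 25, 31, 34, 35, 36, 36, 38, 41, 43, 44, 45]
midterm = [(1, 10, 1), (11, 20, 3), (21, 30, 2), (31, 40, 6), (41, 50, 4)]

print(range_raw(scores))
print(range_grouped(midterm))
```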

Page 17: Sociology 690 – Data Analysis

Describing Variation – Mean Deviation

The mean deviation is the sum of all deviations, in absolute numbers, divided by N.

\[ MD = \frac{\sum |X_i - \bar{X}|}{N} \]

Consider the set of observations {6,7,9,10}. The mean is 8 and the MD is (|6-8|+|7-8|+|9-8|+|10-8|)/4 = 6/4 = 1.5
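The mean deviation for raw data can be sketched as follows; the function name is an assumption made for illustration.

```python
def mean_deviation(values):
    """Average absolute deviation of the scores from their mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n


print(mean_deviation([6, 7, 9, 10]))
```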

Page 18: Sociology 690 – Data Analysis

Mean Deviation for Grouped Data

Again, grouped data implies we substitute frequencies and midpoints for values:

\[ MD = \frac{\sum f_i |m_i - \bar{X}|}{N} \]

i              f_i
$36-40,000     6
$41-45,000     8
$46-50,000     12
$51-55,000     12
$56-60,000     8
$61-65,000     4
Total          50

The mean would be $50,000 (satisfy yourself that that is true). Working in thousands of dollars, the sum of weighted absolute deviations is 6|38-50| + 8|43-50| + 12|48-50| + 12|53-50| + 8|58-50| + 4|63-50| = 72+56+24+36+64+52 = 304, so the MD is 304/50 = 6.08, or $6,080.
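The grouped mean deviation for the income distribution can be checked with a short sketch. The tuple layout (stated lower limit, stated upper limit, frequency, in thousands of dollars) and the function name are assumptions.

```python
def grouped_mean_deviation(intervals):
    """Frequency-weighted average absolute deviation of midpoints from the grouped mean."""
    n = sum(f for _, _, f in intervals)
    mids = [((lower + upper) / 2, f) for lower, upper, f in intervals]
    mean = sum(f * m for m, f in mids) / n
    return sum(f * abs(m - mean) for m, f in mids) / n


# Income intervals in thousands of dollars.
incomes = [(36, 40, 6), (41, 45, 8), (46, 50, 12), (51, 55, 12), (56, 60, 8), (61, 65, 4)]

md = grouped_mean_deviation(incomes)
print(md)          # in thousands of dollars
print(md * 1000)   # in dollars
```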

Page 19: Sociology 690 – Data Analysis

Variation – The Variance

The variance for raw data is the sum of the squared deviations divided by N:

\[ \text{Variance} = \frac{\sum (X_i - \bar{X})^2}{N} \]

Consider the set Xi = {6,7,9,10}. The mean is 8 and the variance is ((6-8)^2 + (7-8)^2 + (9-8)^2 + (10-8)^2)/4 = 10/4 = 2.5
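A minimal sketch of the raw-data variance, dividing by N as the slide does. This matches the standard library's `statistics.pvariance`; `statistics.variance` divides by N-1 instead and would give a different figure.

```python
import statistics


def variance_raw(values):
    """Sum of squared deviations from the mean, divided by N."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


data = [6, 7, 9, 10]
print(variance_raw(data))
print(statistics.pvariance(data))  # same N-denominator definition
```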

Page 20: Sociology 690 – Data Analysis

Variance for Grouped Data

Frequencies and midpoints are still substituted for the values of Xi.

\[ \text{Variance} = \frac{\sum f_i (m_i - \bar{X})^2}{N} \]

i              f_i
$36-40,000     6
$41-45,000     8
$46-50,000     12
$51-55,000     12
$56-60,000     8
$61-65,000     4
Total          50

Again the mean is 50 (in thousands of dollars), and the sum of weighted squared deviations is 6(38-50)^2 + 8(43-50)^2 + 12(48-50)^2 + 12(53-50)^2 + 8(58-50)^2 + 4(63-50)^2 = 864 + 392 + 48 + 108 + 512 + 676 = 2600. The variance is 2600 / 50 = 52, in squared thousands of dollars. The Standard Deviation is the square root of this, about 7.211, or roughly $7,211.
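The grouped variance is easy to get wrong by hand, so a short sketch helps check the arithmetic. The tuple layout (stated lower limit, stated upper limit, frequency, in thousands of dollars) and the function name are assumptions. Summing the terms for this distribution gives 2600 / 50 = 52.

```python
import math


def grouped_variance(intervals):
    """Frequency-weighted squared deviations of midpoints from the grouped mean, over N."""
    n = sum(f for _, _, f in intervals)
    mids = [((lower + upper) / 2, f) for lower, upper, f in intervals]
    mean = sum(f * m for m, f in mids) / n
    return sum(f * (m - mean) ** 2 for m, f in mids) / n


# Income intervals in thousands of dollars.
incomes = [(36, 40, 6), (41, 45, 8), (46, 50, 12), (51, 55, 12), (56, 60, 8), (61, 65, 4)]

variance = grouped_variance(incomes)
std_dev = math.sqrt(variance)
print(variance)            # in squared thousands of dollars
print(round(std_dev, 3))   # in thousands of dollars
```

Note that the variance is expressed in squared units, so only the standard deviation converts back to dollars.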

Page 21: Sociology 690 – Data Analysis

4. Covariance and Correlation

The Definition and Concept

The Formula

Proportional Reduction in Error and r^2

Page 22: Sociology 690 – Data Analysis

Correlation – Definition and Concept

Visually, we can observe the co-variation of two variables as a scatter diagram in which the abscissa and ordinate are the two quantitative continua and each point simultaneously maps one pair of scores.

Page 23: Sociology 690 – Data Analysis

Correlation - Formula

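Think of the correlation as a proportional measure of the relationship between two variables. It consists of the co-variation divided by the average variation (the geometric mean of the variation in X and the variation in Y):

\[ r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \cdot \sum (Y - \bar{Y})^2}} \]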
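A minimal sketch of Pearson's r computed from deviation scores. The function name and the small data set are hypothetical and are not taken from the slides.

```python
import math


def pearson_r(xs, ys):
    """Co-variation of X and Y divided by the square root of the product of their variations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(variation_x * variation_y)


# Hypothetical paired scores, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(pearson_r(xs, ys), 4))
```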

Page 24: Sociology 690 – Data Analysis

Correlation and P.R.E.

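[Figure: scatter diagram of Y against X showing the Y mean (\bar{Y}) and the regression line (Y')]

Consider this scatter diagram. The proportion of variation around the Y mean (variation before knowing X), less the proportion of variation around the regression line (variation after knowing X), is r^2:

\[ \frac{\sum (Y - \bar{Y})^2}{\sum (Y - \bar{Y})^2} - \frac{\sum (Y - Y')^2}{\sum (Y - \bar{Y})^2} = r^2 \]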
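The proportional reduction in error can be sketched by fitting a least-squares line and comparing the variation before and after knowing X. The function name and the data are hypothetical; the least-squares fit is an assumption about how the regression line Y' is obtained.

```python
def r_squared_pre(xs, ys):
    """(Variation around the Y mean - variation around the regression line) / variation around the Y mean."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx                      # least-squares slope
    intercept = mean_y - slope * mean_x    # least-squares intercept
    before = sum((y - mean_y) ** 2 for y in ys)  # error before knowing X
    after = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))  # error after
    return (before - after) / before


# Hypothetical paired scores, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(r_squared_pre(xs, ys), 4))
```

For these numbers the result equals the square of Pearson's r computed from the same pairs, which is the identity the slide describes.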

Page 25: Sociology 690 – Data Analysis

IV. Quantitative Statistical Example of Elaboration: Partial Correlation

Step 1 – Construct the zero order Pearson’s correlations (r).

Assume rxy = .55, where x = divorce rates and y = suicide rates.

Further, assume that the unemployment rate (z) is our control variable and that rxz = .60 and ryz = .40.

Step 2 – Calculate the partial correlation (rxy.z).

\[ r_{xy.z} = \frac{r_{xy} - r_{xz} r_{yz}}{\sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}} = \frac{.55 - (.6)(.4)}{\sqrt{(1 - .36)(1 - .16)}} = .42 \]

Step 3 – Draw conclusions.

Before z: (rxy)^2 = .30

After z: (rxy.z)^2 = .18

Therefore, z accounts for (.30 - .18), or 12%, of the variation in y, and (.12/.30), or 40%, of the relationship between x and y.
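The three steps can be reproduced with a short sketch using the first-order partial correlation formula. The function and variable names are assumptions. Because the slide rounds at each step, the unrounded share of the relationship comes out slightly above 40%.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


r_xy, r_xz, r_yz = 0.55, 0.60, 0.40
r_xy_z = partial_correlation(r_xy, r_xz, r_yz)

before = r_xy ** 2            # variation in y explained by x before controlling for z
after = r_xy_z ** 2           # variation explained after controlling for z
accounted = before - after    # share of variation in y attributable to z
share_of_relationship = accounted / before

print(round(r_xy_z, 2))
print(round(before, 2), round(after, 2))
print(round(accounted, 2), round(share_of_relationship, 2))
```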