cond operator-function, drop, and keep - sscc - homegwallace/papers/september 14, 2016.pdf ·...


PA 818
Professor Wallace
September 14, 2016

Lecture:

1. More STATA stuff: the cond operator-function, drop, and keep
2. Descriptive methods for nominal and ordinal data using STATA (continued) and Excel
   a. Frequency distributions
   b. Discrete histograms
   c. Pie charts
3. Describing relationships between two or more categorical variables
   a. Cross tabular frequency distribution
   b. Graphical techniques
   c. STATA examples, Excel examples
4. Graphical descriptive techniques for interval and time series data
   a. Histograms and frequency distributions
   b. Stem and leaf display (won’t get to)
   c. Ogives (O’jive) (won’t get to)


Descriptive Techniques for Categorical Data

• Frequency distribution – a tabular description for nominal data that lists the number of units associated with each category

• Relative frequency distribution – a tabular description for nominal data that lists the fraction or percentage of units associated with each category

• Cumulative frequency distribution (ordinal only) – a tabular description for ordinal data that lists the cumulative (category and below) count, fraction, or percentage of units associated with each category

Example 1: Using the data in the updated CPS ORG file, provide a frequency, relative frequency, and cumulative frequency distribution for grouped educational attainment for men in 2007. The education level categories should be less than high school, high school, some college (no degree), and 4 or more years of college.

    * start with 1 = less than high school (ed<12), 0 otherwise
    gen ed_level=cond(ed<12,1,0)
    * then overwrite the remaining groups one at a time
    replace ed_level=cond(ed==12,2,ed_level)
    replace ed_level=cond(ed>12 & ed<16,3,ed_level)
    * note: Stata treats a missing ed as larger than any number, so ed>15 is also true for missing values
    replace ed_level=cond(ed>15,4,ed_level)
    tab ed_level

   ed_level |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      8,000       11.01       11.01
          2 |     24,602       33.86       44.87
          3 |     16,237       22.34       67.21
          4 |     23,827       32.79      100.00
------------+-----------------------------------
      Total |     72,666      100.00
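As a check on how the Percent and Cum. columns relate to the counts, the following is a minimal sketch that rebuilds them from the four frequencies in the table above. It does not use the CPS ORG file; the variable names freq, n_total, pct, and cum_pct are made up for illustration.

```stata
* Illustrative sketch: rebuild Percent and Cum. from the counts in the table.
* The names freq, n_total, pct and cum_pct are hypothetical.
clear
input ed_level freq
1  8000
2 24602
3 16237
4 23827
end

* total number of units (72,666)
egen n_total = total(freq)

* relative frequency, expressed as a percent
gen pct = 100 * freq / n_total

* cumulative percent: running sum of the relative frequencies
gen cum_pct = sum(pct)

format pct cum_pct %6.2f
list ed_level freq pct cum_pct, noobs
```

The running sum only makes sense because the categories are ordered from lowest to highest education level, which is why the cumulative column applies to ordinal data only.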


Perhaps you are preparing this for your boss and want to provide more information about what the education levels actually mean. In this case we can make use of data labels of the type that already exist for the sex and race variables. Let’s create some for ed_level.

    label define ed_levell 1 "High school dropout" /*
    */ 2 "High school" /*
    */ 3 "Some college" /*
    */ 4 "4 or more years of college"
    label values ed_level ed_levell
    tab ed_level

                  ed_level |      Freq.     Percent        Cum.
---------------------------+-----------------------------------
       High school dropout |      8,000       11.01       11.01
               High school |     24,602       33.86       44.87
              Some college |     16,237       22.34       67.21
4 or more years of college |     23,827       32.79      100.00
---------------------------+-----------------------------------
                     Total |     72,666      100.00
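The grouping and labeling steps can also be combined. The sketch below is one possible alternative, not the approach used in these notes: it uses recode with generate() to create a labeled variable in a single command. The data values and the name ed_level2 are made up for illustration, and the rules assume ed is recorded in whole years.

```stata
* Illustrative sketch on made-up values of ed (assumed to be whole years).
* ed_level2 is a hypothetical name chosen so it does not clash with ed_level.
clear
input ed
 8
12
14
16
 .
end

* each rule maps a range of ed to a code and attaches a value label;
* min and max refer to nonmissing values, so a missing ed stays missing
recode ed (min/11 = 1 "High school dropout") ///
          (12     = 2 "High school")         ///
          (13/15  = 3 "Some college")        ///
          (16/max = 4 "4 or more years of college"), generate(ed_level2)

tab ed_level2, missing
```

Unlike the chained cond() calls, this version leaves observations with missing ed as missing rather than placing them in the top category.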


• Discrete histogram (a type of bar graph) – a graphical representation of a frequency distribution or relative frequency distribution whereby bars are associated with categories and the height of each bar on the graph represents the frequencies or relative frequencies associated with its corresponding category

Example 2: The same data described using a discrete relative frequency histogram.

[Figure: discrete histogram, "Distribution of Educational Attainment (male full-time, full-year workers in 2007)". Vertical axis: Percent, 0 to 40. Horizontal axis categories: High school dropout, High school, Some college, 4 or more years of college.]

• Pie chart – a graphical representation of the relative frequency distribution whereby a circle (or pie) is divided into slices with each slice representing a category and where the size of the slice is proportional to the relative frequency of its associated category.

Example 3: The same relative frequency distribution of educational attainment displayed in pie chart format.

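[Figure: pie chart, "Distribution of Educational Attainment (male full-time, full-year workers in 2007)". Slices: High school dropout, High school, Some college, 4 or more years of college. Percent labels in the extracted figure: 23.29%, 39.15%, 18.22%, 19.35% (the assignment of percentages to slices is not recoverable).]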

• Deciding between frequency distributions, discrete histograms, and pie charts – there are tradeoffs.

o Graphs and charts versus tables
    Graphs and charts take up more room than the same information displayed in tabular format, but they may be easier for some audiences to interpret.
    Too many graphs or charts is generally a bad idea.
    Graphs may be better when providing descriptive statistics related to one or two data items, rather than many items in a data set.
    Tabular displays for nominal data can be integrated into tables which also provide basic descriptive statistics for interval data.

o Discrete histogram versus pie chart
    In general, discrete histograms are more efficient and flexible than pie charts, but pie charts are probably better at communicating relative frequencies to lay audiences. I hardly ever opt for the pie chart (see examples below).
    Simple discrete histograms can be printed in black and white, whereas pie charts usually need to incorporate some color or texture.


Table 1: Male Poverty per the Supplemental Poverty Measure, 2013

Group                          Percentage within Group    Percentage of
                               that is Poor               Poor Men
By Age
  19-24 years                  22.4                       21.3
  25-34 years                  15.2                       23.3
  35-44 years                  13.0                       18.4
  45-54 years                  12.4                       18.7
  55-64 years                  13.3                       18.3
  65+ years                    11.9                        4.8
Ages 19-64
By Race/Ethnicity
  White (non-Hispanic)         10.5                       45.1
  Black (non-Hispanic)         22.3                       17.9
  Hispanic                     24.6                       28.9
  Other                        16.1                        8.2
By Region
  Northeast                    13.2                       15.4
  Midwest                      12.2                       16.9
  South                        14.9                       33.5
  West                         17.9                       34.2
By Metro
  Metro                        15.2                       87.7
  Non-metro                    12.5                       11.8
  Not identifiable             10.8                        0.6
By Family
  Non-family                   25.7                       30.3
  Family without children      12.5                       37.5
  Family with children         12.4                       32.2
By Head Family Type
  Married couple                9.4                       37.9
  Cohabitating couple          14.9                        9.0
  Male-headed family           21.6                       11.4
  Female-headed family         25.4                       11.3
  Male nonfamily               25.7                       30.3
By Education Level
  Less than high school        32.2                       24.9
  High school, no college      17.4                       36.4
  Some college                 13.4                       26.4
  College+                      6.3                       12.3


Discrete Histograms in Stata

Example 2: Distribution of education level for male full-time, full-year workers in 2007

    hist ed_level, discrete percent

* The following edits were made in STATA’s graph editor to get the graph for this example:

• Bar properties
    o bar width set to 0.5
    o color set to black
• xaxis1 title – hidden in advanced tab
• xaxis1 properties – label properties
    o use value labels checked
    o angle set at 45 degrees
    o under range/delta – minimum value set to 1, maximum value set to 4, and delta set to 1
• Title and subtitle added
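The graph editor steps listed above can also be approximated with command options, which makes the graph reproducible from a do-file. The following is a sketch under stated assumptions, not the method used in these notes: it rebuilds the education counts from the frequency table (with a made-up variable freq used as a frequency weight) instead of loading the CPS ORG file, and it assumes that hist in your Stata version passes barwidth(), color(), and the xlabel() suboptions through as twoway bar and axis options.

```stata
* Illustrative sketch: stand-in data built from the frequency table,
* with freq (a hypothetical name) used as a frequency weight.
clear
input ed_level freq
1  8000
2 24602
3 16237
4 23827
end
label define ed_levell 1 "High school dropout" 2 "High school" ///
    3 "Some college" 4 "4 or more years of college"
label values ed_level ed_levell

* barwidth() and color() stand in for the bar property edits,
* xlabel() for the value labels, 45 degree angle and 1-to-4 range,
* xtitle("") for hiding the axis title
histogram ed_level [fweight = freq], discrete percent ///
    barwidth(0.5) color(black) ///
    xlabel(1(1)4, valuelabel angle(45)) xtitle("") ///
    title("Distribution of Educational Attainment") ///
    subtitle("(male full-time, full-year workers, 2007)")
```

With the real data file the input block and the weight are unnecessary; the options after the comma are the part that replaces the point-and-click edits.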

[Figure: discrete histogram, "Distribution of Educational Attainment (male full-time, full-year workers, 2007)". Vertical axis: Percent, 0 to 40. Horizontal axis categories: High school dropout, High school, Some college, 4 or more years of college.]

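Example 3: Same information in a pie chart with labels

    graph pie, over(new_ed) plabel(_all name) /*
    */ plabel(_all percent)

(Here new_ed is presumably the grouped education variable, carrying the same value labels as ed_level.)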

* The following edits were made in STATA’s graph editor to get the graph for this example:

o Legend – advanced tab – hide legend checked
o The percent and name labels were moved so that they don’t overlap
o Title and subtitle added
o Pielabel text changed to white

[Figure: pie chart, "Distribution of Educational Attainment (male full-time, full-year workers, 2007)". Slice labels recovered: High school dropout, High school, Some college. Percent labels: 11.01%, 33.86%, 22.34%, 32.79%.]

Describing relationships between two categorical variables

• Cross tabular frequency distribution – a tabular display that shows the absolute or relative frequencies associated with all combinations of two nominal variables.

• Stacked or Side-by-Side Discrete Histograms – a graphical display whereby multiple discrete histograms for different groups, but for the same categorical data, are shown stacked or side by side.

Example 4: How does educational attainment vary by race among men in 2007? Using a cross tabulated frequency distribution

A cross tabular frequency distribution of race and educational attainment would help us answer this question.

    tab ed_level race, col

+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

                      | Equals 1 if non-Hisp white, 2 if non-Hisp
                      |   black, 3 if Hisp, and 4 if other race
             ed_level | White, No  Black, No   Hispanic      Other |     Total
----------------------+--------------------------------------------+----------
  High school dropout |     3,369        511      3,790        330 |     8,000
                      |      6.43       8.91      38.12       7.19 |     11.01
----------------------+--------------------------------------------+----------
          High school |    17,490      2,373      3,488      1,251 |    24,602
                      |     33.38      41.38      35.09      27.27 |     33.86
----------------------+--------------------------------------------+----------
         Some college |    12,467      1,483      1,409        878 |    16,237
                      |     23.79      25.86      14.17      19.14 |     22.34
----------------------+--------------------------------------------+----------
4 or more years of co |    19,078      1,367      1,254      2,128 |    23,827
                      |     36.41      23.84      12.61      46.39 |     32.79
----------------------+--------------------------------------------+----------
                Total |    52,404      5,734      9,941      4,587 |    72,666
                      |    100.00     100.00     100.00     100.00 |    100.00


Another way

    tab race ed_level, row

+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

        Equals 1 if |
  non-Hisp white, 2 |
 if non-Hisp black, |
3 if Hisp, and 4 if |                  ed_level
         other race | High scho  High scho  Some coll  4 or more |     Total
--------------------+--------------------------------------------+----------
White, Non-Hispanic |     3,369     17,490     12,467     19,078 |    52,404
                    |      6.43      33.38      23.79      36.41 |    100.00
--------------------+--------------------------------------------+----------
Black, Non-Hispanic |       511      2,373      1,483      1,367 |     5,734
                    |      8.91      41.38      25.86      23.84 |    100.00
--------------------+--------------------------------------------+----------
           Hispanic |     3,790      3,488      1,409      1,254 |     9,941
                    |     38.12      35.09      14.17      12.61 |    100.00
--------------------+--------------------------------------------+----------
              Other |       330      1,251        878      2,128 |     4,587
                    |      7.19      27.27      19.14      46.39 |    100.00
--------------------+--------------------------------------------+----------
              Total |     8,000     24,602     16,237     23,827 |    72,666
                    |     11.01      33.86      22.34      32.79 |    100.00
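The two tables above contain the same 16 cell counts and differ only in which percentage is reported. As a rough sketch, the cell counts can be typed in directly and weighted, and both percentages can be requested in one command. This does not use the CPS ORG file; the variable n is a made-up name for the cell count, and race is coded 1 to 4 in the order given by the variable label above.

```stata
* Illustrative sketch: the 16 cell counts from the cross tabulation,
* entered by hand; n is a hypothetical name for the cell frequency.
clear
input ed_level race n
1 1  3369
1 2   511
1 3  3790
1 4   330
2 1 17490
2 2  2373
2 3  3488
2 4  1251
3 1 12467
3 2  1483
3 3  1409
3 4   878
4 1 19078
4 2  1367
4 3  1254
4 4  2128
end

* row and column percentages in one table; fweight treats n as a count
tabulate ed_level race [fweight = n], row column
```

Column percentages answer "what share of each race group has this education level", while row percentages answer "what share of each education level belongs to this race group", so the choice depends on the question being asked.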


Example 5: How does educational attainment vary by race? Using discrete histograms

• Stacked

hist new_ed, discrete percent by(race, col(1))

*** We always want to make sure the categories and Y-axis scales are the same.

[Figure: four stacked discrete histograms, "Distribution of Educational Attainment By Race (male full-time, full-year workers, 2007)". Panels: White, Non-Hispanic; Black, Non-Hispanic; Hispanic; Other. Vertical axis in each panel: Percent, 0 to 50. Horizontal axis categories: High school dropout, High school, Some college, 4 or more years of college.]


* Edits made in STATA’s graph editor:

o Plotregion1 – plot1 – bar width set to 0.5 and color set to black
o Plotregion2 – plot1 – bar width set to 0.5 and color set to black
o Plotregion3 – plot1 – bar width set to 0.5 and color set to black
o xaxis1-xaxis4 – axis rule – range/delta checked and set to minimum value=1, maximum value=4, and delta=1
o xaxis3 – label properties – label properties – show labels checked, use value labels checked, angle set to 45 degrees
o yaxis1-yaxis4 – axis rule – range/delta checked and set to minimum value=0, maximum value=50, and delta=10 – labels set horizontal
o Title and subtitle added
o Bottom position title and note hidden


• Side by Side

    hist new_ed, discrete percent by(race, col(3))


o Plotregion1 – plot1 – bar width set to 0.75 and color set to black
o Plotregion2 – plot1 – bar width set to 0.75 and color set to black
o Plotregion3 – plot1 – bar width set to 0.75 and color set to black
o xaxis1-xaxis3 – axis rule – range/delta checked and set to minimum value=1, maximum value=4, and delta=1
o xaxis3 – label properties – label properties – show labels checked, use value labels checked, angle set to 45 degrees

[Figure: four side-by-side discrete histograms, "Distribution of Educational Attainment By Race (male full-time, full-year workers, 2007)". Panels: White, Non-Hispanic; Black, Non-Hispanic; Hispanic; Other. Vertical axis: Percent, 0 to 50. Horizontal axis categories in each panel: High school dropout, High school, Some college, 4 or more years of college.]

o Title and subtitle added
o Bottom position title (“new_ed”) and note hidden


Graphical Techniques for Interval Data

• Histogram – graphical display which shows the absolute or relative frequency associated with particular classes (intervals) of equal width.

o With a histogram we have to determine either the class width and start value, or the number of bins (classes) and start value.

o Start value – the value at which the first (leftmost) class starts

o Number of bins (classes) – the number of classes

    Number of observations    Number of classes
    <50                       5-7
    50-200                    7-9
    200-500                   9-10
    500-1,000                 10-11
    1,000-5,000               11-13
    5,000-50,000              13-17
    >50,000                   17-20

o Class width – the width of each class

    Class width = (largest value − smallest value) / Number of classes


Example 6: Use the CPS ORG data to create a histogram for male hourly wages in 2007

    sum wage

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        wage |     72,666    25.18414    14.83449   4.077735    94.0815

The number of observations is larger than 50,000, so by the guidelines we should have 17-20 classes (bins). With 18 classes, the class width formula gives approximately (94.08 − 4.08)/18 = 90/18 = 5. We will select a class width of 5 and start at 0.

    hist wage, start(0) width(5) percent

This is an example of a positively (or right) skewed histogram. The histogram could also be described as unimodal as it has one peak. In contrast a bimodal histogram has two peaks.
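The class-width arithmetic in Example 6 can be checked with a few scalars. This is only a sketch: the minimum and maximum are copied from the sum wage output, 18 is one choice inside the 17-20 guideline, and the scalar names are made up.

```stata
* Illustrative sketch: class width from the summary statistics in Example 6.
* wmin and wmax are copied from the sum wage output; names are hypothetical.
scalar wmin   = 4.077735
scalar wmax   = 94.0815
scalar nclass = 18

* class width = (largest value - smallest value) / number of classes
scalar cwidth = (wmax - wmin) / nclass
display "class width = " %5.2f cwidth

* classes of width 5 starting at 0 that are needed to reach the maximum
display "classes needed = " ceil(wmax / 5)
```

Starting at 0 with a width of 5 takes 19 classes to cover the largest wage, which still falls inside the 17-20 range suggested for more than 50,000 observations.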

[Figure: histogram, "Distribution of Hourly Wages (male full-time, full-year workers, 2007)". Vertical axis: Percent, 0 to 20. Horizontal axis: Hourly wages in $2014, labeled 0 to 95 in steps of 5.]


* Edits made in STATA’s graph editor:

o Plotregion – plot1 – color set to black
o Title and subtitle added
o Xaxis1 – title – advanced tab – Y offset set to -3

• Histogram with some grouped data – in the prior example the relative frequencies for wage categories above $70 are really low. One solution that would allow for more detail at lower wages, but less at higher wages, is to group the upper classes.

Example 7: Create a histogram with a class width of 5 where wages above $50 are grouped into one catch-all category. This allows for more detail at the bottom of the distribution without near-zero frequency categories at the top.

    egen cwage=cut(wage), at(0(5)90)
    * every class at or above $50 (and any missing value) is recoded to the single value 50,
    * so as written only the label attached to 50 is used among the top three labels
    replace cwage=cond(cwage>=50,50,cwage)

    label define cwagel 0 "$0-$5" 5 "$5-$10" /*
    */ 10 "$10-$15" 15 "$15-$20" /*
    */ 20 "$20-$25" 25 "$25-$30" /*
    */ 30 "$30-$35" 35 "$35-$40" /*
    */ 40 "$40-$45" 45 "$45-$50" /*
    */ 50 "$50-$60" 60 "$60-$70" /*
    */ 70 "$70+"

    label values cwage cwagel
    tab cwage
    hist cwage, discrete percent
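The value labels above describe classes of $50-$60, $60-$70, and $70+. One possible way to produce classes with exactly those boundaries is to hand egen cut() uneven break points. This is a sketch on made-up wages, not the code behind the graph in these notes; cwage2 is a hypothetical name and 1000 is an arbitrary upper break chosen only to sit above every wage.

```stata
* Illustrative sketch on made-up wages; cwage2 is a hypothetical name.
clear
input wage
 4.08
12.50
49.99
52.00
65.00
94.08
end

* breaks every $5 up to 50, then 60 and 70; cut() assigns each wage the
* left end of its class, so everything from 70 up to 1000 is coded 70
egen cwage2 = cut(wage), at(0(5)50 60 70 1000)

list wage cwage2, noobs
```

The resulting codes (0, 5, ..., 45, 50, 60, 70) line up with the values used in the label define command, so the same labels could be attached to cwage2.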

[Figure: discrete histogram of grouped wages, "Distribution of Hourly Wages (male full-time, full-year workers, 2007)". Vertical axis: Percent, 0 to 20. Horizontal axis: Hourly wage in $2014, with classes $0-$5, $5-$10, $10-$15, $15-$20, $20-$25, $25-$30, $30-$35, $35-$40, $40-$45, $45-$50, $50-$60, $60-$70, $70+.]

• Using histograms to compare distributions of interval variables across groups

Example 8: Make a graph which allows us to compare the distribution of wages across education levels.

    hist cwage, discrete percent /*
    */ by(ed_level, col(1))

*** Need to make sure the vertical scales and horizontal scales are the same

[Figure: stacked discrete histograms of grouped wages, "Distribution of Hourly Wage By Educational Attainment (male full-time, full-year workers, 2007)". Panel titles recovered: High school dropout, High school, Some college. Vertical axis in each panel: Percent, 0 to 40. Horizontal axis: Hourly wages in $2014, with classes $0-$5 through $45-$50, then $50-$60, $60-$70, $70+.]
