PA 818
Professor Wallace
September 14, 2016

Lecture:
1. More STATA stuff: the cond operator-function, drop, and keep
2. Descriptive methods for nominal and ordinal data using STATA (continued) and Excel
   a. Frequency distributions
   b. Discrete histograms
   c. Pie charts
3. Describing relationships between two or more categorical variables
   a. Cross tabular frequency distribution
   b. Graphical techniques
   c. STATA examples, Excel examples
4. Graphical descriptive techniques for interval and time series data
   a. Histograms and frequency distributions
   b. Stem and leaf display (won’t get to)
   c. Ogives (O’jive) (won’t get to)
Descriptive Techniques for Categorical Data
• Frequency distribution – a tabular description for nominal data that lists the number of units associated with each category
• Relative frequency distribution – a tabular description for nominal data that lists the fraction or percentage of units associated with each category
• Cumulative frequency distribution (ordinal only) – a tabular description for ordinal data that lists the cumulative (category and below) count, fraction, or percentage of units associated with each category
Example 1: Using the data in the updated CPS ORG file, provide a frequency, relative frequency, and cumulative frequency distribution for grouped educational attainment for men in 2007. The education level categories should be less than high school, high school, some college (no degree), and 4 or more years of college.

gen ed_level=cond(ed<12,1,0)
replace ed_level=cond(ed==12,2,ed_level)
replace ed_level=cond(ed>12 & ed<16,3,ed_level)
replace ed_level=cond(ed>15,4,ed_level)
tab ed_level
   ed_level |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      8,000       11.01       11.01
          2 |     24,602       33.86       44.87
          3 |     16,237       22.34       67.21
          4 |     23,827       32.79      100.00
------------+-----------------------------------
      Total |     72,666      100.00
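To see how the chained cond() calls behave, here is a minimal sketch on a small made-up data set. The seven values of ed and the variable name ed_check are illustrative assumptions, not CPS data. It repeats the four steps from Example 1, then builds the same coding with one nested cond() and checks that the two versions agree for whole-number years of education.

```stata
clear
* Toy data: years of education for seven hypothetical people
input ed
8
11
12
13
15
16
18
end

* Step-by-step version, as in Example 1
gen ed_level = cond(ed<12, 1, 0)
replace ed_level = cond(ed==12, 2, ed_level)
replace ed_level = cond(ed>12 & ed<16, 3, ed_level)
replace ed_level = cond(ed>15, 4, ed_level)

* Same coding in one nested cond(); ed_check is a hypothetical name
gen ed_check = cond(ed<12, 1, cond(ed==12, 2, cond(ed<16, 3, 4)))
assert ed_level == ed_check

* Caution: Stata treats a missing value as larger than any number,
* so an observation with missing ed would satisfy ed>15 and be coded 4.
tab ed_level
```

If the real data contain missing values of ed, it is probably worth adding a condition such as `& !missing(ed)` to the last step so those observations are not counted as college graduates.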
Perhaps you are preparing this for your boss and want to provide more information about what the education levels actually mean. In this case we can make use of value labels of the type that already exist for the sex and race variables. Let’s create some for ed_level.

label define ed_levell 1 "High school dropout" /*
    */ 2 "High school" /*
    */ 3 "Some college" /*
    */ 4 "4 or more years of college"
label values ed_level ed_levell
tab ed_level

                  ed_level |      Freq.     Percent        Cum.
---------------------------+-----------------------------------
       High school dropout |      8,000       11.01       11.01
               High school |     24,602       33.86       44.87
              Some college |     16,237       22.34       67.21
4 or more years of college |     23,827       32.79      100.00
---------------------------+-----------------------------------
                     Total |     72,666      100.00
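The Percent and Cum. columns can be reproduced by hand from the Freq. column, which makes the definitions of relative and cumulative frequency concrete. The sketch below uses only the four counts printed in the table; the scalar names are illustrative.

```stata
* Counts copied from the Freq. column of the table above
scalar n1 = 8000
scalar n2 = 24602
scalar n3 = 16237
scalar n4 = 23827
scalar n_total = n1 + n2 + n3 + n4

* Relative frequency = category count / total, shown as a percentage
display "Total:            " n_total
display "Percent, level 1: " %5.2f 100*n1/n_total
display "Percent, level 2: " %5.2f 100*n2/n_total

* Cumulative frequency = count for this category and all below / total
display "Cum., level 2:    " %5.2f 100*(n1+n2)/n_total
display "Cum., level 3:    " %5.2f 100*(n1+n2+n3)/n_total
```

The displayed values should match the table: 11.01 and 33.86 for the first two Percent entries, and 44.87 and 67.21 for the cumulative entries at levels 2 and 3.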
• Discrete histogram (a type of bar graph) – a graphical representation of a frequency distribution or relative frequency distribution whereby bars are associated with categories and the height of each bar on the graph represents the frequencies or relative frequencies associated with its corresponding category
Example: The same data described using a discrete relative histogram.
[Figure: discrete histogram, "Distribution of Educational Attainment (male full-time, full-year workers in 2007)"; y-axis: Percent (0 to 40); x-axis categories: High school dropout, High school, Some college, 4 or more years of college]
• Pie chart – a graphical representation of the relative frequency distribution whereby a circle (or pie) is divided into slices with each slice representing a category and where the size of the slice is proportional to the relative frequency of its associated category.
Example 3: The same relative frequency distribution of educational attainment displayed in pie chart format.
[Figure: pie chart, "Distribution of Educational Attainment (male full-time, full-year workers in 2007)"; slices: High school dropout, High school, Some college, 4 or more years of college; percentage labels as extracted: 23.29%, 39.15%, 18.22%, 19.35%]
• Deciding between frequency distributions, discrete histograms, and pie charts – there are tradeoffs.
o Graphs and charts versus tables
  - Graphs and charts take up more room than the same information displayed in tabular format, but they may be easier for some audiences to interpret.
  - Too many graphs or charts is generally a bad idea.
  - Graphs may be better when providing descriptive statistics related to one or two data items, rather than many items in a data set.
  - Tabular displays for nominal data can be integrated into tables which also provide basic descriptive statistics for interval data.
o Discrete histogram versus pie chart
  - In general, discrete histograms are more efficient and flexible than pie charts, but pie charts are probably better at communicating relative frequencies to lay audiences. I hardly ever opt for the pie chart (see examples below).
  - Simple discrete histograms can be printed in black and white, whereas pie charts usually need to incorporate some color or texture.
Table 1: Male Poverty per the Supplemental Poverty Measure, 2013

Group                        Percentage within Group    Percentage of
                             that is Poor               Poor Men
By Age
  19-24 years                22.4                       21.3
  25-34 years                15.2                       23.3
  35-44 years                13.0                       18.4
  45-54 years                12.4                       18.7
  55-64 years                13.3                       18.3
  65+ years                  11.9                        4.8
Ages 19-64
By Race/Ethnicity
  White (non-Hispanic)       10.5                       45.1
  Black (non-Hispanic)       22.3                       17.9
  Hispanic                   24.6                       28.9
  Other                      16.1                        8.2
By Region
  Northeast                  13.2                       15.4
  Midwest                    12.2                       16.9
  South                      14.9                       33.5
  West                       17.9                       34.2
By Metro
  Metro                      15.2                       87.7
  Non-metro                  12.5                       11.8
  Not identifiable           10.8                        0.6
By Family
  Non-family                 25.7                       30.3
  Family without children    12.5                       37.5
  Family with children       12.4                       32.2
By Head Family Type
  Married couple              9.4                       37.9
  Cohabitating couple        14.9                        9.0
  Male-headed family         21.6                       11.4
  Female-headed family       25.4                       11.3
  Male nonfamily             25.7                       30.3
By Education Level
  Less than high school      32.2                       24.9
  High school, no college    17.4                       36.4
  Some college               13.4                       26.4
  College+                    6.3                       12.3
Discrete Histograms in Stata
Example 2: Distribution of education level for male full-time, full-year workers in 2007

hist ed_level, discrete percent

* The following edits were made in STATA’s graph editor to get to the graph shown above:
• Bar Properties
o bar width set to 0.5
o color set to black
• xaxis1 title – hidden in advanced tab
• xaxis1 properties – label properties
o use value labels checked
o angle set at 45 degrees
o Under range/delta – minimum value set to 1, max value set to 4, and delta set to 1
• Title and Subtitle added
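Most of the graph editor steps listed above can also be requested as options on the hist command itself, which makes the graph reproducible from a do-file. The following is a sketch under stated assumptions, not the lecture's own code: it rebuilds the data from the four counts in the Example 1 frequency table (the count variable n is a hypothetical helper), and it uses gap(50) to narrow the bars as a stand-in for setting the bar width to 0.5 in the editor.

```stata
clear
* Rebuild the Example 1 frequency table as one row per person
input ed_level n
1  8000
2 24602
3 16237
4 23827
end
expand n

label define ed_levell 1 "High school dropout" 2 "High school" ///
    3 "Some college" 4 "4 or more years of college"
label values ed_level ed_levell

* gap(50) shrinks each bar by 50 percent; color(black) fills the bars;
* xtitle("") hides the x-axis title; xlabel() uses the value labels at 45 degrees
hist ed_level, discrete percent gap(50) color(black)      ///
    xtitle("")                                            ///
    xlabel(1(1)4, valuelabel angle(45))                   ///
    title("Distribution of Educational Attainment")       ///
    subtitle("(male full-time, full-year workers, 2007)")
```

Run this from a do-file, since the /// line continuations are not accepted when typed interactively.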
[Figure: discrete histogram, "Distribution of Educational Attainment (male full-time, full-year workers, 2007)"; y-axis: Percent (0 to 40); x-axis categories: High school dropout, High school, Some college, 4 or more years of college]
Example 3: Same information in a pie chart with labels

graph pie, over(new_ed) plabel(_all name) /*
    */ plabel(_all percent)

* The following edits were made in STATA’s graph editor to get to the graph shown above:
o Legend – advanced tab – hide legend checked.
o The percent and name labels were moved so that they don’t overlap
o Title and subtitle added
o Pielabel text changed to white
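Some of these edits can be approximated with command options instead of the graph editor. The sketch below is an illustration only: it works from aggregated counts (the four frequencies from Example 1, held in a hypothetical variable n) rather than the person-level CPS file, and the gap(10) offset that pushes the name labels outward is a guess that may need adjusting by eye.

```stata
clear
* One row per education level, with its count from Example 1
input ed_level n
1  8000
2 24602
3 16237
4 23827
end

label define ed_levell 1 "High school dropout" 2 "High school" ///
    3 "Some college" 4 "4 or more years of college"
label values ed_level ed_levell

* graph pie n, over() sizes each slice by the sum of n in that category.
* legend(off) hides the legend; color(white) sets the label text color;
* gap() moves the name labels outward so they sit apart from the percents.
graph pie n, over(ed_level)                               ///
    plabel(_all name, color(white) gap(10))               ///
    plabel(_all percent, color(white) format(%4.2f))      ///
    legend(off)                                           ///
    title("Distribution of Educational Attainment")       ///
    subtitle("(male full-time, full-year workers, 2007)")
```

Because the slices are built from the same counts, the percent labels should agree with the Percent column of the frequency table.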
[Figure: pie chart, "Distribution of Educational Attainment (male full-time, full-year workers, 2007)"; slices: High school dropout 11.01%, High school 33.86%, Some college 22.34%, 4 or more years of college 32.79%]
Describing relationships between two categorical variables
• Cross tabular frequency distribution – a tabular display that shows the absolute or relative frequencies associated with all combinations of two nominal variables.
• Stacked or Side-by-Side Discrete Histograms – a graphical display whereby multiple discrete histograms for different groups, but for the same categorical data, are shown stacked or side by side.
Example 4: How does educational attainment vary by race among men in 2007? Using a cross tabulated frequency distribution

A cross tabular frequency distribution of race and educational attainment would help us answer this question.

tab ed_level race, col

+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

                      | Equals 1 if non-Hisp white, 2 if non-Hisp
                      |   black, 3 if Hisp, and 4 if other race
             ed_level | White, No  Black, No   Hispanic      Other |     Total
----------------------+--------------------------------------------+----------
  High school dropout |     3,369        511      3,790        330 |     8,000
                      |      6.43       8.91      38.12       7.19 |     11.01
----------------------+--------------------------------------------+----------
          High school |    17,490      2,373      3,488      1,251 |    24,602
                      |     33.38      41.38      35.09      27.27 |     33.86
----------------------+--------------------------------------------+----------
         Some college |    12,467      1,483      1,409        878 |    16,237
                      |     23.79      25.86      14.17      19.14 |     22.34
----------------------+--------------------------------------------+----------
4 or more years of co |    19,078      1,367      1,254      2,128 |    23,827
                      |     36.41      23.84      12.61      46.39 |     32.79
----------------------+--------------------------------------------+----------
                Total |    52,404      5,734      9,941      4,587 |    72,666
                      |    100.00     100.00     100.00     100.00 |    100.00
Another way

tab race ed_level, row

+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

        Equals 1 if |
  non-Hisp white, 2 |
 if non-Hisp black, |
3 if Hisp, and 4 if |                  ed_level
         other race | High scho  High scho  Some coll  4 or more |     Total
--------------------+--------------------------------------------+----------
White, Non-Hispanic |     3,369     17,490     12,467     19,078 |    52,404
                    |      6.43      33.38      23.79      36.41 |    100.00
--------------------+--------------------------------------------+----------
Black, Non-Hispanic |       511      2,373      1,483      1,367 |     5,734
                    |      8.91      41.38      25.86      23.84 |    100.00
--------------------+--------------------------------------------+----------
           Hispanic |     3,790      3,488      1,409      1,254 |     9,941
                    |     38.12      35.09      14.17      12.61 |    100.00
--------------------+--------------------------------------------+----------
              Other |       330      1,251        878      2,128 |     4,587
                    |      7.19      27.27      19.14      46.39 |    100.00
--------------------+--------------------------------------------+----------
              Total |     8,000     24,602     16,237     23,827 |    72,666
                    |     11.01      33.86      22.34      32.79 |    100.00
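Both layouts report the same thing: the share of each race group at each education level. For readers without the CPS file, a sketch of how to reproduce the percentages from the printed counts alone uses Stata's immediate command tabi, with rows as education levels 1 to 4 and columns as race groups 1 to 4. The two display lines spot-check individual cells by hand.

```stata
* Cell counts typed in from the cross tabulation (rows separated by \)
tabi 3369 511 3790 330 \ 17490 2373 3488 1251 \ 12467 1483 1409 878 \ 19078 1367 1254 2128, col

* Column percentage = cell count / column total
* White, non-Hispanic high school dropouts: 3,369 of 52,404
display %5.2f 100*3369/52404
* Hispanic high school dropouts: 3,790 of 9,941
display %5.2f 100*3790/9941
```

The two spot checks should print 6.43 and 38.12, matching the first row of percentages in the table.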
Example 5: How does educational attainment vary by race? Using discrete histograms
• Stacked
hist new_ed, discrete percent by(race, col(1))
*** We always want to make sure the categories and Y-axis scales are the same.
[Figure: stacked discrete histograms, "Distribution of Educational Attainment By Race (male full-time, full-year workers, 2007)"; panels: White, Non-Hispanic; Black, Non-Hispanic; Hispanic; Other; y-axis in each panel: Percent (0 to 50); x-axis categories: High school dropout, High school, Some college, 4 or more years of college]
* The following edits were made in STATA’s graph editor to get to the graph shown above:
o Plotregion1 – plot1 – bar width set to 0.5 and color set to black
o Plotregion2 – plot1 – bar width set to 0.5 and color set to black
o Plotregion3 – plot1 – bar width set to 0.5 and color set to black
o xaxis1-xaxis4 – axis rule – range/delta checked and set to minimum value=1, maximum value=4, and delta=1
o xaxis3 – label properties – show labels checked, use value labels checked, angle set to 45 degrees
o yaxis1-yaxis4 – axis rule – range/delta checked and set to minimum value=0, maximum value=50, and delta=10 – labels set horizontal
o Title and subtitle added
o Bottom position title and note hidden
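As an alternative to repeating these edits panel by panel, the common settings can be given once as options, and by() then applies them to every panel. This is a sketch under assumptions: the data are rebuilt from the sixteen cell counts in the Example 4 cross tabulation, n is a hypothetical count variable used as a frequency weight, racel is a hypothetical label name, and gap(50) stands in for the 0.5 bar width.

```stata
clear
* One row per education-by-race cell, counts from Example 4
input ed_level race n
1 1  3369
1 2   511
1 3  3790
1 4   330
2 1 17490
2 2  2373
2 3  3488
2 4  1251
3 1 12467
3 2  1483
3 3  1409
3 4   878
4 1 19078
4 2  1367
4 3  1254
4 4  2128
end

label define ed_levell 1 "High school dropout" 2 "High school" ///
    3 "Some college" 4 "4 or more years of college"
label values ed_level ed_levell
label define racel 1 "White, Non-Hispanic" 2 "Black, Non-Hispanic" ///
    3 "Hispanic" 4 "Other"
label values race racel

* Same x and y scales in every panel; titles and note() go inside by()
hist ed_level [fweight=n], discrete percent gap(50) color(black)  ///
    xtitle("") xlabel(1(1)4, valuelabel angle(45))                ///
    ylabel(0(10)50, angle(horizontal))                            ///
    by(race, col(1) note("")                                      ///
       title("Distribution of Educational Attainment By Race")    ///
       subtitle("(male full-time, full-year workers, 2007)"))
```

Changing col(1) to a larger number of columns lays the same panels out side by side instead of stacked.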
• Side by Side

hist new_ed, discrete percent by(race, col(3))

* The following edits were made in STATA’s graph editor to get to the graph shown above:
o Plotregion1 – plot1 – bar width set to 0.75 and color set to black
o Plotregion2 – plot1 – bar width set to 0.75 and color set to black
o Plotregion3 – plot1 – bar width set to 0.75 and color set to black
o xaxis1-xaxis3 – axis rule – range/delta checked and set to minimum value=1, maximum value=4, and delta=1
o xaxis3 – label properties – show labels checked, use value labels checked, angle set to 45 degrees
[Figure: side-by-side discrete histograms, "Distribution of Educational Attainment By Race (male full-time, full-year workers, 2007)"; panels: White, Non-Hispanic; Black, Non-Hispanic; Hispanic; Other; y-axis: Percent (0 to 50); x-axis categories in each panel: High school dropout, High school, Some college, 4 or more years of college]
o Title and subtitle added
o Bottom position title (“new_ed”) and note hidden
Graphical Techniques for Interval Data
• Histogram – graphical display which shows the absolute or relative frequency associated with particular classes (intervals) of equal width.
o With a histogram we have to determine either the class width and start value, or the number of bins (classes) and start value.
o Start value – the value at which the first (left-most) class starts
o Number of bins (classes) – the number of classes

  Number of observations    Number of classes
  <50                       5-7
  50-200                    7-9
  200-500                   9-10
  500-1,000                 10-11
  1,000-5,000               11-13
  5,000-50,000              13-17
  >50,000                   17-20

o Class width – the width of each class

  Class width = (largest value − smallest value) / (number of classes)
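As a worked instance of the class width formula, the sketch below plugs in the minimum and maximum hourly wage reported in the summary output of Example 6 (4.077735 and 94.0815) and picks 18 classes from the 17-20 guideline for more than 50,000 observations. The scalar names are illustrative, and the choice of 18 is one option within the guideline range.

```stata
* Minimum and maximum of wage, copied from the Example 6 summary output
scalar wmin = 4.077735
scalar wmax = 94.0815

* Number of classes chosen from the 17-20 range in the guideline table
scalar k = 18

* Class width = (largest value - smallest value) / number of classes
scalar width_raw = (wmax - wmin)/k
display "Raw class width: " %6.4f width_raw

* Round to a convenient width and start value, then count the bins needed
scalar width = 5
scalar start = 0
scalar nbins = ceil((wmax - start)/width)
display "Bins needed with width 5 starting at 0: " nbins
```

The raw width comes out at about 5.0002, which rounds naturally to 5. Starting at 0 with a width of 5 takes 19 bins to reach the maximum wage, which is still inside the 17-20 guideline.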
Example 6: Use the CPS ORG data to create a histogram for male hourly wages in 2007

sum wage

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        wage |     72,666    25.18414    14.83449   4.077735    94.0815
The number of observations is larger than 50,000, so by the guidelines we should have 17-20 classes (bins). With 18 classes, the class width formula gives approximately (94.08 − 4.08)/18 = 90/18, which is about 5. We will select a class width of 5 and start at 0.

hist wage, start(0) width(5) percent
This is an example of a positively (or right) skewed histogram. The histogram could also be described as unimodal as it has one peak. In contrast a bimodal histogram has two peaks.
[Figure: histogram, "Distribution of Hourly Wages (male full-time, full-year workers, 2007)"; y-axis: Percent (0 to 20); x-axis: Hourly wages in $2014, ticks from 0 to 95 in steps of 5]
* The following edits were made in STATA’s graph editor to get to the graph shown above:
o Plotregion – plot1 – color set to black
o Title and subtitle added.
o Xaxis1 – title – advanced tab – Y offset set to -3
• Histogram with some grouped data – in the prior example the relative frequencies for wage categories above $70 are really low. One solution that would allow for more detail at lower wages, but less at higher wages, is to group the upper classes.
Example 7: Create a histogram with a class width of 5 where wages above $50 are grouped into one catch-all category. This allows for more detail at the bottom of the distribution without near-zero frequency categories at the top.

egen cwage=cut(wage), at(0(5)90)
replace cwage=cond(cwage>=50,50,cwage)
label define cwagel 0 "$0-$5" 5 "$5-$10" /*
    */ 10 "$10-$15" 15 "$15-$20" /*
    */ 20 "$20-$25" 25 "$25-$30" /*
    */ 30 "$30-$35" 35 "$35-$40" /*
    */ 40 "$40-$45" 45 "$45-$50" /*
    */ 50 "$50-$60" 60 "$60-$70" /*
    */ 70 "$70+"
label values cwage cwagel
tab cwage
hist cwage, discrete percent
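To make the grouping step concrete, here is a minimal sketch on six made-up wage values (illustrative numbers, not CPS data). The first part repeats the two commands from Example 7 and shows that everything at or above $50 ends up coded 50, including a wage above the last cut point, which egen cut() leaves missing and which cond() then catches because Stata treats missing as larger than any number. The second part is an assumption about intent: since the label set lists separate $50-$60, $60-$70, and $70+ categories, it shows one possible set of cut points that would produce those three codes. The variable cwage2 and the upper bound of 1000 are hypothetical.

```stata
clear
* Toy hourly wages chosen to sit near the cut points
input wage
4.08
12.5
49.99
50
72.3
94.08
end

* As in Example 7: 5-dollar classes, then everything from 50 up coded 50
egen cwage = cut(wage), at(0(5)90)
replace cwage = cond(cwage>=50, 50, cwage)
assert cwage == 50 if wage >= 50

* Hypothetical alternative matching the $50-$60, $60-$70, $70+ labels:
* 5-dollar classes up to 50, then breaks at 60 and 70, with a large top bound
egen cwage2 = cut(wage), at(0(5)50 60 70 1000)

list wage cwage cwage2
```

With the first coding the top category holds all wages of $50 or more; with the second, 72.3 and 94.08 are both coded 70 and a wage of exactly 50 is coded 50.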
[Figure: discrete histogram, "Distribution of Hourly Wages (male full-time, full-year workers, 2007)"; y-axis: Percent (0 to 20); x-axis: Hourly wage in $2014, with categories $0-$5, $5-$10, $10-$15, $15-$20, $20-$25, $25-$30, $30-$35, $35-$40, $40-$45, $45-$50, $50-$60, $60-$70, $70+]
• Using histograms to compare distributions of interval variables across groups
Example 8: Make a graph which allows us to compare the distribution of wages across education levels.
hist cwage, discrete percent /*
    */ by(ed_level, col(1))
*** Need to make sure the vertical scales and horizontal scales are the same
[Figure: stacked discrete histograms, "Distribution of Hourly Wage By Educational Attainment (male full-time, full-year workers, 2007)"; panels: High school dropout, High school, Some college, 4 or more years of college; y-axis in each panel: Percent (0 to 40); x-axis: Hourly wages in $2014, with categories $0-$5, $5-$10, $10-$15, $15-$20, $20-$25, $25-$30, $30-$35, $35-$40, $40-$45, $45-$50, $50-$60, $60-$70, $70+]