stata training manual

Upload: getachew-a-abegaz

Post on 14-Oct-2015


  • Introduction to Stata

    Contents

    SECTION 1: INTRODUCTION TO STATA
    SECTION 2: EXPLORING DATA FILES
    SECTION 3: STORING COMMANDS AND OUTPUT
    SECTION 4: CREATING NEW VARIABLES
    SECTION 5: MODIFYING VARIABLES
    SECTION 6: ADVANCED DESCRIPTIVE STATISTICS
    SECTION 7: PRESENTING DATA WITH GRAPH (GRAPHING DATA)
    SECTION 8: NORMALITY AND OUTLIER
    SECTION 9: STATISTICAL TESTS
    SECTION 10: LINEAR REGRESSION
    SECTION 11: LOGISTIC REGRESSION
    SECTION 12: PANEL DATA ANALYSIS
    SECTION 13: DATA MANAGEMENT
    SECTION 14: ADVANCED PROGRAMMING
    SECTION 15: TROUBLESHOOTING AND UPDATE

    SECTION 1: INTRODUCTION TO STATA

    Stata is a package that offers a good combination of ease of learning and power. It has numerous powerful yet simple commands for data management, which allow users to perform complex manipulations with ease. Under Stata/SE, a data file can hold up to 32,767 variables, and an estimation command can use up to 11,000 variables.

    Stata performs most general statistical analyses (regression, logistic regression, ANOVA, factor analysis, and some multivariate analysis). Its greatest strengths are probably in regression and logistic regression. Stata also has a very nice array of robust methods that are easy to use, including robust regression and regression with robust standard errors; many other estimation commands offer robust standard errors as well.

    Stata makes it easy to download programs developed by other users and to create your own Stata programs that seamlessly become part of Stata. Many cutting-edge statistical procedures have already been written by other users and can be incorporated into your own Stata programs. Stata uses one-line commands, which can be entered one at a time or many at a time in a Stata program.

    When you open Stata, you will see a menu bar across the top, a tool bar with buttons, and 3-5 windows (the number of windows open depends on which windows were open the last time Stata was used). Each is described briefly below.

    The Stata Interface

    1. Windows

    The Stata windows give you all the key information about the data file you are using, recent commands, and the results of those commands. Some of them open automatically when you start Stata, while others can be opened using the Windows pull-down menu or the buttons on the tool bar. These are the Stata windows:

    Stata Results           To see recent commands and output
    Stata Command           To enter a command
    Stata Browser           To view the data file (needs to be opened)
    Stata Editor            To edit the data file (needs to be opened)
    Stata Viewer            To get help on how to use Stata
    Variables               To see a list of variables
    Review                  To see recent commands
    Stata Do-file Editor    To write or edit a program (needs to be opened)

    The Command window on the bottom right is where you enter commands. When you press ENTER, they are pasted into the Stata Results window above, which is where you will see your commands executed and view the results. You can reuse recent commands with the Page Up key (to go to the previous command) and the Page Down key (to go to the next command).

    The Results window (with the black background) shows all recent commands, output, error messages, and help. The text is color-coded as follows:

    green     General information and the frame and headings of output tables
    blue      Commands or error messages that can be clicked on for more information
    white     Stata commands
    yellow    Numbers in output tables
    red       Error messages

    The slide bar on the right side can be used to look at earlier results that are not on the screen. However, unlike SPSS, the Stata Results window does not keep all the output generated. It keeps about 300-600 lines of the most recent output and deletes earlier output. If you want to store output in a file, you must use the log command (more on this later).

    Stata Browser

    This window shows all the data in memory. The Stata Browser does not appear automatically when you start Stata. To open it, click on the button with a table and magnifying glass (or type browse). Unlike SPSS, when the Stata Browser is open, you cannot execute any commands, either from the Stata Command window or from the Do-file Editor. You also cannot change any of the data. You can, however, sort the data or hide certain variables using the buttons at the top of the Stata Browser window.

    Stata Editor

    This window is exactly like the Stata Browser window except that you can change the data. We do not recommend using this window because you will have no record of the changes you make in the data. It is better to correct errors in the data using a Do-file program that can be saved.

    Stata Viewer

    This window provides help on Stata commands and rules. To open the Stata Viewer window, click on Windows/Viewer or click on the eye button on the tool bar. To use it, type a command in the space at the top, and the Viewer will give you the purpose and rules for using that command, along with some examples. Any blue text in the Viewer can be clicked on for more information about that command.

    Variables

    This window (tall with a white background) lists all the variables that exist in memory. When you open a Stata data file, it lists the variables in the file. If you create new variables, they are added to the list. If you delete variables, they are removed from the list. If you click on a variable, its name is pasted into the Stata Command window at the location of the cursor, which saves a little typing.

    Do-file Editor

    This window allows you to write, edit, save, and execute a Stata program. A Stata program (or Do-file) is simply a set of Stata commands written by the user. The advantage of using the Do-file Editor rather than the Stata Command window is that the Do-file allows you to save, revise, and rerun a set of commands. Exploratory analysis of the data can be done with the Stata Command window, but any serious data analysis should be carried out using the Do-file Editor. The Do-file Editor can be opened by clicking on Windows/Do-file Editor or by clicking on the envelope button.

    Review

    The Review window keeps a list of all the commands you have typed in the Stata session. Click on one, and it will be pasted into the Command window, which is handy for fixing typos. Double-click, and the command will be pasted and re-executed. You can also export everything in the Review window into a .do file (more on them later) so you can run the exact same commands at any time. To do this, right-click the Review window.

    Arranging the windows

    With so many windows, it is sometimes difficult to fit them all on the screen. When we first open Stata, all these windows are blank except for the Stata Results window. You can resize the 4 main windows independently, and you can resize the outer window as well. Adjust the size and position of each window the way you like it and then save the layout by clicking on Prefs/Save Windowing Preferences. Each time you open Stata, the windows will be arranged according to your preferred layout.

    Entering commands

    Entering commands in Stata works pretty much as you would expect. BACKSPACE deletes the character to the left of the cursor, DELETE deletes the character to the right, the arrow keys move the cursor around, and any text you type is inserted at the current location of the cursor. The up arrow does not retrieve previous commands, but you can do that by pressing PAGE UP or CTRL-R, or by using the Review window.

    2. Menus

    Stata displays 9 drop-down menus across the top of the outer window, from left to right:

    File
        Open                  open a Stata data file (use)
        Save/Save as          save the Stata data in memory to disk
        Do                    execute a do-file
        Filename              copy a filename to the command line
        Print                 print log or graph
        Exit                  quit Stata

    Edit
        Copy/Paste            copy text among the Command, Results, and Log windows
        Copy Table            copy table from Results window to another file
        Table copy options    what to do with table lines in Copy Table

    Prefs
        Various options for setting preferences. For example, you can save a particular layout of the different Stata windows or change the colors used in Stata windows.

    Data, Graphics, Statistics
        build and run Stata commands from menus

    User
        menus for user-supplied Stata commands (download from Internet)

    Window
        bring a Stata window to the front

    Help
        Stata command syntax and keyword searches

    3. Button bar

    The buttons on the button bar are, from left to right (the equivalent command follows the colon):

        Open a Stata data file: use
        Save the Stata data in memory to disk: save
        Print a log or graph
        Open a log, or suspend/close an open log: log
        Open a new viewer
        Bring Results window to front
        Bring Graph window to front
        New Do-file Editor: doedit
        Edit the data in memory: edit
        Browse the data in memory: browse
        Scroll another page when --more-- is displayed: Space Bar
        Stop current command or do-file: Ctrl-Break

    SECTION 2: EXPLORING DATA FILES

    2.1. Common Stata Syntax

    This section covers commands that are used for preliminary exploration of data in a file. Stata commands follow the same general syntax:

        [by varlist1:] command [varlist2] [if exp] [in range] [weight] [, options]

    Items inside square brackets are either optional or not available for every command. To use the by prefix, the dataset must first be sorted on the by variable(s); the prefix repeats the Stata command on each subset of the data.

    Logical operators used in Stata

        ~     not
        ==    equal
        ~=    not equal
        !=    not equal
        >     greater than
        >=    greater than or equal
        <     less than
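The syntax template and operators above can be tried out on a small dataset. The following is a minimal sketch, assuming Stata's bundled auto dataset (loaded with sysuse) as a stand-in for the manual's data file; the variables price, mpg, rep78, foreign, and make belong to that dataset, not to ERHScons1999.

```stata
* Assumption: the bundled auto dataset stands in for the manual's data file
sysuse auto, clear

* command [varlist] [if exp] [, options]
summarize price mpg if foreign == 1, detail

* >= and < combined with & (and)
count if mpg >= 20 & mpg < 30

* != and ~= are two spellings of "not equal"
count if rep78 != 3
count if rep78 ~= 3

* command [varlist] [in range]
list make price in 1/5
```

The two count commands that use != and ~= should report the same number, since the operators are synonyms.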

    Stata to estimate incorrect values of the variance and standard errors of estimates, and p-values for hypothesis tests.

    4. iweight, or importance weights, are weights that indicate the "importance" of the observation in some vague sense. iweights have no formal statistical definition; any command that supports iweights will define exactly how they are treated. In most cases, they are intended for use by programmers who need to implement their own analytical techniques by using some of the available estimation commands. Special care should be taken when using importance weights to understand how they are used in the formulas for estimates and variance. This information is available in the Methods and Formulas section in the Stata manual for each estimation command. In general, these formulas will be incorrect for computing the variance for data from a sample survey.

    2.2 Examining dataset

    clear

    The clear command deletes all data, variables, and labels from memory to get ready to use a new data file. You can clear memory using the clear command or by using the clear option of the use command (see the use command). This command does not delete any data saved to the hard drive.

    set memory

    First, you can check how much memory is allocated to hold your data using the memory command. For instance, we are now running Stata/SE 9 under Windows, and this is what the memory command told us.

    Figure 2: Working memory space

    . memory
                                                   bytes
    --------------------------------------------------------------------
    Details of set memory usage
        overhead (pointers)                        5,808        0.06%
        data                                     107,448        1.02%
                                            ----------------------------
        data + overhead                          113,256        1.08%
        free                                  10,372,496       98.92%
                                            ----------------------------
        Total allocated                       10,485,752      100.00%
    --------------------------------------------------------------------
    Other memory usage
        set maxvar usage                       1,816,666
        set matsize usage                      1,315,200
        programs, saved results, etc.              3,338
                                            ---------------
        Total                                  3,135,204
    -------------------------------------------------------
    Grand total                               13,620,956

    We have about 10MB free for reading in a data file. Whenever we try to read a data file bigger than the free space, we get the error message:

        no room to add more observations
        r(901);

    In this case we have to allocate more memory, say 25MB (if 25MB is sufficient for the current file), with the set memory command before trying to use the file.

    . set memory 25m

    Figure 3: Current memory allocation after the set memory 25m command

    Current memory allocation

                        current                                 memory usage
        settable          value     description                 (1M = 1024k)
        --------------------------------------------------------------------
        set maxvar         5000     max. variables allowed           1.733M
        set memory          25M     max. data space                 25.000M
        set matsize         400     max. RHS vars in models          1.254M
                                                                -----------
                                                                    27.987M

    Now that we have allocated enough memory, we will be able to read bigger files, provided that they fit within the specified memory space. After setting the memory space to 25m, the memory command reports:

    Figure 4: Adjusted working memory space

    . memory
                                                   bytes
    --------------------------------------------------------------------
    Details of set memory usage
        overhead (pointers)                        5,808        0.02%
        data                                     107,448        0.41%
                                            ----------------------------
        data + overhead                          113,256        0.43%
        free                                  26,101,136       99.57%
                                            ----------------------------
        Total allocated                       26,214,392      100.00%
    --------------------------------------------------------------------
    Other memory usage
        set maxvar usage                       1,816,666
        set matsize usage                      1,315,200
        programs, saved results, etc.              1,778
                                            ---------------
        Total                                  3,133,644
    -------------------------------------------------------
    Grand total                               29,348,036

    If we want to allocate 250m (250 megabytes) every time we start Stata, we can type:

    . set memory 250m, permanently

    Stata will then allocate this amount of memory every time it starts.

    use

    This command opens an existing Stata data file. The syntax is:

        use filename [, clear]
            opens new file
        use [varlist] [if exp] [in range] using filename [, clear]
            opens selected parts of file

    If there is no extension, Stata assumes it is .dta. If there is no path, Stata assumes the file is in the current folder. You can use a path name such as: use C:\...\ERHScons1999. If the path name has spaces, you must use double quotes: use "d:\my data\ERHScons1999". You can open selected variables of a file using a variable list. You can open selected records of a file using if or in.

    Here are some examples of the use command:

        use ERHScons1999
            opens the file ERHScons1999.dta for analysis
        use ERHScons1999 if q1a == 1
            opens data from region 1
        use ERHScons1999 in 5/25
            opens records 5 through 25 of file
        use hhid hhsize cons using ERHScons1999
            opens 3 variables from ERHScons1999 file
        use C:\training\ERHScons1999
            opens the file ERHScons1999.dta in the specified folder
        use "C:\data files\ERHScons1999"
            use quotation marks if there are spaces
        use ERHScons1999, clear
            clears memory before opening the new file

    When running a Do-file program, we should combine the use command with the clear option. For instance, here we load a raw data set from ERHScons1999. The clear option tells Stata to clear the previous data set from memory in order to load the new one.

    . use C:\...\ERHScons1999.dta, clear

    Without the clear option, Stata refuses to load a new file when the data in memory have unsaved changes, because it does not want you to lose the changes you made. If you really want to discard the changes in memory, the clear option specifies that it is okay to replace the data in memory, even though the current data have not been saved to disk.

    save

    The save command saves the dataset as a .dta file under the name you choose. Editing the dataset changes the data in the computer's memory; it does not change the data stored on the computer's disk.

    . save C:\...\consumption.dta, replace

    The replace option allows you to save a changed file to the disk, replacing the original file. Stata is worried that you will accidentally overwrite your data file, so you need the replace option to tell Stata that you know the file exists and you want to replace it.
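The interplay of use, clear, and save is easier to see in a short round trip. The following is a minimal sketch, assuming Stata's bundled auto dataset in place of the manual's file; auto_copy is a hypothetical file name written to the current folder.

```stata
* Assumption: the bundled auto dataset stands in for ERHScons1999;
* auto_copy is a hypothetical file name
sysuse auto, clear

* This changes only the copy of the data held in memory
drop if missing(rep78)

* Write the data in memory to auto_copy.dta, overwriting any earlier copy
save auto_copy, replace

* Empty memory; the file on disk is untouched
clear

* Reload three variables for the foreign cars only
use make price foreign if foreign == 1 using auto_copy, clear
describe
```

Note that in the second form of use, the if qualifier comes before using, as in the syntax diagram above.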

    edit

    This command opens the Data Editor window, which lets us view all the observations in memory. You can change the data using the Data Editor window, but we do not recommend using this window because you will have no record of the changes you make in the data. It is better to correct errors in the data using a Do-file program that can be saved (we will see Do-file programs later).

    browse

    This window is exactly like the Stata Editor window except that you can't change the data.

    Note: Unlike SPSS, when the Stata Editor or Browser is open, you cannot execute any commands, either from the Stata Command window or from the Do-file Editor. In the Browser, you also cannot change any of the data. You can, however, sort the data or hide certain variables using the buttons at the top of the Stata Browser window.

    describe

    This command provides a brief description of the data file. You can use des or d and Stata will understand. The output includes:

        the number of variables
        the number of observations (records)
        the size of the file
        the list of variables and their characteristics

    Example 1: Using describe to show information about a data file

    . des

    Contains data from C:\training\ERHSCONS1999.dta
      obs:         1,452
     vars:            15                          24 Feb 2007 07:07
     size:       113,256 (98.9% of memory free)   (_dta has notes)
    -----------------------------------------------------------------------------
                  storage  display    value
    variable name   type   format     label     variable label
    -----------------------------------------------------------------------------
    q1a             float  %9.0g      reg       Region
    q1b             double %15.0g     w         Wereda
    q1c             double %17.0g     pa        Peseant association
    q1d             double %12.0g               Household id
    sexh            byte   %8.0g      sexhh     Sex of household head
    ageh            float  %9.0g      p1s1q4    Age of household head
    cons            float  %9.0g                consumption per month
    food            float  %9.0g                food cons per month
    hhsize          byte   %8.0g                household size
    aeu             float  %9.0g                adult equivalent units in household
    fpi             float  %9.0g                food price index
    rconspc         float  %9.0g                real consumption per capita 1994 prices
    rconsae         float  %9.0g                real consumption per adult 1994 prices
    poor            double %8.2f
    hhid            double %12.0f               selected household unique id
    -----------------------------------------------------------------------------
    Sorted by: hhid

    It also provides the following information on each variable in the data file:

        the variable name
        the storage type: byte is used for binary variables, int is used for integers, and float is used for continuous variables that may have decimals. To see the limits on each storage type, type help datatypes.
        the display format indicates how the variable will appear in the output
        the value label is the name of a set of labels for different values
        the variable label is a name for the variable that is used in output

    list

    This command lists the values of variables in the data set. The syntax is:

        list [varlist] [if exp] [in range]

    With varlist, you can specify which variables' values will be presented. If varlist is not specified, all variables will be listed. With if and in, you can specify which records will be listed. Here are some examples:

    . list                          lists entire dataset

    . list in 1/10                  lists observations 1 through 10

    . list hhsize q1a food          lists selected variables

    . list hhsize sexh in 1/20      lists observations 1-20 for selected variables

    . list if q1a < 6               lists cases where region is 1 through 5

    if

    This qualifier is used to select certain records in carrying out a command. It is similar to the process if command in SPSS, except that in Stata it is not considered a separate command. The syntax is:

        command if exp

    Examples include:

    . list hhid q1a food if food>12000      lists data if food is above 12000
    . tab q1a if cons>10000 & cons [...]
    . [...]                                 browse data if food consumption is above 12000

    Note that if statements always use ==, not a single =. Also note that | indicates or while & indicates and.

    in

    We have also used in to select records based on the case number. The syntax is:

        command in range

    For example:

    . list in 10                lists observation number 10
    . summarize in 10/20        summarizes observations 10-20
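The if and in qualifiers can be combined with almost any command. The following is a minimal sketch, again assuming Stata's bundled auto dataset rather than the manual's file, so the variable names (make, price, mpg, foreign) are those of that dataset.

```stata
* Assumption: the bundled auto dataset stands in for the manual's data file
sysuse auto, clear

* & means "and": both conditions must hold
list make price if foreign == 1 & price > 10000

* | means "or": either condition may hold
count if mpg < 15 | mpg > 35

* in selects by observation number
list make in 10
summarize price in 10/20

* if and in can be used together
list make mpg if mpg > 25 in 1/40
```

Because in refers to positions in the current sort order, the records it selects change if the data are re-sorted, whereas if selects on values and is unaffected by sorting.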

    Example 2: Using list to look at data

    . list hhid q1a q1b q1c q1d hhsize rconspc in 10/25

         +-------------------------------------------------------------------+
         |         hhid      q1a     q1b       q1c   q1d   hhsize    rconspc |
         |-------------------------------------------------------------------|
     10. | 101010000010   Tigray   Atsbi   Haresaw    10        4   134.5961 |
     11. | 101010000011   Tigray   Atsbi   Haresaw    11        3   168.9437 |
     12. | 101010000012   Tigray   Atsbi   Haresaw    12        3   135.1815 |
     13. | 101010000013   Tigray   Atsbi   Haresaw    13        7   102.3454 |
     14. | 101010000014   Tigray   Atsbi   Haresaw    14        9   68.04964 |
         |-------------------------------------------------------------------|
     15. | 101010000015   Tigray   Atsbi   Haresaw    15       12   49.61188 |
     16. | 101010000016   Tigray   Atsbi   Haresaw    16        4   85.05015 |
     17. | 101010000017   Tigray   Atsbi   Haresaw    17        5   84.72104 |
     18. | 101010000018   Tigray   Atsbi   Haresaw    18        2   95.42028 |
     19. | 101010000019   Tigray   Atsbi   Haresaw    19       10   140.7843 |
         |-------------------------------------------------------------------|
     20. | 101010000020   Tigray   Atsbi   Haresaw    20        3   80.58356 |
     21. | 101010000021   Tigray   Atsbi   Haresaw    21        3   95.98959 |
     22. | 101010000022   Tigray   Atsbi   Haresaw    22        5   68.05075 |
     23. | 101010000023   Tigray   Atsbi   Haresaw    23        4    52.4964 |
     24. | 101010000024   Tigray   Atsbi   Haresaw    24        3   91.86269 |
         |-------------------------------------------------------------------|
     25. | 101010000025   Tigray   Atsbi   Haresaw    25        5   149.1702 |
         +-------------------------------------------------------------------+

    . list q1a cons aeu poor in 200/215

         +----------------------------------+
         |    q1a       cons     aeu   poor |
         |----------------------------------|
    200. | Amhara   661.3979    1.82   0.00 |
    201. | Amhara   321.7693    8.14   1.00 |
    202. | Amhara    169.784     2.3   0.00 |
    203. | Amhara   907.9995    3.14   0.00 |
    204. | Amhara   232.6273   4.148   1.00 |
         |----------------------------------|
    205. | Amhara   432.4525    6.86   1.00 |
    206. | Amhara      59.53    1.46   1.00 |
    207. | Amhara     228.22     3.4   0.00 |
    208. | Amhara   1298.875    5.44   0.00 |
    209. | Amhara    144.494    3.48   1.00 |
         |----------------------------------|
    210. | Amhara    266.974    4.28   0.00 |
    211. | Amhara   43.97179     .74   1.00 |
    212. | Amhara   216.0467   3.408   1.00 |
    213. | Amhara   492.4958    2.94   0.00 |
    214. | Amhara   437.7144    2.46   0.00 |
         |----------------------------------|
    215. | Amhara    166.354    1.74   0.00 |
         +----------------------------------+

    If you are not careful with list, you will get a lot more output than you want. If Stata starts giving you more output than you really want, use the stop button (red button with an X).

    codebook

    The codebook command is a great tool for getting a quick overview of the variables in the data file. It produces a kind of electronic codebook from the data file, displaying information about variables' names, labels, and values.

    Example 3: Using codebook to look at data

    . codebook sexh

                                                         Sex of household head
    ----------------------------------------------------------------------------
                      type:  numeric (byte)
                     label:  sexhh

                     range:  [0,1]                        units:  1
             unique values:  2                        missing .:  0/1452

                tabulation:  Freq.   Numeric  Label
                               400         0  Female
                              1052         1  Male

    . codebook rconspc

                                       real consumption per capita 1994 prices
    -----------------------------------------------------------------------------
                      type:  numeric (float)

                     range:  [4.2201104,1018.2954]        units:  1.000e-07
             unique values:  1448                     missing .:  3/1452

                      mean:   90.3674
                  std. dev:   81.9962

               percentiles:        10%       25%       50%       75%       90%
                               25.1043   39.9402   65.9926   114.253   180.891

    inspect

    This is another useful command for getting a quick overview of a data file. The inspect command displays information about the values of variables and is useful for checking data accuracy.

    Example 4: Using inspect to look at data

    . inspect sexh

    sexh:  Sex of household head               Number of Observations
                                          ----------------------------
                                                               Non-
                                          Total   Integers   Integers
    |     #         Negative                  -          -          -
    |     #         Zero                    400        400          -
    |     #         Positive               1052       1052          -
    |     #                               -----      -----      -----
    |  #  #         Total                  1452       1452          -
    |  #  #         Missing                   -
    +----------------------               -----
    0              1                       1452
      (2 unique values)

    sexh is labeled and all values are documented in the label.
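The overview commands in this section are often run one after another when a data file is first opened. The following is a minimal sketch of such a first pass, assuming Stata's bundled auto dataset in place of the manual's file; rep78 is a variable in that dataset that happens to contain missing values.

```stata
* Assumption: the bundled auto dataset stands in for the manual's data file
sysuse auto, clear

* Names, storage types, and labels of all variables
describe

* Range, unique values, missing values, and labels for two variables
codebook foreign rep78

* Counts of negative, zero, positive, and missing values
inspect mpg

* Number of observations with a missing repair record
count if missing(rep78)
```

Reading the missing counts from codebook against the count result is a quick consistency check before any analysis.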

    count

    The count command shows the number of observations that satisfy an if condition. If no condition is specified, count displays the number of observations in the data.

    . count
     1452

    . count if q1a==3
      466

    2.3. Preliminary Descriptive Statistics

    tabulate, tab1, tab2

    These are three related commands that produce frequency tables for discrete variables. They can produce one-way frequency tables (tables with the frequency of one variable) or two-way frequency tables (tables with a row variable and a column variable). These commands are similar to the frequency and crosstab commands in SPSS. How do they differ?

        tabulate or tab    produces a frequency table for one or two variables
        tab1               produces a one-way frequency table for each variable in the variable list
        tab2               produces all possible two-variable tables from the list of variables

    You can use several options with these commands:

        all       gives all the tests of association for two-way tables
        cell      gives the overall percentage for two-way tables
        column    gives column percentages for two-way tables
        row       gives row percentages for two-way tables
        nofreq    suppresses printing the frequencies
        chi2      provides the chi-squared test for two-way tables

    There are many other options, including other statistical tests. For more information, type help tabulate. Some examples of the tabulate commands are:

    . tabulate q1a
        produces a table of frequency by region
    . tabulate q1a sexh
        produces a cross-tab of frequencies by region and sex of head
    . tabulate q1a hhsize, row
        produces a cross-tab by region and hhsize with row percentages
    . tabulate sexh hhsize, cell nofreq
        produces a cross-tab of overall percent by sex and hhsize
    . tab1 q1a q1b hhsize
        produces three tables, a frequency table for each variable
    . tab2 q1a poor sexh
        produces three tables, a cross-tab of each pair of variables

    Example 5: Using tabulate on categorical variables

    . tab q1b

             Wereda |      Freq.     Percent        Cum.
    ----------------+-----------------------------------
              Atsbi |         84        5.79        5.79
       Sebhassahsie |         66        4.55       10.33
            Ankober |         86        5.92       16.25
    Basso na Worana |        175       12.05       28.31
            Enemayi |         61        4.20       32.51
             Bugena |        144        9.92       42.42
               Adaa |         95        6.54       48.97
              Kersa |         95        6.54       55.51
             Dodota |        109        7.51       63.02
         Shashemene |         97        6.68       69.70
              Cheha |         65        4.48       74.17
      Kedida Gamela |         74        5.10       79.27
               Bule |        134        9.23       88.50
             Boloso |         96        6.61       95.11
           Daramalo |         71        4.89      100.00
    ----------------+-----------------------------------
              Total |      1,452      100.00

    . tab q1b sexh

                    | Sex of household head
             Wereda |    Female       Male |     Total
    ----------------+----------------------+----------
              Atsbi |        48         36 |        84
       Sebhassahsie |        29         37 |        66
            Ankober |        13         73 |        86
    Basso na Worana |        52        123 |       175
            Enemayi |        11         50 |        61
             Bugena |        55         89 |       144
               Adaa |        23         72 |        95
              Kersa |        31         64 |        95
             Dodota |        26         83 |       109
         Shashemene |        26         71 |        97
              Cheha |        22         43 |        65
      Kedida Gamela |        15         59 |        74
               Bule |        11        123 |       134
             Boloso |        25         71 |        96
           Daramalo |        13         58 |        71
    ----------------+----------------------+----------
              Total |       400      1,052 |     1,452

    In one-way tables, Stata gives the count, the percentage, and the cumulative percentage (see the first example above).

    In two-way tables, Stata gives the count only, unless you ask for other statistics (see the second example above).

    The col, row, and cell options request Stata to include percentages in two-way tables.
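The percentage and test options described above can be compared side by side. The following is a minimal sketch, assuming Stata's bundled auto dataset rather than the manual's file; foreign and rep78 are categorical variables in that dataset.

```stata
* Assumption: the bundled auto dataset stands in for the manual's data file
sysuse auto, clear

* One-way table: count, percent, cumulative percent
tabulate foreign

* Two-way table with row percentages only (frequencies suppressed)
tabulate rep78 foreign, row nofreq

* Two-way table with column percentages and a chi-squared test
tabulate rep78 foreign, column chi2

* One frequency table per variable
tab1 foreign rep78

* Every pairwise cross-tab from the list, with overall percentages
tab2 foreign rep78, cell
```

With only two variables in its list, tab2 produces a single cross-tab; with three variables it would produce three.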

    summarize

    The summarize command produces statistics on continuous variables such as ageh, food, cons, and hhsize. The syntax looks like this:

        summarize [varlist] [if exp] [in range] [, detail]

    By default, it produces the following statistics:

        Number of observations
        Average (or mean)
        Standard deviation
        Minimum
        Maximum

    If you specify detail, Stata gives you additional statistics, such as skewness, kurtosis, the four smallest values, the four largest values, and various percentiles.

    Here are some examples:

    . summarize                             gives statistics on all variables

    . summarize hhsize food                 gives statistics on selected variables

    . summarize hhsize cons if q1a==3       gives statistics on two variables for one region

    Example 6: Using summarize to study continuous variables

    . sum rconspc rconsae hhsize

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
         rconspc |      1449    90.36742    81.99623    4.22011   1018.295
         rconsae |      1449    108.7874    97.27053   4.811201   1212.256
          hhsize |      1452    5.782369    2.740968          1         17

    . sum rconspc rconsae hhsize if q1a==4

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
         rconspc |       395    111.6185    99.09839   8.393298   1018.295
         rconsae |       395    132.6018    116.6133   9.608795   1212.256
          hhsize |       396    6.209596    2.853203          1         16

    The first example gives the statistics for the whole sample, while the second gives the statistics only for households in Region 4.
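The difference between the default output and the detail option is easiest to see by running both. The following is a minimal sketch, assuming Stata's bundled auto dataset in place of the manual's file. It also uses the stored results r(mean) and r(p50), which summarize leaves behind after it runs; these are not covered in this manual and are shown only as an illustration.

```stata
* Assumption: the bundled auto dataset stands in for the manual's data file
sysuse auto, clear

* Default output: Obs, Mean, Std. Dev., Min, Max
summarize price mpg

* detail adds percentiles, skewness, kurtosis, smallest and largest values
summarize price if foreign == 0, detail

* Stored results from the most recent summarize (p50 requires detail)
display "mean = " r(mean) "   median = " r(p50)
```

Comparing the mean with the median in this way gives a quick hint about skewness before looking at the full detail table.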

    by

    This prefix goes before a command and asks Stata to repeat the command for each value of a variable. The general syntax is:

        by varlist: command

    Note: the bysort prefix sorts the data and applies by in a single step, so it is the form most commonly used.

    An example of the by prefix:

    . bysort sexh: sum rconspc      for each sex of household head, gives statistics on real per capita consumption

    Example 7: Using the by prefix

    --------------------------------------------------------------------------
    -> sexh = Female

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
         rconspc |       398    100.2183    89.18895   7.068164   624.1437

    --------------------------------------------------------------------------
    -> sexh = Male

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
         rconspc |      1051    86.63701    78.82594    4.22011   1018.295
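The relationship between sort, by, and bysort can be shown in a few lines. The following is a minimal sketch, assuming Stata's bundled auto dataset rather than the manual's file; foreign and rep78 are grouping variables in that dataset.

```stata
* Assumption: the bundled auto dataset stands in for the manual's data file
sysuse auto, clear

* by requires the data to be sorted on the by variable first
sort foreign
by foreign: summarize mpg

* bysort does the sort and the by in one step
bysort foreign: summarize mpg

* More than one grouping variable is allowed
bysort foreign rep78: summarize price
```

The first two by-group runs produce the same output; bysort simply saves the separate sort command.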

    help

    The help command gives you information about any Stata command or topic. The syntax is:

        help [command]

    For example:

    . help tabulate       gives a description of the tabulate command
    . help summarize      gives a description of the summarize command

    SECTION 3: STORING COMMANDS AND OUTPUT

    In this section, we discuss how to store commands and output for later use. First, we describe how to store commands using a program (Stata calls it a Do-file), how to edit the program, and how to run it. Second, we present different ways of saving and using the output generated by Stata. The following topics are covered:

        Using the Do-file Editor
        log using
        log off
        log on
        log close
        set logtype
        How to move tables from Stata to Word and Excel

    Using the Do-file Editor

    The Do-file Editor allows you to store a program (a set of commands) so that you can edit it and execute it later. Why use the Do-file Editor?

        It makes it easier to check and fix errors,
        It allows you to run the commands later,
        It lets you show others how you got your result, and
        It allows you to collaborate with others on the analysis.

    In general, any time you are running more than 10 commands to get a result, it is easier and safer to use a Do-file to store the commands. To open the Do-file Editor, you can click on Windows/Do-file Editor or click on the envelope on the Tool Bar. Within the Do-file Editor, there is a menu bar and tool bar buttons to carry out a variety of editing functions. The menu bar is similar to the one in Microsoft Word:

    File/New            to open a new, blank Do-file
    File/Open           to open an existing Do-file
    File/Save           to save the current Do-file
    File/Save as        to save the current Do-file under a new name
    File/Insert file    to insert another file into the current one
    File/Print          to print the Do-file
    File/Close          to close the Do-file
    Edit/Undo           to undo the last command
    Edit/Cut            to delete or move the marked text in the Do-file
    Edit/Copy           to copy the marked text in the Do-file
    Edit/Paste          to insert the copied or cut text into the Do-file
    Search/Find         to find a word or phrase in the Do-file
    Search/Replace      to find and replace a word or phrase in the Do-file
    Tools/Do            to execute all the commands or the marked commands in the Do-file
    Tools/Run           to execute all the commands or the marked commands in the Do-file without showing any output in the Stata Results window

    The tool bar buttons can be used to carry out some of these tasks more quickly. For example, there are buttons for File/New, File/Open, File/Print, Search/Find, Edit/Cut, Edit/Copy, Edit/Paste, Edit/Undo, Do, and Run. Probably the button you will use most is the second-to-last one, which shows a page with text on it. This is the Do button for executing the program or the marked part of the program. Finally, the keyboard commands may be even quicker to use than the buttons. The most useful keyboard commands are:

    Control-O    Open file
    Control-S    Save file
    Control-C    Copy
    Control-X    Cut
    Control-V    Paste
    Control-Z    Undo
    Control-F    Find
    Control-H    Find and Replace

    To run the commands in a Do-file, you can click on the Do button (the second-to-last one) or click on Tools/Do. If you want to run one or just a few commands rather than the whole file, mark the commands and click on the Do button. You do not have to mark the whole command, but at least one character in the command must be marked in order for the command to be executed (unlike SPSS, it is not enough to have the cursor on a command).

    Although layout is a matter of personal preference, it may be useful to have the Stata Results window and the other windows on one side of the screen and the Do-file Editor window on the other. This makes it easy to switch back and forth. When you have arranged the windows the way you like, you can save the layout by clicking Prefs/Save Windowing Preferences. Each time you open Stata, it will use your chosen layout.

    Note: To add a note to a Do-file that Stata should not execute, enclose it between /* and */:

    /* This Stata program illustrates how to create a do file */
    log using C:\...\eeatraining.log, replace
    log close

    Saving the Output

    As mentioned in an earlier section, the Stata Results window does not keep all the output you generate. It only stores about 300-600 lines, and when it is full, it begins to delete the old results as you add new results. You can increase the amount of memory allocated to the Stata Results window, but even this will probably not be enough for a long session with Stata. Thus, we need to use log to save the output. There are four ways to control the log operations.

    1. You can use the log button on the tool bar. It looks like a scroll.
    2. You can click on File/Log to get four options: Begin (log using), Close, Suspend (log off), and Resume (log on).
    3. You can use log commands in the Stata Command window.
    4. You can use log commands in the Stata Do-file Editor.

    In this section, we describe the commands, which can be used in the Stata Command window or in a Do-file (program).

    log using

    This command creates a file with a copy of all the commands and output from Stata. The first time you open a log, you must give a name to the new file to be created. The syntax is:

    log using filename [, append replace [ text | smcl ] ]

    where filename is the name you give the new file. The options are:

    append     adds the output to an existing file
    replace    replaces an existing file with the output
    text       tells Stata to create the log file in text (ASCII) format
    smcl       tells Stata to create the log file in SMCL format

    Here are some examples:

    log using temp22                       saves output to a file called temp22
    log using temp20, replace              saves output to an existing file, temp20, replacing its contents
    log using regoutput, append            saves output to an existing file, regoutput, adding to its contents
    log using "d:\my data\myfile.txt"      saves output in the specified file in the specified folder

    Several points should be remembered in using this command:

    If you use an existing file name but do not say replace or append, Stata will give an error message that the file already exists.

    Log files in text format can be opened with Wordpad, Notepad, the DOS editor, or any word processor, but the file does not have any formatting.

    SMCL files have formatting (bold, colors, etc.) but can only be opened with Stata.

    SMCL format is the default.
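    A typical logging session, meant to be run from a Do-file, might look like the sketch below. The file name mysession is hypothetical and the auto dataset (shipped with Stata) stands in for the training data; log off, log on and log close are the commands described next.

```stata
* Close any log left open by an earlier run; capture ignores the error if none is open
capture log close

* Hypothetical file name; written to the current working directory as mysession.log
log using mysession, replace text

sysuse auto, clear
summarize price mpg        // copied to the log

log off                    // suspend logging
describe                   // not copied to the log
log on                     // resume logging in the same file

tabulate foreign           // copied to the log
log close                  // stop logging and save the file
```

    Starting the Do-file with capture log close is a common habit: it lets the file be re-run after an error without Stata complaining that a log is already open.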

    log off

    This command temporarily turns off the logging of output, so that any subsequent output is not copied to the log file. This is useful if you want to save some of the output but not all. log off only works after a log using command.

    log on

    This command restarts the logging, copying any new output to the log file that was already defined. log on only works after a log using and a log off command.

    log close

    This command turns off the logging and saves the file. How are log off and log close different? log off allows you to turn logging back on easily with log on, continuing to use the same log file. After a log close, however, the only way to start logging again is with log using.

    set logtype text

    This command tells Stata to always save log files in text (ASCII) format. It is the same as adding the text option to every log using command, but it is easier. If you prefer text format log files, this is the best way to make sure all the log files are in this format.

    set logtype smcl

    This command tells Stata to always save log files in SMCL format. It is the same as adding the smcl option to every log using command.

    Exercise 1: Exploring the ERHS

    This section includes some questions that you can answer using the r5ERHS files provided on your computer and the commands described in this section. Remember two tricks to make it easier to fix your mistakes:

    You can use PageUp to retrieve the most recent command.
    You can click on a variable in the Variable window to paste its name into the Command window.

    Summary file

    The file ERHScons1999 contains summary variables calculated from various other data files. It is at the household level. Open the file by entering use C:\training\ERHScons1999.dta, clear in the Command window and pressing Return. Open a Do-file and a log file to save your commands and output. Use the log file to copy and paste some of the output tables into Excel and Word files.

    1. How many variables and how many records are in ERHScons1999?
    2. What percentage of households have female heads?
    3. Is there a statistically significant difference between the percentage of female-headed households among poor and non-poor households?
    4. What percentage of Amhara households are considered poor?
    5. What percentage of households are in the SNNP region?
    6. How does the percentage of female-headed households vary by region?
    7. What is the average size of a household?
    8. What is the average size of a household in the Oromia region?
    9. How does household size vary with poverty status? (use the poor variable)

    Household members

    The file p1sec1_rv1 contains information about each member of the household. It is at the individual level (each record is a person). You can answer the following questions using this file:

    1. What percentage of individuals are female?
    2. What percentage of individuals over 45 years old are female?
    3. What percentage of individuals under 5 are female?
    4. What percentage of women are married?
    5. What percentage of women over the age of 18 are married?
    6. Does this percentage vary among regions?
    7. What is the status of individuals as compared to round 4?
    8. What are the reasons given for household members who have left since round 4?
    9. What was the major occupation of household heads?
    10. What was the major occupation of household members aged 7 to 15?

    Food and cash crops

    The file p2s1b_rv1 contains information on production of food and cash crops. The data are at the crop level, meaning that each record represents one crop for one household. Only crops that are grown by each household are included in the file. The crop codes and labels are given in the variable crop. You can answer the following questions with this file.

    1. How many households in the sample grow maize and wheat?
    2. Among maize growers, what was the average area with maize?
    3. Among maize growers, what was the average amount of maize harvested?
    4. Among wheat growers, what was the average amount of wheat harvested?
    5. Does the average amount of maize harvested vary among regions?
    6. Does the average amount of wheat harvested vary among regions?
    7. Among farmers with more than 1 hectare of maize, what was the average amount of maize harvested?
    8. What is the average amount harvested for the major cereal crops (teff, barley, wheat, maize and sorghum)?
    9. Farmers were asked "Was any of the land cultivated under the new extension program?" What was the average response?
    10. Farmers were also asked "Was any of the land cultivated irrigated?" and the percentage of the land irrigated. Explore both variables.

    SECTION 4: CREATING NEW VARIABLES

    In the previous sections, we described how to explore the data using existing variables. In this section, we discuss how to create new variables. When new variables are created, they are in memory and they will appear in the Data Browser, but they will not be saved on the hard disk unless you use the save command. In this section, we will cover the following commands and options.

    generate
    replace
    tab, generate
    egen
    operators
    functions
    recode
    xtile

    generate

    This command is used to create a new variable. It is similar to compute in SPSS. The syntax is:

    generate newvar = exp [if exp]

    where exp is an expression like price*quant or 1000*kg. Several points about this command:

    Unlike compute in SPSS, generate cannot be used to change the definition of an existing variable. If you want to change an existing variable, you need to use replace.

    You can use gen or g as an abbreviation for generate.

    If the expression is an equality or inequality, the variable will take the value 0 if the expression is false and 1 if it is true.

    If you use if, the new variable will have missing values when the if statement is false.

    For example,

    generate age2 = age*age                    creates an age squared variable
    gen yield = outputkg/area if area>0        creates a new yield variable if area is positive
    gen price = value/quant if quant>0         creates a new price variable if quant is positive
    gen highprice = (price>1000)               creates a dummy variable equal to 1 for high prices
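    The same patterns can be run on the auto example dataset that ships with Stata. This is a sketch under that assumption; the new variable names are made up for illustration. It also shows one point worth knowing: Stata treats a missing value as larger than any number, so a dummy built from an inequality equals 1 for missing values unless you exclude them.

```stata
* Example dataset shipped with Stata (an assumption; not the ERHS data)
sysuse auto, clear

* A squared term and a ratio computed only where the denominator is positive
generate weight2 = weight*weight
gen mpg_per_1000lb = mpg/(weight/1000) if weight > 0

* Dummy variable: 1 if the expression is true, 0 otherwise
gen byte expensive = (price > 10000)

* rep78 has missing values; without the if qualifier they would be coded 1
gen byte goodrep = (rep78 >= 4) if !missing(rep78)
tabulate goodrep, missing
```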


    replace

    This command is used to change the definition of an existing variable. The syntax is the same:

    replace oldvar = exp [if exp] [in range]

    Some points to remember:

    Replace cannot be used to create a new variable. Stata will give an error message if the variable does not exist.

    There is no abbreviation for replace. Stata wants to make sure you really want to change the variable.

    If you use the if option, then the old values will be retained when the if statement is false.

    You can use the period (.) to represent missing values.

    For example,

    replace price = avgprice if price > 100000      replaces high values with an average price
    replace income =. if income
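    A minimal sketch of replace, again assuming the auto dataset shipped with Stata; the thresholds and the variable names price_clean and mpg_cap are arbitrary choices for illustration.

```stata
sysuse auto, clear

* Work on copies so the original variables stay untouched
generate price_clean = price
generate mpg_cap = mpg

* Set implausibly high prices to missing
replace price_clean = . if price_clean > 15000

* Cap mpg at 30; the !missing() guard matters because missing counts as very large
replace mpg_cap = 30 if mpg_cap > 30 & !missing(mpg_cap)

summarize price price_clean mpg mpg_cap
```

    Stata reports how many real changes each replace made, which is a quick check that the condition selected the observations you expected.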


    Example 8: Using tab, gen to create dummy variables

    . tab q1a, gen(region)

         Region |      Freq.     Percent        Cum.
    ------------+-----------------------------------
         Tigray |        150       10.33       10.33
         Amhara |        466       32.09       42.42
         Oromia |        396       27.27       69.70
              7 |        139        9.57       79.27
              8 |        134        9.23       88.50
              9 |        167       11.50      100.00
    ------------+-----------------------------------
          Total |      1,452      100.00

    . tab region3

    q1a==Oromia |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |      1,056       72.73       72.73
              1 |        396       27.27      100.00
    ------------+-----------------------------------
          Total |      1,452      100.00
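    The same technique on the auto dataset shipped with Stata (an assumption; the stub name rep is arbitrary). The generate() option creates one dummy per category of the tabulated variable, numbered in the order the categories appear in the table.

```stata
sysuse auto, clear

* rep78 takes the values 1 to 5, so this creates rep1, rep2, ..., rep5
tabulate rep78, generate(rep)

* Inspect the new dummies; observations with missing rep78 are missing in each dummy
describe rep1-rep5
tabulate rep3, missing
```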

    egen

    This command (short for extended generate) creates a new variable by aggregating the existing data. It is a powerful and useful command that does not exist in SPSS. It adds summary statistics to each observation. To do the same thing in SPSS, you would need to create a new file with aggregate and merge it with the original file using match files. The syntax is:

    egen newvar = fcn(arguments) [if exp] [in range] , by(var)

    where newvar is the new variable to be created; fcn is one of numerous functions such as:

    count()     number of non-missing values
    diff()      compares variables, 1 if different, 0 otherwise
    fill()      fill with a pattern
    group()     creates a group id from a list of variables
    iqr()       interquartile range
    ma()        moving average
    max()       maximum value
    mean()      mean
    median()    median
    min()       minimum value
    pctile()    percentile
    rank()      rank
    rmean()     mean across variables
    sd()        standard deviation
    std()       standardize variables
    sum()       sums

    Note: in more recent versions of Stata, sum() is documented as total() and rmean() as rowmean().

    The argument is normally just a variable. The var in the by() option must be a categorical variable.

    Here are some other examples:

    egen avg = mean(yield)                   creates variable of average yield over entire sample
    egen avg2 = median(income), by(sex)      creates variable of median income for each sex
    egen regprod = sum(prod), by(region)     creates variable of total production for each region

    Example 9: Using egen to calculate averages

    . egen avecon=mean(cons), by(q1c)
    . gen highavecon=(cons>avecon)
    . list hhid q1c cons avecon highavecon in 650/675

         +----------------------------------------------------------------+
         |         hhid              q1c       cons     avecon   highav~n |
         |----------------------------------------------------------------|
    650. | 407070000039   Sirbana Godeti    673.582   940.6532          0 |
    651. | 407070000040   Sirbana Godeti     793.05   940.6532          0 |
    652. | 407070000041   Sirbana Godeti    985.257   940.6532          1 |
    653. | 407070000042   Sirbana Godeti    844.477   940.6532          0 |
    654. | 407070000043   Sirbana Godeti    946.014   940.6532          1 |
         |----------------------------------------------------------------|
    655. | 407070000044   Sirbana Godeti   2206.057   940.6532          1 |
    656. | 407070000045   Sirbana Godeti   570.0535   940.6532          0 |
    657. | 407070000046   Sirbana Godeti   1340.926   940.6532          1 |
    658. | 407070000047   Sirbana Godeti    901.222   940.6532          0 |
    659. | 407070000048   Sirbana Godeti    887.775   940.6532          0 |
         |----------------------------------------------------------------|
    660. | 407070000049   Sirbana Godeti   1026.795   940.6532          1 |
    661. | 407070000051   Sirbana Godeti   1392.845   940.6532          1 |
    662. | 407070000052   Sirbana Godeti    574.218   940.6532          0 |
    663. | 407070000053   Sirbana Godeti     363.63   940.6532          0 |
    664. | 407070000054   Sirbana Godeti    926.551   940.6532          0 |
         |----------------------------------------------------------------|
    665. | 407070000055   Sirbana Godeti   1256.021   940.6532          1 |
    666. | 407070000057   Sirbana Godeti    753.478   940.6532          0 |
    667. | 407070000058   Sirbana Godeti   1378.575   940.6532          1 |
    668. | 407070000059   Sirbana Godeti   1640.834   940.6532          1 |
    669. | 407070000060   Sirbana Godeti    472.841   940.6532          0 |
         |----------------------------------------------------------------|
    670. | 407070000062   Sirbana Godeti    721.425   940.6532          0 |
    671. | 407070000063   Sirbana Godeti   1341.702   940.6532          1 |
    672. | 407070000064   Sirbana Godeti     781.82   940.6532          0 |
    673. | 407070000065   Sirbana Godeti   1962.697   940.6532          1 |
    674. | 407070000070   Sirbana Godeti    945.045   940.6532          1 |
         |----------------------------------------------------------------|
    675. | 407070000071   Sirbana Godeti   1742.247   940.6532          1 |
         +----------------------------------------------------------------+

    In Example 9, we want to know which households have expenditure (cons) above the village average. First, we calculate the average expenditure for each village with the egen command. Then we create a dummy variable based on the expression (cons > avecon). The list output shows how the village average is repeated for every household in the village and confirms that the dummy variable is correctly calculated.
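    The two-step pattern of Example 9 (a group average with egen, then a dummy with generate) can be reproduced on the auto dataset shipped with Stata. This is only a sketch: foreign plays the role of the village variable and price the role of consumption, and the new variable names are made up.

```stata
sysuse auto, clear

* Step 1: attach the group mean to every observation in the group
egen avgprice = mean(price), by(foreign)

* Step 2: flag observations above their own group's mean
gen byte aboveavg = (price > avgprice)

* The group mean repeats within each group, as in Example 9
list make foreign price avgprice aboveavg in 1/5
```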


    operators This is not a Stata command, but a topic related to creating new variables. Most of the operators are obvious, but some are not. Unlike SPSS, you cannot use words like or, and, eq, or gt.

    Arithmetic
        +     addition
        -     subtraction
        *     multiplication
        /     division
        ^     power

    Relational
        >     greater than
        <     less than
        >=    greater than or equal
        <=    less than or equal
        ==    equal
        !=    not equal

    Logical
        &     and
        |     or
        !     not


    gen DDfemale = 0
    replace DDfemale = 1 if q1b==9 & sexh==0

    or an easier way to do this would be:

    gen DDfemale = (q1b==9 & sexh==0)

    Or suppose you wanted to create a dummy variable for households in the two regions (Amhara and Oromia). This variable can be created with:

    gen amaoro = 0
    replace amaoro = 1 if q1a==3 | q1a==4

    or by one command:

    gen amaoro = (q1a==3 | q1a==4)

    You can also combine conditions using parentheses. Suppose you wanted a dummy variable that indicates if a household is a poor household in either the Tigray or the Amhara region. We will define poor as being in the bottom 20 percent and use the variable poor.

    gen PDF = ((q1a==1 | q1a==3) & poor==1)

    Note: Here is a list of some of the more commonly used functions for creating new variables in Stata. Other functions can be found by typing help functions in the Stata Command window.

    abs(x)          computes the absolute value of x
    exp(x)          calculates e to the x power
    ln(x)           computes the natural logarithm of x
    log(x)          is a synonym for ln(x), the natural logarithm
    log10(x)        computes the log base 10 of x
    sqrt(x)         computes the square root of x
    invnorm(p)      provides the inverse cumulative normal; invnorm(norm(z)) = z
    normden(z)      provides the standard normal density
    normden(z,s)    provides the normal density; normden(z,s) = normden(z)/s if s>0 and s is not missing, otherwise the result is missing
    norm(z)         provides the cumulative standard normal
    group(x)        creates a categorical variable that divides the data into x subsamples as nearly equal in size as possible, numbering the first group 1, the second group 2, etc. It uses the current order of the data.
    int(x)          gives the integer obtained by truncating x
    round(x,y)      gives x rounded into units of y

    Note: in more recent versions of Stata, norm(), normden() and invnorm() are documented as normal(), normalden() and invnormal().
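    A few of these functions in use, as a sketch on the auto dataset shipped with Stata (an assumption; the new variable names are made up). The last line uses normal() and invnormal(), the names for the cumulative normal and its inverse in Stata 9 and later.

```stata
sysuse auto, clear

gen lnprice = ln(price)                  // natural logarithm
gen sqrtwt  = sqrt(weight)               // square root
gen price_k = round(price/1000, 0.1)     // price in thousands, rounded to units of 0.1
gen len_ft  = int(length/12)             // length in whole feet (truncated, not rounded)
gen absdev  = abs(mpg - 21)              // absolute deviation from 21 mpg

summarize lnprice sqrtwt price_k len_ft absdev

* The inverse cumulative normal undoes the cumulative normal
display invnormal(normal(1.5))
```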


    recode This command changes the values of a categorical variable according to the rules specified. It is like the recode command in SPSS except that in Stata you do not necessarily use parentheses. The syntax is:

    recode varname old=new old=new ... [if exp] [in range]

    Here are some examples:

    recode x 1=2              changes all values of x=1 to x=2
    recode x 1=2 3=4          changes 1 to 2 and 3 to 4
    recode x 1=2 2=1          exchanges the values 1 and 2 in x
    recode x 1=2 *=3          changes 1 in x to 2 and all other values to 3
    recode x 1/5=2            changes 1 through 5 in x to 2
    recode x 1 3 4 5 = 6      changes 1, 3, 4 and 5 to 6
    recode x .=9              changes missing to 9
    recode x 9=.              changes 9 to missing

    Notice that you can use some special symbols in the rules:

    *      means all other values
    .      means missing values
    x/y    means all values from x to y
    x y    means x and y

    For example, recode the region values 8 and 9 to 7.

    Example 10: Using recode to define a new variable

    . tab q1a

         Region |      Freq.     Percent        Cum.
    ------------+-----------------------------------
         Tigray |        150       10.33       10.33
         Amhara |        466       32.09       42.42
         Oromia |        396       27.27       69.70
              7 |        139        9.57       79.27
              8 |        134        9.23       88.50
              9 |        167       11.50      100.00
    ------------+-----------------------------------
          Total |      1,452      100.00

    . recode q1a 8 9=7
    (q1a: 301 changes made)

    . tab q1a

         Region |      Freq.     Percent        Cum.
    ------------+-----------------------------------
         Tigray |        150       10.33       10.33
         Amhara |        466       32.09       42.42
         Oromia |        396       27.27       69.70
              7 |        440       30.30      100.00
    ------------+-----------------------------------
          Total |      1,452      100.00
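    Because recode overwrites the variable, it can be safer to write the recoded values to a new variable with the generate() option and leave the original intact. A sketch on the auto dataset shipped with Stata (an assumption; rep3 is a made-up name), using the parenthesized form of the rules:

```stata
sysuse auto, clear

* Collapse the five repair categories into three; rep78 itself is unchanged
recode rep78 (1 2 = 1) (3 = 2) (4 5 = 3), generate(rep3)

* Cross-tabulate old against new to check the mapping; missing stays missing
tabulate rep78 rep3, missing
```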


    xtile This command creates a new variable that indicates which category a record falls into, when the sample is sorted by an existing variable and divided into n groups of equal size. It is probably easier to explain with examples. xtile can be used to create a variable that indicates which income quintile a household belongs to, which decile in terms of farm size, or which tercile in terms of coffee production. The syntax is:

    xtile newvar = variable [if exp] [in range] , nq(#)

    where newvar is the new categorical variable created; variable is the existing variable used to create the quantiles (e.g. income, farm size); # is the number of categories (e.g. 5 for quintiles, 3 for terciles)

    For example,

    xtile incquint = income, nq(5)
    xtile farmdec = farmsize, nq(10)
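    To see what xtile produces, the sketch below builds price quintiles on the auto dataset shipped with Stata (an assumption; pricequint is a made-up name) and then checks the range of prices that falls in each group.

```stata
sysuse auto, clear

* Divide the cars into five groups of roughly equal size, ordered by price
xtile pricequint = price, nq(5)

* Group sizes, then the price range and count within each quintile
tabulate pricequint
tabstat price, by(pricequint) statistics(min max n)
```

    The minimum of each group should sit just above the maximum of the group before it, which confirms the groups are ordered cut-offs of the same variable.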

    Suppose we want to create a variable indicating the deciles of expenditure per capita.

    Example 11: Using xtile to generate deciles (using the ERHS99cons data)

    . xtile rconseadec= rconsae,nq(10)

    . tab rconseadec

             10 |
      quantiles |
     of rconsae |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |        145       10.01       10.01
              2 |        145       10.01       20.01
              3 |        145       10.01       30.02
              4 |        145       10.01       40.03
              5 |        145       10.01       50.03
              6 |        145       10.01       60.04
              7 |        145       10.01       70.05
              8 |        145       10.01       80.06
              9 |        145       10.01       90.06
             10 |        144        9.94      100.00
    ------------+-----------------------------------
          Total |      1,449      100.00

    . tab rconseadec sexh,col nofre

            10 |
     quantiles | Sex of household head
    of rconsae |    Female       Male |     Total
    -----------+----------------------+----------
             1 |      7.79      10.85 |     10.01
             2 |     10.30       9.90 |     10.01
             3 |      8.04      10.75 |     10.01
             4 |     10.30       9.90 |     10.01
             5 |      8.79      10.47 |     10.01
             6 |     10.30       9.90 |     10.01
             7 |     10.55       9.80 |     10.01
             8 |     10.05       9.99 |     10.01
             9 |     10.05       9.99 |     10.01
            10 |     13.82       8.47 |      9.94
    -----------+----------------------+----------
         Total |    100.00     100.00 |    100.00

    Exercise 2

    1. Use the file ERHScons1999. Create a variable called reg4 which indicates whether a household is in the Oromia or other regions. Then do a frequency table of the new variable.

    2. Using the same file, create a variable called hhquint that indicates the quintile of household size. Then do a frequency table on the new variable.

    3. Using the same file, create a dummy variable called enbug that is equal to 1 if the household is in the Enemayi or Bugena weredas and 0 otherwise. Then do a frequency table on the new variable.

    4. Create a new variable avgexp which is equal to the wereda average of food expenditure (food). Then calculate a new variable equal to the difference between the household food expenditure and the wereda average expenditure.

    5. Using the same file, create a new variable splot which is 1 if the person is cultivating single plots and 0 otherwise.

    6. Use file p1sec1_rv1. Create a set of dummy variables called relatxx based on the relationship of the person to the household head. For example, relat01 is a dummy for being the head, relat02 is a dummy for being the spouse, relat03 for a child, and so on.

    SECTION 5: MODIFYING VARIABLES

    In this section, we introduce some more powerful and flexible commands for generating results from survey data. We begin with an explanation of how to label data in Stata. Then we see how to format variables. These are the topics and commands covered in this section:

    rename variable
    label variable
    label define
    label values
    format variable

    rename variables

    This command is used to give a variable a new name. The command is

    . rename old_variable new_variable

    For instance, generate regional dummy variables and then rename them:

    Example 12: Renaming variables

    . tab q1a, gen(index)

         Region |      Freq.     Percent        Cum.
    ------------+-----------------------------------
         Tigray |        150       10.33       10.33
         Amhara |        466       32.09       42.42
         Oromia |        396       27.27       69.70
           SNNP |        440       30.30      100.00
    ------------+-----------------------------------
          Total |      1,452      100.00

    . tab index1

    q1a==Tigray |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |      1,302       89.67       89.67
              1 |        150       10.33      100.00
    ------------+-----------------------------------
          Total |      1,452      100.00

    . tab index2

    q1a==Amhara |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |        986       67.91       67.91
              1 |        466       32.09      100.00
    ------------+-----------------------------------
          Total |      1,452      100.00

    rename index1 Tigray      renames the index1 variable to Tigray
    rename index2 Amhara      renames the index2 variable to Amhara
    rename index3 Oromia      renames the index3 variable to Oromia
    rename index4 SNNP        renames the index4 variable to SNNP

    label variable

    This command is used to attach labels to variables in order to make the output easier to understand. For example, we know that Tigray is region 1 and SNNP is region 7. So we may want to label the variables as follows:

    label variable Tigray "Region 1"
    label variable Amhara "Region 3"
    label variable Oromia "Region 4"
    label variable SNNP "Region 7"

    You can use the abbreviation label var.

    If there are spaces in the label, you must use double quotation marks. If there are no spaces, quotation marks are optional.

    This command is like variable label in SPSS except that you can only label one variable per command and Stata uses double quotation marks, not single.

    The limit is 80 characters for a label, but any labels over 30 characters will probably not look good in a table.

    label define

    This command gives a name to a set of value labels. For example, instead of numbering the regions, we can assign a label to each region. Instead of numbering the different sources of income, we can give them labels. The syntax is:

    label define lblname # "label" # "label" ... [, add modify]

    where
        lblname    is the name given to the set of value labels
        #          are the value numbers
        "label"    are the value labels
        add        means that you want to add these value labels to the existing set
        modify     means that you want to change these values in the existing set

    Note that:

    You can use the abbreviation label def.

    The double quotation marks are only necessary if there are spaces in the labels.

    Stata will not let you define an existing label unless you say modify or add.

    This command is similar to value label in SPSS except that in Stata you give the labels a name and later attach it to the variable, while in SPSS you attach it to the variable in the same command.


    label values This command attaches named set of value labels to a categorical variable. The syntax is:

    label values varname [lblname] [, nofix]

    where
        varname    is the categorical variable which will get the labels
        lblname    is a set of labels that have already been defined by label define

    Here are some examples of labeling values in Stata.

    label define reg 1 "Tigray" 3 "Amhara" 4 "Oromia" 7 "SNNP", modify
    label values q1a reg
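    The three labeling steps fit together as in the sketch below. It builds a tiny dataset by hand so that it runs without the ERHS files; the variable names and the label name reglbl are made up, and the region codes follow the ones used above.

```stata
* A five-observation dataset typed in directly (hypothetical values)
clear
input id region
1 1
2 3
3 4
4 7
5 3
end

* Step 1: describe the variable itself
label variable region "Region of residence"

* Step 2: define a named set of value labels
label define reglbl 1 "Tigray" 3 "Amhara" 4 "Oromia" 7 "SNNP"

* Step 3: attach the set to the variable
label values region reglbl

* The table now shows region names instead of codes
tabulate region
label list reglbl
```

    Keeping the definition separate from the attachment means the same label set can be attached to several variables that share a coding scheme.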

    Some additional commands that may be useful in labeling

    label dir           to request a list of existing label names
    label list          to request a list of all the existing value labels
    label drop          to delete one or more labels
    label save using    to save label definitions as a Do-file
    label data          to give a label to a data file

    format

    The format command allows you to specify the display format for variables. The internal precision of the variables is unaffected. The syntax for the format command is

    . format varlist %fmt

    where %fmt is one of the formats listed below:

    %fmt        description                               example
    ---------------------------------------------------------------
    Right-justified formats
      %#.#g     general numeric format                    %9.0g
      %#.#f     fixed numeric format                      %9.2f
      %#.#e     exponential numeric format                %10.7e
      %d        default numeric elapsed date format       %d
      %d...     user-specified elapsed date format        %dM/D/Y
      %#s       string format                             %15s
    Right-justified, comma formats
      %#.#gc    general numeric format                    %9.0gc
      %#.#fc    fixed numeric format                      %9.2fc
    Leading-zero formats
      %0#.#f    fixed numeric format                      %09.2f
      %0#s      string format                             %015s
    Left-justified formats
      %-#.#g    general numeric format                    %-9.0g
      %-#.#f    fixed numeric format                      %-9.2f
      %-#.#e    exponential numeric format                %-10.7e
      %-d       default numeric elapsed date format       %-d
      %-d...    user-specified elapsed date format        %-dM/D/Y
      %-#s      string format                             %-15s
    Left-justified, comma formats
      %-#.#gc   general numeric format                    %-9.0gc
      %-#.#fc   fixed numeric format                      %-9.2fc
    Centered formats
      %~#s      string format (special)                   %~15s
    ---------------------------------------------------------------

    Exercise 3

    1. Using the variables newly created in Exercise 2, label the variables and their values.
    2. Label the data file with "This data is used for training".
    3. List the existing label names.

    SECTION 6: ADVANCED DESCRIPTIVE STATISTICS

    In Section 2, we looked at preliminary descriptive statistics, mostly applied to explore the nature of the data. In this section, we explore more advanced statistics.

    tabulate summarize

    This command creates one- and two-way tables that summarize continuous variables. The command tabulate by itself gives frequencies and percentages in each cell (cross-tabulations). With the summarize option, we can show means and other statistics of a continuous variable. The syntax is:

    tabulate varname1 varname2 [if exp] [in range], summarize(varname3) options

    where
        varname1    is a categorical row variable
        varname2    is a categorical column variable (optional)
        varname3    is the continuous variable summarized in each cell
        options     can be used to tell Stata which statistics you want

    Some notes regarding this command:

    The default statistics are the mean, the standard deviation, and the frequency.

    You can specify which statistics to show with the options means, standard and freq.

    You can use the abbreviation tab ..., sum( ).

    Some examples:

    tab q1a, sum(cons)           gives the mean, std deviation, and frequency of per capita expenditure for each region
    tab q1b, sum(cons) mean      gives the mean consumption for each wereda
    tab q1a sexh, sum(food)      gives the mean, std deviation, and frequency of food for each combination of region and sex of household head


    The first table is a one-way table (just one categorical variable) showing the mean, standard deviation, and frequency of per capita expenditure for each region. In the second table, we use the mean option so only mean per capita expenditure is shown. In the third table, we add a second categorical variable (sexh), making it a two-way table.

    Although we can request all the default statistics in a two-way table, as is done here, this makes the table difficult to read, so we do not advise it.
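    The three kinds of table can be reproduced on the auto dataset shipped with Stata. This is a sketch under that assumption: foreign and rep78 stand in for the categorical variables and price and mpg for the continuous ones.

```stata
sysuse auto, clear

* One-way table: mean, standard deviation and frequency of price by origin
tabulate foreign, summarize(price)

* Same table showing only the means
tabulate foreign, summarize(price) means

* Two-way table restricted to means, which keeps it readable
tabulate rep78 foreign, summarize(mpg) means
```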

    Example 13: Using tabulate, sum() to generate tables

    . tab q1a, sum(cons)

                |  Summary of consumption per month
         Region |        Mean   Std. Dev.       Freq.
    ------------+------------------------------------
         Tigray |   413.93552     297.701         149
         Amhara |   545.91653   467.28072         465
         Oromia |   697.09029   478.55749         395
           SNNP |    331.7384   221.15601         440
    ------------+------------------------------------
          Total |   508.51838    420.4014        1449

    . tab q1b, sum(cons) mean

                |  Summary of
                | consumption
                |  per month
         Wereda |        Mean
    ------------+------------
          Atsbi |   417.16834
      Sebhassah |      409.87
        Ankober |   301.87563
       Basso na |   777.31823
        Enemayi |     234.392
         Bugena |   542.38657
           Adaa |   940.65322
          Kersa |   567.89355
         Dodota |   526.58473
      Shashemen |   775.34926
          Cheha |   342.54209
      Kedida Ga |   239.09955
           Bule |   379.28676
         Boloso |   266.93705
       Daramalo |   416.28045
    ------------+------------
          Total |   508.51838

    . tab q1a sexh, sum(cons)

    Means, Standard Deviations and Frequencies of consumption per month

               | Sex of household head
        Region |    Female       Male |     Total
    -----------+----------------------+----------
        Tigray | 342.44136   488.3678 | 413.93552
               | 277.62091  301.46008 |   297.701
               |        76         73 |       149
    -----------+----------------------+----------
        Amhara | 450.61424  582.89951 | 545.91653
               | 368.60452  495.93838 | 467.28072
               |       130        335 |       465
    -----------+----------------------+----------
        Oromia | 610.49528  728.85178 | 697.09029
               | 518.32024  459.98768 | 478.55749
               |       106        289 |       395
    -----------+----------------------+----------
          SNNP | 271.02927  346.48695 |  331.7384
               | 171.91652  229.33158 | 221.15601
               |        86        354 |       440
    -----------+----------------------+----------
         Total |  433.7347  536.83799 | 508.51838
               | 389.69001  428.24021 |  420.4014
               |       398       1051 |      1449

    tabstat

    This command gives summary statistics for a set of continuous variables for each value of a categorical variable. The syntax is:

    tabstat varlist [if exp] [in range], stat(statname [...]) by(varname)

    where

    varlist is a list of continuous variables
    statname is a type of statistic
    varname is a categorical variable

    Some facts about this command:

    The default statistic is the mean.

    Optional statistics include mean, sum, max, min, range, sd (standard deviation), var (variance), skewness, kurtosis, median, and pn (nth percentile).

    Without the by() option, tabstat is like summarize except that it allows you to specify the list of statistics to be displayed.

    With the by() option, tabstat is like "tabulate, summarize()" except that tabstat is more flexible in the statistics and format.

    It is very similar to the SPSS command means.

    Examples

    tabstat food hhsize, stats(mean max min)     gives mean, max, and min of food & hhsize

    tabstat food hhsize, by(q1a)                 gives mean of two variables for each region

    tabstat food, stats(median) by(q1a)          gives the median food consumption for each region

    The tabstat command displays summary statistics for a series of numeric variables in a single table.
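    As a quick way to experiment with these forms, the following sketch uses the auto dataset bundled with Stata rather than the survey file; price, mpg, and foreign are variables of that dataset and are assumptions for illustration only.

    ```stata
    * Illustrative only: auto.dta stands in for the survey data
    sysuse auto, clear

    * Without by(): chosen statistics for two variables
    tabstat price mpg, statistics(mean sd min max)

    * With by(): one row per group, one column per statistic
    tabstat price, statistics(mean p50 sd cv) by(foreign) columns(statistics)
    ```

    The columns(statistics) option puts the statistics across the top, which gives the same layout as Example 14.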


    Example 14: Using tabstat to create a table

    . tabstat rconsae, s(mean p50 sd cv min max) by(rconseadec) missing

    Summary for variables: rconsae
         by categories of: rconseadec (10 quantiles of rconsae)

    rconseadec |      mean       p50        sd        cv       min       max
    -----------+------------------------------------------------------------
             1 |  21.80935   21.9194  5.773654   .264733  4.811201  30.40175
             2 |  36.24088  36.03099  3.400392  .0938275   30.6191  42.70621
             3 |  48.52454  48.31921   3.09388  .0637591  42.74319  53.91997
             4 |  60.38483   60.0903  3.811244  .0631159  54.00354  66.85229
             5 |  73.09496  72.92955   3.61339  .0494342  66.90016  79.38206
             6 |   89.3758  89.33151  5.708862  .0638748  79.39233  99.11871
             7 |   110.407  110.2909  6.692319   .060615  99.12563  122.8186
             8 |  137.7846  137.5525  9.298181  .0674835  123.5698  154.9666
             9 |  179.5007  176.1209  17.33479  .0965723  155.0732  214.4674
            10 |  332.2927  285.4411  135.2309  .4069633  214.4888  1212.256
             . |         .         .         .         .         .         .
    -----------+------------------------------------------------------------
         Total |  108.7874  79.38206  97.27053  .8941343  4.811201  1212.256
    ------------------------------------------------------------------------

    table

    This command creates a wide variety of tables. It is probably the most flexible and useful of all the table commands in Stata. The syntax is:

    table rowvar colvar [if exp] [in range], c(clist) [row col]

    where

    rowvar is the categorical row variable
    colvar is the categorical column variable
    clist is a list of statistics and variables
    row is an option to include a summary row
    col is an option to include a summary column

    Some useful facts about this command:

    The default statistic is the frequency.

    Optional statistics are mean, sd, sum, rawsum (unweighted), count, max, min, median, and pn (nth percentile).

    The c( ) is short for contents of each cell.

    Like tab, it can be used to create one- and two-way frequency tables, but table cannot do percentages.

    Like tabulate, summarize(), it can be used to calculate basic stats for each value of a categorical variable. Its advantage over tabulate, summarize() is that it can do more statistics and it can take more than one continuous variable.

    Like tabstat, it can be used to calculate advanced stats for each value of a categorical variable. Its advantage over tabstat is that it can do two-way (and higher) tables, but its disadvantage is that it has fewer statistics.

    It is similar to table in SPSS, but easier to learn and less flexible in formatting.
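    The syntax described here is that of the table command as it existed before Stata 17. In Stata 17 and later the command was redesigned and, as far as we are aware, cell contents are requested with statistic() rather than c(). The sketch below shows both forms on the bundled auto dataset (not the survey data); treat the version prefix in the second command as an assumption to check against the help file of your release.

    ```stata
    * Illustrative only: auto.dta stands in for the survey data
    sysuse auto, clear

    * Stata 17 or later: mean price by repair record (rows) and origin (columns)
    table (rep78) (foreign), statistic(mean price)

    * Older syntax, run under version control in newer releases
    version 16: table rep78 foreign, contents(mean price) row col
    ```

    In releases before Stata 17 only the second form applies, and the version prefix is not needed.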

    Here are some examples:

    table q1a, row                                     table of frequencies by region with total row

    table q1a, c(mean income)                          table of average income by region

    table q1a, c(mean yield sd yield median yield)     table of yield statistics by region

    table q1a, c(mean yield) format(%9.2f)             table of average yields by region with format

    table q1a sexh, c(mean yield)                      table of average yield by region and sex

    table q1a sexh, c(mean income mean yield)          table of avg yield & income by region & sex

    Some output from table commands is shown in Example 15. The table command calculates and displays tables of statistics, including frequency, mean, standard deviation, sum, and 1st to 99th percentiles. The row and col options add a row and a column to the table showing the totals across rows and columns.

    Example 15: Tabulate median real per capita consumption by region vs sex of household head

    . table q1a sexh, contents(p50 rconsae) row col missing

              |    Sex of household head
       Region |   Female      Male     Total
    ----------+-----------------------------
       Tigray | 73.05909  74.20448  73.56232
       Amhara | 124.9734  95.00103  104.7363
       Oromia | 98.59296  99.43469  98.75433
         SNNP | 53.73735  50.34177  51.14911
              |
        Total | 90.04483  77.18623  79.38206

    . table rconseadec, c(mean rconsae)

    10        |
    quantiles |
    of        |
    rconsae   | mean(rconsae)
    ----------+--------------
            1 |      21.80935
            2 |      36.24088
            3 |      48.52454
            4 |      60.38483
            5 |      73.09496
            6 |       89.3758
            7 |       110.407
            8 |      137.7846
            9 |      179.5007
           10 |      332.2927

    Exercise 4

    1. Use ERHScons1999 and tabulate basic summary statistics showing the mean, standard deviation, and frequency of per capita food consumption for each village. Interpret the result.

    2. Repeat the same procedure as in question 1, but report only the median of food consumption.

    3. Tabulate basic summary statistics for food consumption by sex of household head and region (use a single table).

    4. Tabulate the mean, 25th percentile, median, 75th percentile, sd, cv, min, and max of real food consumption per capita by deciles of real consumption per capita.

    5. Tabulate median real food consumption per capita by sex of household head and deciles of real consumption per capita (use a single table).

    SECTION 7: PRESENTING DATA WITH GRAPH (GRAPHING DATA)

    This section provides a brief introduction to creating graphs. In Stata, all graphs are made with the graph command, but there are 8 types of charts and numerous subcommands for controlling the type and format of the graph. In this section, we focus on four types of graph and a few options. The commands that draw graphs are

    graph twoway    scatterplots, line plots, etc.
    graph matrix    scatterplot matrices
    graph bar       bar charts
    graph dot       dot charts
    graph box       box-and-whisker plots
    graph pie       pie charts

    Graph commands can also be used to produce histograms, box plots, kernel density plots, P-P plots, and Q-Q plots, but we postpone these until the discussion of normality later.

    Let us first acquaint ourselves with some twoway graph commands. A two-way scatterplot can be drawn using the (graph) twoway scatter command to show the relationship between two variables, cons (total consumption) and food (food consumption). As we would expect, there is a positive relationship between the two variables.

    . graph twoway scatter cons food

    [Figure: scatterplot of consumption per month (y-axis) against food cons per month (x-axis)]

    We can show the regression line predicting cons from food using twoway lfit.

    . twoway lfit cons food

    [Figure: fitted values (y-axis) against food cons per month (x-axis)]

    Two graphs can be overlaid by putting each plot in its own parentheses, here with household size on the x axis:

    . twoway (scatter cons hhsize) (lfit cons hhsize)

    [Figure: scatterplot of consumption per month against household size (x-axis), with fitted values overlaid]
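    An overlaid graph usually needs a title and clearer axis and legend labels before it goes into a report. The sketch below is one way to add them, shown on the bundled auto dataset (an assumption for illustration; price and weight are not survey variables). The /// line continuation works in a do-file, not when typing in the Command window.

    ```stata
    * Illustrative only: auto.dta stands in for the survey data
    sysuse auto, clear

    * Scatterplot with a fitted line, plus title, axis titles, and legend labels
    twoway (scatter price weight) (lfit price weight), ///
        title("Price against weight") ///
        ytitle("Price") xtitle("Weight") ///
        legend(order(1 "Observed" 2 "Fitted values"))
    ```

    The legend(order()) option relabels the two plots in the order in which they appear inside the parentheses.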

    Exercise 5: Draw a two-way scatterplot with a fitted line for consumption per capita vs household size and explain its pattern.


    SECTION 8: NORMALITY AND OUTLIER

    Check for Normality

    An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. We must be extremely mindful of possible outliers and their adverse effects during any attempt to measure the relationship between two continuous variables. There are no official rules for identifying outliers. In a sense, this definition leaves it up to the analyst (or a consensus process) to decide what will be considered abnormal. Sometimes it is obvious that an outlier is simply miscoded (for example, age reported as 230) and hence should be set to missing, but most of the time this is not the case. Before abnormal observations can be singled out, it is necessary to characterize normal observations.

    Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point. The skewness of a normal distribution is zero, and any symmetric data should have a skewness near zero. Negative values of skewness indicate data that are skewed left, and positive values indicate data that are skewed right. By skewed left, we mean that the left tail is heavier than the right tail. Similarly, skewed right means that the right tail is heavier than the left tail.

    Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. That is, data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails. Data sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak; a uniform distribution would be the extreme case. The normal distribution has a kurtosis of 3, that is, an excess kurtosis (kurtosis minus 3) of zero. Stata's summarize command reports kurtosis on the first scale, so a value above 3 indicates a "peaked" distribution and a value below 3 indicates a "flat" distribution. A value of 6 or larger on the true kurtosis indicates a large departure from normality.
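    For reference, the two statistics can be written in terms of the central moments of the data. This is a sketch of the standard moment-based definitions; the symbols m_r, x_i, x-bar, and n are introduced here and do not appear in the original text. On this scale a normal distribution has skewness 0 and kurtosis 3, which is the scale on which summarize with the detail option reports kurtosis.

    ```latex
    m_r = \frac{1}{n}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^{r},
    \qquad
    \text{skewness} = \frac{m_3}{m_2^{3/2}},
    \qquad
    \text{kurtosis} = \frac{m_4}{m_2^{2}}
    ```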
    We can obtain skewness and kurtosis values by using the detail option of the summarize command. Clearly, the variable rconspc (real consumption per capita) is skewed to the right and has a peaked distribution. Both statistics indicate that the distribution of rconspc is far from normal.

    . sum rconspc

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
         rconspc |      1449    90.36742    81.99623    4.22011   1018.295

    . sum rconspc, detail

              real consumption per capita 1994 prices
    -------------------------------------------------------------
          Percentiles      Smallest
     1%     11.65814        4.22011
     5%     18.67906       6.865227
    10%     25.10425       7.068164       Obs                1449
    25%     39.94022       8.201794       Sum of Wgt.        1449

    50%     65.99258                      Mean           90.36742
                            Largest       Std. Dev.      81.99623
    75%     114.2533       577.1937
    90%     180.8909       624.1437       Variance       6723.382
    95%     236.1537       660.1689       Skewness       3.212314
    99%     405.8775       1018.295       Kurtosis       21.69683

    Besides commands for descriptive statistics, such as summarize, we can also check the normality of a variable visually by looking at some basic graphs in Stata, including histograms, boxplots, kdensity, pnorm, and qnorm. Let's keep using rconspc from the ERHScons1999.dta file for making some graphs.

    The histogram command is an effective graphical technique for showing both the skewness and kurtosis of rconspc.

    . histogram rconspc

    [Figure: histogram of real consumption per capita 1994 prices (x-axis), density on the y-axis]

    The normal option can be used to get a normal overlay. This shows the skew to the right in rconspc.


    . histogram rconspc, normal

    [Figure: histogram of real consumption per capita 1994 prices with a normal density overlay, density on the y-axis]

    We can use the bin() option, which specifies how the data are aggregated into bins, to increase the number of bins to 100. This better illustrates the distribution of rconspc. Notice that the histogram resembles a bell-shaped curve, but truncated at 0.

    . histogram rconspc, normal bin(100)

    [Figure: histogram of real consumption per capita 1994 prices with 100 bins and a normal density overlay, density on the y-axis]
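    The number of bins can be controlled either by count or by width. The sketch below, on the bundled auto dataset (an assumption for illustration, not the survey data), shows both; bin() and width() are alternatives and are not meant to be combined in one command.

    ```stata
    * Illustrative only: auto.dta stands in for the survey data
    sysuse auto, clear

    * Default number of bins, with a normal overlay
    histogram price, normal

    * Fix the number of bins
    histogram price, normal bin(20)

    * Or fix the width of each bin in the units of the variable
    histogram price, normal width(1000)
    ```

    Trying a few values is worthwhile, because too few bins can hide skewness and too many make the graph noisy.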

    The graph box command draws vertical box plots, which can help us examine the distribution of rconspc. In a vertical box plot, the y axis is numerical and the x axis is categorical. The lower and upper bounds of the box are the 25th and 75th percentiles of rconspc, and the line within the box is the median. The whiskers extend to the lower and upper adjacent values, that is, the most extreme observations lying within 1.5 times the interquartile range of the box; observations beyond the whiskers are plotted as individual points. If rconspc is normal, the median would be in the center of the box and the ends of the whiskers would be equidistant from the box.

    The boxplot for rconspc shows positive skew. The median is pulled to the low end of the box, and the upper whisker and the outside values are stretched out away from the box, for both male and female household heads. In fact, it seems worse for male household heads.

    . graph box rconspc, by(sexh)

    [Figure: box plots of real consumption per capita 1994 prices for female and male household heads; graphs by sex of household head]
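    To see which observations a box plot would draw as individual points, the limits can be computed by hand. The sketch below assumes Stata's default rule for graph box, in which whiskers stop at the most extreme observations within 1.5 times the interquartile range of the box. It uses the bundled auto dataset; the scalar names q1, q3, iqr, lofence, and hifence are hypothetical names chosen for this example.

    ```stata
    * Illustrative only: auto.dta stands in for the survey data
    sysuse auto, clear

    * Quartiles and interquartile range from summarize, detail
    quietly summarize price, detail
    scalar q1  = r(p25)
    scalar q3  = r(p75)
    scalar iqr = q3 - q1

    * Limits beyond which observations are drawn as separate points
    scalar lofence = q1 - 1.5*iqr
    scalar hifence = q3 + 1.5*iqr
    display "lower limit = " lofence "   upper limit = " hifence

    * Count the observations outside the limits (missing values excluded)
    count if (price < lofence | price > hifence) & !missing(price)

    * The corresponding box plot, split by a categorical variable
    graph box price, over(foreign)
    ```

    Listing the flagged observations with list instead of count is a natural next step before deciding whether any of them are miscoded.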

    The kdensity command with the normal option displays a kernel density estimate of a variable with a normal density superimposed on the graph. The same plot is particularly useful for verifying that regression residuals are normally distributed, which is a very important assumption for regression. The plot shows that rconspc is skewed to the right and more sharply peaked than the normal density.

    . kdensity rconspc, normal

    [Figure: kernel density estimate of real consumption per capita 1994 prices with a normal density overlay, density on the y-axis]

    Graphical alternatives to the kdensity command are the P-P plot and the Q-Q plot. The pnorm command produces a P-P plot, which graphs the standardized normal probability of a variable against its empirical cumulative probability. The plot should be approximately linear if the variable follows a normal distribution: the straighter the line formed by the P-P plot, the more closely the variable's distribution conforms to the normal distribution.

    . pnorm rconspc

    [Figure: P-P plot of rconspc; y-axis: Normal F[(rconspc-m)/s], x-axis: Empirical P[i] = i/(N+1)]
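    The two axes of the P-P plot can be rebuilt by hand, which makes clear what is being compared. The sketch below is an illustration on the bundled auto dataset (an assumption, not the survey data); the variable names z, p_normal, and p_emp are hypothetical.

    ```stata
    * Illustrative only: auto.dta stands in for the survey data
    sysuse auto, clear

    * Standardize the variable: (x - mean) / sd
    egen double z = std(price)

    * Vertical axis: normal cumulative probability of the standardized value
    generate double p_normal = normal(z)

    * Horizontal axis: empirical probability i/(N+1) of the sorted data
    sort price
    generate double p_emp = _n / (_N + 1)

    * Points close to the diagonal y = x suggest approximate normality
    twoway (scatter p_normal p_emp) (function y = x, range(0 1)), ///
        ytitle("Normal F[(price-m)/s]") xtitle("Empirical P[i] = i/(N+1)") ///
        legend(off)

    * Built-in command for comparison
    pnorm price
    ```

    Because both axes are probabilities between 0 and 1, the P-P plot is most sensitive to departures in the middle of the distribution.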

    The qnorm command plots the quantiles of a variable against the quantiles of a normal distribution. The closer the points of the Q-Q plot lie to the 45-degree line, the more closely the variable follows a normal distribution.

    . qnorm rconspc

    [Figure: Q-Q plot of rconspc against the quantiles of a normal distribution]