eea stata training manual

Upload: getachew-a-abegaz

Post on 04-Jun-2018


  • 8/14/2019 EEA Stata Training Manual

    1/85

    Training Module

    Using Stata for Survey Data Analysis

    Ethiopian Economics Association/ Ethiopian Economic Policy Research Institute/

    September 2009


    Section 5: Modifying variables
    Section 6: Advanced descriptive statistics
    Section 7: Presenting data with graphs (graphing data)
    Section 8: Normality and outliers
    Section 9: Statistical tests
    Section 10: Linear regression
    Section 11: Logistic regression
    Section 12: Panel data analysis (regression)
    Section 13: Data management
    Section 14: Advanced programming
    Section 15: Troubleshooting and updates

    Each section will include some training in the use of Stata commands and a practical application of these commands to the analysis of household survey data. The ERHS1999, for example, contains over fifty files, but we will focus our attention on a few of them:


    SECTION 1: INTRODUCTION TO STATA

    Stata is a package that offers a good combination of ease of learning and power. It has numerous powerful yet simple commands for data management, which allow users to perform complex manipulations with ease. Under Stata/SE, one can have up to 32,767 variables in a Stata data file and up to 11,000 variables in any estimation command.

    Stata performs most general statistical analyses (regression, logistic regression, ANOVA, factor analysis, and some multivariate analysis). The greatest strengths of Stata are probably in regression and logistic regression. Stata also has a very nice array of robust methods that are very easy to use, including robust regression and regression with robust standard errors; many other estimation commands offer robust standard errors as well.

    Stata can easily download programs developed by other users, and you can create your own Stata programs that seamlessly become part of Stata. Many cutting-edge statistical procedures have already been written by other users and can be incorporated into your own Stata programs. Stata uses one-line commands, which can be entered one at a time or many at a time in a Stata program.

    When you open Stata, you will see a menu bar across the top, a tool bar with buttons, and 3-5 windows (the number of windows open depends on which windows were open the last time Stata was used). Each is described briefly below.

    The Stata Interface

    1. Windows

    The Stata windows give you all the key information about the data file you are using, recent commands, and the results of those commands. Some of them open automatically when you start Stata, while others can be opened using the Windows pull-down menu or the buttons on the tool bar.

    These are the Stata windows:

    Stata Results           To see recent commands and output
    Stata Command           To enter a command
    Stata Browser           To view the data file (needs to be opened)
    Stata Editor            To edit the data file (needs to be opened)
    Stata Viewer            To get help on how to use Stata
    Variables               To see a list of variables
    Review                  To see recent commands
    Stata Do-file Editor    To write or edit a program (needs to be opened)


    The Command window on the bottom right is where you'll enter commands. When you press ENTER, they are pasted into the Stata Results window above, which is where you will see your commands executed and view the results. You can also reuse recent commands by pressing the Page Up key (to go to the previous command) and the Page Down key (to go to the next command).

    The Results window (with the black background) shows all recent commands, output, error messages, and help. The text is color-coded as follows:

    green     General information and the frame and headings of output tables
    blue      Commands or error messages that can be clicked on for more information
    white     Stata commands
    yellow    Numbers in output tables
    red       Error messages

    The slide bar on the right side can be used to look at earlier results that are not on the screen. However, unlike SPSS, the Stata Results window does not keep all output generated. It will keep about 300-600 lines of the most recent output, deleting earlier output. If you want to store output in a file, you must use the log command (more on this later).

    Stata Browser This window shows all the data in memory. The Stata Browser does not appear automatically when you start Stata. To open the Browser, click on the button with a table and magnifying glass or type browse. Unlike SPSS, when the Stata Browser is open, you cannot execute any commands, either from the Stata Command window or from the Do-file Editor. In addition, you cannot change any of the data. You can, however, sort the data or hide certain variables using buttons at the top of the Stata Browser window.

    Stata Editor This window is exactly like the Stata Browser window except that you can change the data. We do not recommend using this window because you will have no record of the changes you make in the data. It is better to correct errors in the data using a Do-file program that can be saved.

    Stata Viewer This window provides help on Stata commands and rules. To open the Stata Viewer window, you can click on Windows/Viewer or click on the eye button on the tool bar. To use the Stata Viewer window, type a command in the space at the top and the Viewer will give you the purpose and rules for using that command, along with some examples. Any blue text in the Viewer can be clicked on for more information about that command.

    Variables This window (tall with a white background) lists all the variables that exist in memory. When you open a Stata data file, it lists the variables in the file. If you create new variables, they will be added to the list of variables. If you delete variables, they will be removed from the list. You can insert a variable into the Stata Command window by clicking on it in the Variables window.

    Do-file Editor This window allows you to write, edit, save, and execute a Stata program. A Stata program (or Do-file) is simply a set of Stata commands written by the user. The advantage of using the Do-file Editor rather than the Stata Command window is that the Do-file allows you to save, revise, and rerun a set of commands. Exploratory analysis of the data can be done with the Stata Command window, but any serious data analysis should be carried out using the Do-file Editor, not the Stata Command window. The Do-file Editor can be opened by clicking on Windows/Do-file Editor or by clicking on the envelope button.

    With so many windows, it is sometimes difficult to fit them all on the screen. You can adjust the size and position of each window the way you like it and then save the layout by clicking on Prefs/Save Windowing Preferences. Each time you open Stata, the windows will be arranged according to your preferred layout.

    On the right are two convenient windows. The Variables window keeps a list of your current variables. If you click on one of them, its name will be pasted into the current command at the location of the cursor, which saves a little typing. The Review window keeps a list of all the commands you've typed in the Stata session. Click on one, and it will be pasted into the command window, which is handy for fixing typos. Double-click, and the command will be pasted and re-executed. You can also export everything in the Review window into a .do file (more on them later) so you can run the exact same commands at any time. To do this, right-click the Review window.

    When we first open Stata, all these windows are blank except for the Stata Results window. You can resize these 4 windows independently, and you can resize the outer window as well. To save your window size changes, click on the Prefs button, then Save Windowing Preferences.

    Entering commands in Stata works pretty much as you would expect. BACKSPACE deletes the character to the left of the cursor, DELETE the character to the right, the arrow keys move the cursor around, and anything you type is inserted at the current location of the cursor. The up arrow does not retrieve previous commands, but you can do that by pressing PAGE UP or CTRL-R, or by using the Review window.


    2. Menus

    Stata displays 8 drop-down menus across the top of the outer window, from left to right:

    File
        Open                  open a Stata data file (use)
        Save/Save as          save the Stata data in memory to disk
        Do                    execute a do-file
        Filename              copy a filename to the command line
        Print                 print log or graph
        Exit                  quit Stata
    Edit
        Copy/Paste            copy text among the Command, Results, and Log windows
        Copy Table            copy table from Results window to another file
        Table copy options    what to do with table lines in Copy Table
    Data, Graphics,
    Statistics                build and run Stata commands from menus
    User                      menus for user-supplied Stata commands (download from Internet)
    Window                    bring a Stata window to the front
    Help                      Stata command syntax and keyword searches

    3. Button bar

    The buttons on the button bar are, from left to right (the equivalent command, where one exists, follows the colon):

    Open a Stata data file: use
    Save the Stata data in memory to disk: save
    Print a log or graph
    Open a log, or suspend/close an open log: log
    Open a new viewer
    Bring Graph window to front
    New Do-file Editor: doedit
    Edit the data in memory: edit
    Browse the data in memory: browse
    Clear -more- condition: Space Bar
    Stop current command or do-file: Ctrl-Break

    SECTION 3: EXPLORING DATA FILES

    3.1. Common Stata Syntax

    This section covers commands that are used for preliminary exploration of data in a file. Stata commands follow the same syntax:

    [by varlist1:] command [varlist2] [if exp] [in range] [weight] [, options]

    Items inside square brackets are either optional or not available for every command. This syntax applies to nearly all Stata commands. To use the by prefix, the dataset must first be sorted on the by variable(s); the prefix repeats the Stata command on each subset of the data.
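
    The pieces of the general syntax can be tried out as follows. This is a minimal sketch that assumes Stata's bundled auto dataset (loaded with sysuse) stands in for the ERHS files; the variables price, mpg, weight, and foreign belong to that example dataset, not to ERHScons1999.

```stata
* Illustration only: the auto dataset ships with Stata and replaces the ERHS data here.
sysuse auto, clear

* command [varlist] [if exp] [in range] [, options]
summarize price mpg if foreign == 1 in 1/74, detail

* [by varlist1:] requires the data to be sorted on the by variable first
sort foreign
by foreign: summarize price

* [weight] goes in square brackets before the comma
summarize price [aweight = weight]
```

    Each optional element can be dropped independently, so summarize on its own line is also a complete command.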


    Logical operators used in Stata

    ~ Not

    == Equal

    ~= not equal

    != not equal

    > greater than

    >= greater than or equal

    < less than


    4. iweight, or importance weights, are weights that indicate the "importance" of the observation in some vague sense. iweights have no formal statistical definition; any command that supports iweights will define exactly how they are treated. In most cases, they are intended for use by programmers who need to implement their own analytical techniques by using some of the available estimation commands. Special care should be taken when using importance weights to understand how they are used in the formulas for estimates and variance. This information is available in the Methods and Formulas section in the Stata manual for each estimation command. In general, these formulas will be incorrect for computing the variance for data from a sample survey.

    3.2 Examining dataset

    clear
    The clear command removes all data, variables, and labels from memory to get ready to use a new data file. You can clear memory using the clear command or by using the clear option as part of the use command (see the use command). This command does not delete any data saved to the hard drive.

    set memory
    First you can check how much memory is allocated to hold your data using the memory command. For instance, we are now running Stata/SE 11 under Windows, and this is what the memory command told us.


    Figure 2: Working memory space

    . memory
                                                      bytes
    --------------------------------------------------------------------
    Details of set memory usage
        overhead (pointers)                           5,808        0.06%
        data                                        107,448        1.02%
                                                ----------------------------
        data + overhead                             113,256        1.08%
        free                                     10,372,496       98.92%
                                                ----------------------------
        Total allocated                          10,485,752      100.00%
    --------------------------------------------------------------------
    Other memory usage
        set maxvar usage                          1,816,666
        set matsize usage                         1,315,200
        programs, saved results, etc.                 3,338
                                                ---------------
        Total                                     3,135,204
    -------------------------------------------------------
    Grand total                                  13,620,956

    We have about 10 MB free for reading in a data file. Whenever we try to read a data file bigger than this free space, we will get the following error message:

    no room to add more observations
    r(901);

    In this case we have to allocate more memory, say 25 MB (if 25 MB is sufficient for the current file), with the set memory command before trying to use the file.

    set memory 25m

    Figure 3: Current memory allocation after set memory 25m command

    Current memory allocation

                        current                                 memory usage
        settable          value     description                 (1M = 1024k)
        --------------------------------------------------------------------
        set maxvar         5000     max. variables allowed           1.733M
        set memory          25M     max. data space                 25.000M
        set matsize         400     max. RHS vars in models          1.254M
                                                                -----------
                                                                    27.987M

    Now that we have allocated enough memory, we will be able to read bigger files, provided that they fit within the specified memory space. After setting the memory space to 25m, the memory command reports:


    Figure 4: Adjusted working memory space

    . memory
                                                      bytes
    --------------------------------------------------------------------
    Details of set memory usage
        overhead (pointers)                           5,808        0.02%
        data                                        107,448        0.41%
                                                ----------------------------
        data + overhead                             113,256        0.43%
        free                                     26,101,136       99.57%
                                                ----------------------------
        Total allocated                          26,214,392      100.00%
    --------------------------------------------------------------------
    Other memory usage
        set maxvar usage                          1,816,666
        set matsize usage                         1,315,200
        programs, saved results, etc.                 1,778
                                                ---------------
        Total                                     3,133,644
    -------------------------------------------------------
    Grand total                                  29,348,036

    If we want to allocate 250m (250 megabytes) every time we start Stata, we can type:

    . set memory 250m, permanently

    Stata will then allocate this amount of memory every time it starts.

    use

    This command opens an existing Stata data file. The syntax is:

    use filename [, clear]                                        opens new file
    use [varlist] [if exp] [in range] using filename [, clear]    opens selected parts of file

    If there is no extension, Stata assumes it is .dta.
    If there is no path, Stata assumes the file is in the current folder.
    You can use a path name such as: use C:\...\ERHScons1999
    If the path name has spaces, you must use double quotes: use "d:\my data\ERHScons1999"
    You can open selected variables of a file using a variable list.
    You can open selected records of a file using if or in.

    Here are some examples of the use command:

    use ERHScons1999                          opens the file ERHScons1999.dta for analysis
    use ERHScons1999 if q1a == 1              opens data from region 1
    use ERHScons1999 in 5/25                  opens records 5 through 25 of file
    use hhid hhsize cons using ERHScons1999   opens 3 variables from ERHScons1999 file
    use C:\training\ERHScons1999              opens the file ERHScons1999.dta in the specified folder
    use "C:\data files\ERHScons1999"          use quotation marks if there are spaces
    use ERHScons1999, clear                   clears memory before opening the new file


    When running a Do-file program, we should combine use with the clear option. For instance, here we load a raw data set from ERHScons1999. The clear option allows Stata to clear the previous data set from memory in order to load the new one.

    . use C:\...\ERHScons1999.dta, clear

    Stata does not want you to lose changes you have made to the data set in memory, so without clear it refuses to load a new file over unsaved data. If you really want to discard the changes in memory, the clear option specifies that it is okay to replace the data in memory, even though the current data have not been saved to disk.

    save
    The save command saves the dataset as a .dta file under the name you choose. Editing the dataset changes data in the computer's memory; it does not change the data stored on the computer's disk until you save.

    . save C:\...\consumption.dta, replace

    The replace option allows you to save a changed file to the disk, replacing the original file. Stata is worried that you will accidentally overwrite your data file, so you need to use the replace option to tell Stata that you know the file exists and you want to replace it.
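
    The interplay of save, replace, use, and clear can be sketched as below. This is an illustration under stated assumptions: the bundled auto dataset and a temporary file stand in for ERHScons1999.dta and consumption.dta, and the macro name cars is made up for the example.

```stata
* Run as one do-file: the temporary file name lives in a local macro.
sysuse auto, clear

* tempfile reserves a scratch file name; Stata erases the file when the do-file ends
tempfile cars
save "`cars'"

* Saving over a file that already exists needs the replace option
save "`cars'", replace

* Load three variables, foreign cars only, discarding the data in memory
use make price foreign if foreign == 1 using "`cars'", clear
describe, short
```

    Reading only the variables and records you need in this way also keeps memory use down for large survey files.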

    edit
    This command opens the Data Editor window, which lets us view all observations in memory. You can change the data using the Data Editor, but we do not recommend it because you will have no record of the changes you make in the data. It is better to correct errors in the data using a Do-file program that can be saved (we will see Do-file programs later).

    browse
    This command opens the Data Browser, a window exactly like the Data Editor except that you cannot change the data.

    describe

    This command provides a brief description of the data file. You can use des or d and Stata will understand. The output includes:

    the number of variables
    the number of observations (records)
    the size of the file
    the list of variables and their characteristics


    Example 1: Using describe to show information about a data file

    . des

    Contains data from C:\training\ERHSCONS1999.dta
      obs:         1,452
     vars:            15                          24 Feb 2007 07:07
     size:       113,256 (98.9% of memory free)   (_dta has notes)
    -------------------------------------------------------------------------------
                  storage  display     value
    variable name   type   format      label      variable label
    -------------------------------------------------------------------------------
    q1a             float  %9.0g       reg        Region
    q1b             double %15.0g      w          Wereda
    q1c             double %17.0g      pa         Peseant association
    q1d             double %12.0g                 Household id
    sexh            byte   %8.0g       sexhh      Sex of household head
    ageh            float  %9.0g       p1s1q4     Age of household head
    cons            float  %9.0g                  consumption per month
    food            float  %9.0g                  food cons per month
    hhsize          byte   %8.0g                  household size
    aeu             float  %9.0g                  adult equivalent units in household
    fpi             float  %9.0g                  food price index
    rconspc         float  %9.0g                  real consumption per capita 1994 prices
    rconsae         float  %9.0g                  real consumption per adult 1994 prices
    poor            double %8.2f
    hhid            double %12.0f                 selected household unique id
    -------------------------------------------------------------------------------
    Sorted by:  hhid

    It also provides the following information on each variable in the data file:

    the variable name
    the storage type: byte is used for small integers such as binary variables, int is used for larger integers, and float (or double) is used for continuous variables that may have decimals. To see the limits on each storage type, type help data types.
    the display format, which indicates how the variable will appear in the output
    the value label, which is the name of a set of labels for different values
    the variable label, which is a name for the variable that is used in output
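
    A short sketch of describe in use follows. It assumes the bundled auto dataset in place of the ERHS file; r(N) and r(k) are results that describe leaves behind for later commands.

```stata
* Illustration only: auto dataset instead of ERHScons1999.dta
sysuse auto, clear

* Describe a few variables: name, storage type, display format, labels
describe make price foreign

* Header information only (observations, variables, sort order)
describe, short

* describe stores the counts so a do-file can use them
display "observations: " r(N) "   variables: " r(k)
```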

    list
    This command lists values of variables in the data set. The syntax is:

    list [varlist] [if exp] [in range]

    With varlist, you can specify which variables' values will be presented. If varlist is not specified, all variables will be listed. With if and in, you can specify which records will be listed. Here are some examples:

    . list                          lists entire dataset
    . list in 1/10                  lists observations 1 through 10
    . list hhsize q1a food          lists selected variables
    . list hhsize sexh in 1/20      lists observations 1-20 for selected variables
    . list if q1a < 6               lists cases in regions 1 through 5

    if
    This qualifier is used to select certain records in carrying out a command. It is similar to the process if command in SPSS, except that in Stata it is not considered a separate command. The syntax is:

    command if exp

    Examples include:

    . list hhid q1a food if food > 2000      lists data if food is above 2000
    . tab q1a if cons > 1000 & cons < ...    tabulates region for households within a consumption range
    . browse ... if food >= 1200             browses data if food consumption is at least 1200

    Note that if expressions always test equality with ==, not a single =. Also note that | indicates or while & indicates and.

    in
    We have also used in to select records based on the observation number. The syntax is:

    command in range

    For example:

    . list in 10                 lists observation number 10
    . summarize in 10/20         summarizes observations 10-20
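
    The if and in qualifiers can be combined with the logical operators as sketched below. The example assumes the bundled auto dataset rather than the ERHS data, and it also shows one well-known pitfall: Stata treats numeric missing values as larger than any number, so a condition such as rep78 > 3 also selects observations where rep78 is missing.

```stata
* Illustration only: auto dataset instead of ERHScons1999.dta
sysuse auto, clear

* in selects by observation number
list make price mpg in 1/5

* & means and, | means or, == tests equality
list make price if price > 10000 & foreign == 1
count if mpg < 15 | mpg > 35

* Missing values count as larger than any number, so exclude them explicitly
count if rep78 > 3
count if rep78 > 3 & !missing(rep78)
```

    The two final counts differ by the number of observations with a missing rep78.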

    Example 2: Using list to look at data

    . list hhid q1a q1b q1c q1d hhsize rconspc in 10/25

         +-------------------------------------------------------------------+
         |         hhid      q1a     q1b       q1c   q1d   hhsize    rconspc |
         |-------------------------------------------------------------------|
     10. | 101010000010   Tigray   Atsbi   Haresaw    10        4   134.5961 |
     11. | 101010000011   Tigray   Atsbi   Haresaw    11        3   168.9437 |
     12. | 101010000012   Tigray   Atsbi   Haresaw    12        3   135.1815 |
     13. | 101010000013   Tigray   Atsbi   Haresaw    13        7   102.3454 |
     14. | 101010000014   Tigray   Atsbi   Haresaw    14        9   68.04964 |
         |-------------------------------------------------------------------|
     15. | 101010000015   Tigray   Atsbi   Haresaw    15       12   49.61188 |
     16. | 101010000016   Tigray   Atsbi   Haresaw    16        4   85.05015 |
     17. | 101010000017   Tigray   Atsbi   Haresaw    17        5   84.72104 |
     18. | 101010000018   Tigray   Atsbi   Haresaw    18        2   95.42028 |
     19. | 101010000019   Tigray   Atsbi   Haresaw    19       10   140.7843 |
         |-------------------------------------------------------------------|
     20. | 101010000020   Tigray   Atsbi   Haresaw    20        3   80.58356 |
     21. | 101010000021   Tigray   Atsbi   Haresaw    21        3   95.98959 |
     22. | 101010000022   Tigray   Atsbi   Haresaw    22        5   68.05075 |
     23. | 101010000023   Tigray   Atsbi   Haresaw    23        4    52.4964 |
     24. | 101010000024   Tigray   Atsbi   Haresaw    24        3   91.86269 |
         |-------------------------------------------------------------------|
     25. | 101010000025   Tigray   Atsbi   Haresaw    25        5   149.1702 |
         +-------------------------------------------------------------------+

    . list q1a cons aeu poor in 200/215

         +----------------------------------+
         |    q1a       cons     aeu   poor |
         |----------------------------------|
    200. | Amhara   661.3979    1.82   0.00 |
    201. | Amhara   321.7693    8.14   1.00 |
    202. | Amhara    169.784     2.3   0.00 |
    203. | Amhara   907.9995    3.14   0.00 |
    204. | Amhara   232.6273   4.148   1.00 |
         |----------------------------------|
    205. | Amhara   432.4525    6.86   1.00 |
    206. | Amhara      59.53    1.46   1.00 |
    207. | Amhara     228.22     3.4   0.00 |
    208. | Amhara   1298.875    5.44   0.00 |
    209. | Amhara    144.494    3.48   1.00 |
         |----------------------------------|
    210. | Amhara    266.974    4.28   0.00 |
    211. | Amhara   43.97179     .74   1.00 |
    212. | Amhara   216.0467   3.408   1.00 |
    213. | Amhara   492.4958    2.94   0.00 |
    214. | Amhara   437.7144    2.46   0.00 |
         |----------------------------------|
    215. | Amhara    166.354    1.74   0.00 |
         +----------------------------------+

    If you are not careful with list, you will get a lot more output than you want. If Stata starts giving you more output than you really want, use the stop button (the button with an X).

    codebook
    The codebook command is a great tool for getting a quick overview of the variables in the data file. It produces a kind of electronic codebook from the data file, displaying information about variables' names, labels, and values.


    Example 3: Using codebook to look at data

    . codebook sexh

    sexh                                                  Sex of household head
    ----------------------------------------------------------------------------
                      type:  numeric (byte)
                     label:  sexhh

                     range:  [0,1]                        units:  1
             unique values:  2                        missing .:  0/1452

                tabulation:  Freq.   Numeric  Label
                               400         0  Female
                              1052         1  Male

    . codebook rconspc

    rconspc                             real consumption per capita 1994 prices
    ----------------------------------------------------------------------------
                      type:  numeric (float)

                     range:  [4.2201104,1018.2954]        units:  1.000e-07
             unique values:  1448                     missing .:  3/1452

                      mean:   90.3674
                  std. dev:   81.9962

               percentiles:        10%       25%       50%       75%       90%
                               25.1043   39.9402   65.9926   114.253   180.891

    inspect
    This is another useful command for getting a quick overview of a data file. The inspect command displays information about the values of variables and is useful for checking data accuracy.

    Example 4: Using inspect to look at data

    . inspect sexh

    sexh:  Sex of household head                  Number of Observations
    ----------------------------        ----------------------------------
                                           Total    Integers   Nonintegers
    |     #              Negative              -           -             -
    |     #              Zero                400         400             -
    |     #              Positive           1052        1052             -
    |     #                                -----       -----         -----
    |  #  #              Total              1452        1452             -
    |  #  #              Missing               -
    +----------------------                -----
    0                    1                  1452
       (2 unique values)

    sexh is labeled and all values are documented in the label.


    count
    The count command shows the number of observations that satisfy an if condition. If no condition is specified, count displays the number of observations in the data.

    . count
     1452

    . count if q1a==3
      466
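
    The overview commands above (codebook, inspect, count) are often used together as a quick data check. The following is a minimal sketch of that idea, assuming the bundled auto dataset in place of the ERHS file; rep78 is a variable in that dataset that happens to contain missing values.

```stata
* Illustration only: auto dataset instead of ERHScons1999.dta
sysuse auto, clear

* One line per variable: observations, unique values, mean, min, max
codebook foreign rep78, compact

* Counts of negative, zero, positive, and missing values
inspect rep78

* count stores its result in r(N) for use in later commands
count if missing(rep78)
display "observations with missing rep78: " r(N)
```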

    3.3. Preliminary Descriptive Statistics

    tabulate, tab1, tab2
    These are three related commands that produce frequency tables for discrete variables. They can produce one-way frequency tables (tables with the frequency of one variable) or two-way frequency tables (tables with a row variable and a column variable). These commands are similar to the frequency and crosstab commands in SPSS. How do they differ?

    tabulate or tab    produces a frequency table for one or two variables
    tab1               produces a one-way frequency table for each variable in the variable list
    tab2               produces all possible two-variable tables from the list of variables

    You can use several options with these commands:

    all       gives all the tests of association for two-way tables
    cell      gives the overall percentage for two-way tables
    column    gives column percentages for two-way tables
    row       gives row percentages for two-way tables
    nofreq    suppresses printing the frequencies
    chi2      provides the chi-squared test for two-way tables

    There are many other options, including other statistical tests. For more information, type help tabulate.

    Some examples of the tabulate commands are:

    . tabulate q1a                        produces table of frequency by region
    . tabulate q1a sexh                   produces a cross-tab of frequencies by region and sex of head
    . tabulate q1a hhsize, row            produces a cross-tab by region and hhsize with row percentages
    . tabulate sexh hhsize, cell nofreq   produces a cross-tab of overall percent by sex and hhsize
    . tab1 q1a q1b hhsize                 produces three tables, a frequency table for each variable
    . tab2 q1a poor sexh                  produces three tables, a cross-tab of each pair of variables
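
    The difference between tabulate, tab1, and tab2 and the effect of the options can be seen in the sketch below. It assumes the bundled auto dataset rather than the ERHS data; rep78, foreign, and headroom are variables of that dataset.

```stata
* Illustration only: auto dataset instead of ERHScons1999.dta
sysuse auto, clear

* One-way table; the missing option adds a row for missing values
tabulate rep78
tabulate rep78, missing

* tab1: a separate one-way table for each variable
tab1 rep78 foreign

* tab2: a two-way table for every pair of variables (three tables here)
tab2 rep78 foreign headroom

* Two-way table with row percentages only, then with a chi-squared test
tabulate rep78 foreign, row nofreq
tabulate rep78 foreign, chi2
```

    After a two-way tabulate, r(N), r(r), and r(c) hold the number of observations, rows, and columns, which is convenient when tables are produced inside a do-file.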


    Example 5: Using tabulate on categorical variables

    . tab q1b

             Wereda |      Freq.     Percent        Cum.
    ----------------+-----------------------------------
              Atsbi |         84        5.79        5.79
       Sebhassahsie |         66        4.55       10.33
            Ankober |         86        5.92       16.25
    Basso na Worana |        175       12.05       28.31
            Enemayi |         61        4.20       32.51
             Bugena |        144        9.92       42.42
               Adaa |         95        6.54       48.97
              Kersa |         95        6.54       55.51
             Dodota |        109        7.51       63.02
         Shashemene |         97        6.68       69.70
              Cheha |         65        4.48       74.17
      Kedida Gamela |         74        5.10       79.27
               Bule |        134        9.23       88.50
             Boloso |         96        6.61       95.11
           Daramalo |         71        4.89      100.00
    ----------------+-----------------------------------
              Total |      1,452      100.00

    . tab q1b sexh

                    | Sex of household head
             Wereda |    Female       Male |     Total
    ----------------+----------------------+----------
              Atsbi |        48         36 |        84
       Sebhassahsie |        29         37 |        66
            Ankober |        13         73 |        86
    Basso na Worana |        52        123 |       175
            Enemayi |        11         50 |        61
             Bugena |        55         89 |       144
               Adaa |        23         72 |        95
              Kersa |        31         64 |        95
             Dodota |        26         83 |       109
         Shashemene |        26         71 |        97
              Cheha |        22         43 |        65
      Kedida Gamela |        15         59 |        74
               Bule |        11        123 |       134
             Boloso |        25         71 |        96
           Daramalo |        13         58 |        71
    ----------------+----------------------+----------
              Total |       400      1,052 |     1,452

    In one-way tables, Stata gives the count, the percentage, and the cumulative percentage (see first example in box).

    In two-way tables, Stata gives the count only, unless you ask for other statistics (see second example in box).

    The col, row, and cell options ask Stata to include percentages in two-way tables.

    summarize
    The summarize command produces statistics on continuous variables such as ageh, food, cons, and hhsize. The syntax looks like this:

    summarize [varlist] [if exp] [in range] [, detail]

    By default, it produces the following statistics:

    Number of observations
    Average (or mean)
    Standard deviation
    Minimum
    Maximum

    If you specify detail, Stata gives you additional statistics, such as:

    skewness
    kurtosis
    the four smallest values
    the four largest values
    various percentiles

    Here are some examples:

    . summarize                           gives statistics on all variables
    . summarize hhsize food               gives statistics on selected variables
    . summarize hhsize cons if q1a==3     gives statistics on two variables for one region

    Example 6: Using summarize to study continuous variables

    . sum rconspc rconsae hhsize

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
         rconspc |      1449    90.36742    81.99623    4.22011   1018.295
         rconsae |      1449    108.7874    97.27053   4.811201   1212.256
          hhsize |      1452    5.782369    2.740968          1         17

    . sum rconspc rconsae hhsize if q1a==4

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
         rconspc |       395    111.6185    99.09839   8.393298   1018.295
         rconsae |       395    132.6018    116.6133   9.608795   1212.256
          hhsize |       396    6.209596    2.853203          1         16

    The first example gives the statistics for the whole sample, while the second gives the statisticsonly for households in Region 4.
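    After summarize runs, Stata keeps the results in r() so they can be reused in later commands. The following is a minimal sketch on made-up data (the variables hhsize and cons and their values are hypothetical); it also shows that missing values are left out of the calculation.

```stata
* Toy data with one missing consumption value (hypothetical values)
clear
input hhsize cons
2 100
4 250
6 400
8 .
end

summarize cons              // default statistics; the missing value is skipped
display r(N)                // number of non-missing observations
display r(mean)             // mean of cons

summarize cons, detail      // adds percentiles, skewness, and kurtosis
display r(p50)              // median, stored only when detail is specified
```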

    by

    This prefix goes before a command and asks Stata to repeat the command for each value of a variable. The general syntax is:

    by varlist: command

    Note: by requires the data to be sorted by varlist. The bysort prefix sorts the data and runs the command in one step, so it is the form most commonly used.

    An example of the by prefix is:

    bysort sexh: sum rconspc     for each sex of household head, gives statistics on real per capita consumption

    Example 7: Using the by prefix

    -> sexh = Female

        Variable |       Obs        Mean    Std. Dev.        Min        Max
    -------------+---------------------------------------------------------
         rconspc |       398    100.2183    89.18895   7.068164   624.1437

    -----------------------------------------------------------------------
    -> sexh = Male

        Variable |       Obs        Mean    Std. Dev.        Min        Max
    -------------+---------------------------------------------------------
         rconspc |      1051    86.63701    78.82594    4.22011   1018.295
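    The relation between by and bysort can be seen in a short sketch. The data below are made up for illustration (female and cons are hypothetical variables).

```stata
* Toy data: sex of head and consumption (hypothetical values)
clear
input female cons
1 120
0 80
1 100
0 90
0 70
end

sort female
by female: summarize cons       // by needs the data sorted by female first

bysort female: summarize cons   // bysort sorts and runs the command in one step
```

    Without the sort command, the plain by prefix stops with a "not sorted" error, which is why bysort is usually more convenient.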

    help

    The help command gives you information about any Stata command or topic

    help [command]

    For example,

    . help tabulate      gives a description of the tabulate command
    . help summarize     gives a description of the summarize command

    SECTION 4: STORING COMMANDS AND OUTPUT

    In this section, we discuss how to store commands and output for later use. First, we describe how to store commands using a program (Stata calls it a Do-file), how to edit the program, and how to run it. Second, we present different ways of saving and using the output generated by Stata. The following topics are covered:

        Using the Do-file Editor
        log using
        log off
        log on
        log close

    Using the Do-file Editor

    The Do-file Editor allows you to store a program (a set of commands) so that you can edit it and execute it later. Why use the Do-file Editor?

        It makes it easier to check and fix errors,
        It allows you to run the commands later,
        It lets you show others how you got your results, and
        It allows you to collaborate with others on the analysis.

    In general, any time you are running more than 10 commands to get a result, it is easier and safer to use a Do-file to store the commands.

    To open the Do-file Editor, you can click on Window/Do-file Editor or click on the envelope on the Tool Bar.

    Within the Do-file Editor, there is a menu bar and tool bar buttons to carry out a variety of editing functions. The menu bar is similar to the one in Microsoft Word:

    File/New           to open a new, blank Do-file
    File/Open          to open an existing Do-file
    File/Save          to save the current Do-file
    File/Save as       to save the current Do-file under a new name
    File/Insert file   to insert another file into the current one
    File/Print         to print the Do-file
    File/Close         to close the Do-file
    Edit/Undo          to undo the last command
    Edit/Cut           to delete or move the marked text in the Do-file
    Edit/Copy          to copy the marked text in the Do-file
    Edit/Paste         to insert the copied or cut text into the Do-file
    Search/Find        to find a word or phrase in the Do-file
    Search/Replace     to find and replace a word or phrase in the Do-file
    Tools/Do           to execute all the commands or the marked commands in the Do-file
    Tools/Run          to execute all the commands or the marked commands in the Do-file without showing any output in the Stata Results window

    The tool bar buttons can be used to carry out some of these tasks more quickly. For example, there are buttons for File/New, File/Open, File/Print, Search/Find, Edit/Cut, Edit/Copy, Edit/Paste, Edit/Undo, Do, and Run. Probably the button you will use most is the last one, which shows a page with text on it. This is the Do button for executing the program or the marked part of the program.

    Finally, the keyboard commands may be even quicker to use than the buttons. The most useful keyboard commands are:

    Control-O    Open file
    Control-S    Save file
    Control-C    Copy
    Control-X    Cut
    Control-V    Paste
    Control-Z    Undo
    Control-F    Find
    Control-H    Find and Replace

    To run the commands in a Do-file, you can click on the Do button (the last one) or click on Tools/Do. If you want to run one or just a few commands rather than the whole file, mark the commands and click on the Do button. You do not have to mark the whole command, but at least one character in the command must be marked in order for the command to be executed (unlike SPSS, it is not enough to have the cursor on a command).

    Although layout is a matter of personal preference, it may be useful to have the Stata Results window and the other windows on one side of the screen and the Do-file Editor window on the other. This makes it easy to switch back and forth. When you have arranged the windows the way you like, you can save the layout by clicking Prefs/Save Windowing Preferences. Each time you open Stata, it will use your chosen layout.

    Note: If you would like to add a note to a do-file but do not want Stata to execute it, enclose the note between /* and */.

    /* This Stata program illustrates how to create a do-file */

    log using C:\...\eeatraining.log, replace
    log close
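    A slightly fuller do-file might look like the sketch below. It is only an illustration: the log file name example_session.log and the variable id are hypothetical, and capture (which lets the do-file continue if no log is open) is a command not covered in this section. Note that // comments work inside do-files but not in the Command window.

```stata
* example.do: a minimal do-file sketch (file and variable names are hypothetical)
capture log close                 // close any log left open; no error if none is open
log using example_session.log, replace text

/* A block comment can span
   several lines */
clear
set obs 5                         // create five empty observations
generate id = _n                  // _n is the observation number
summarize id                      // this output is copied to the log

log close
```

    Starting every do-file with capture log close is a common habit, because a do-file that stopped with an error may have left its log open.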

    Saving the Output

    As mentioned in an earlier section, the Stata Results window does not keep all the output you generate. It only stores about 300-600 lines, and when it is full, it begins to delete the old results as you add new results. You can increase the amount of memory allocated to the Stata Results window, but even this will probably not be enough for a long session with Stata. Thus, we need to use log to save the output.

    There are four ways to control the log operations:

    1. You can use the log button on the tool bar. It looks like a scroll.
    2. You can click on File/Log to get four options: Begin (log using), Close, Suspend (log off), and Resume (log on).
    3. You can use log commands in the Stata Command window.
    4. You can use log commands in the Stata Do-file Editor.

    In this section, we describe the commands, which can be used in the Stata Command window or in a do-file (program).

    log using

    This command creates a file with a copy of all the commands and output from Stata. The first time you open a log, you must give a name to the new file to be created. The syntax is:

    log using filename [, append replace [ text | smcl ] ]

    where filename is the name you give the new file. The options are:

    append     adds the output to an existing file
    replace    replaces an existing file with the output
    text       tells Stata to create the log file in text (ASCII) format
    smcl       tells Stata to create the log file in SMCL format

    Here are some examples:

    log using temp22                        saves output to a file called temp22
    log using temp20, replace               saves output to an existing file, temp20, replacing its contents
    log using regoutput, append             saves output to an existing file, regoutput, adding to its contents
    log using "d:\my data\myfile.txt"       saves output in the specified file in the specified folder

    Several points should be remembered in using this command:

        If you use an existing file name but do not say replace or append, Stata will give an error message that the file already exists.
        Log files in text format can be opened with Wordpad, Notepad, the DOS editor, or any word processor, but the file does not have any formatting.
        SMCL files have formatting (bold, colors, etc.) but can only be opened with Stata.
        SMCL format is the default.

    log off

    This command temporarily turns off the logging of output, so that any subsequent output is not copied to the log file. This is useful if you want to save some of the output but not all. log off only works after a log using command.

    log on

    This command is used to restart the logging, copying any new output to the log file that was already defined. log on only works after a log using and a log off command.

    log close

    This command is used to turn off the logging and save the file. How are log off and log close different? log off allows you to turn logging back on easily with log on, continuing to use the same log file. After a log close, however, the only way to start logging again is with log using.

    set logtype text

    This command tells Stata to save log files in text (ASCII) format by default for the rest of the session (add the permanently option to keep the setting in later sessions). It is the same as adding the text option to every log using command, but it is easier. If you prefer text format log files, this is the best way to make sure all the log files are in this format.

    set logtype smcl

    This command tells Stata to save log files in SMCL format by default. It is the same as adding the smcl option to every log using command.
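    The way the log commands fit together can be sketched as follows. The file name session_demo.log is hypothetical, and capture log close is added only so that the sketch runs even if another log happens to be open.

```stata
capture log close                          // make sure no other log is open
log using session_demo.log, replace text   // hypothetical file name

display "this line is recorded in the log"
log off                                    // suspend logging
display "this line is not recorded"
log on                                     // resume logging to the same file
display "this line is recorded again"

log close                                  // stop logging and close the file
```

    Opening session_demo.log in a text editor afterwards should show the first and third display lines but not the second.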

    Exercise 1: Exploring the ERHS

    This section includes some questions that you can answer using the r5ERHS files provided on your computer and the commands described in this section. Remember two tricks to make it easier to fix your mistakes:

        You can use PageUp to retrieve the most recent command.
        You can click on variables in the Variable window to paste them into the Command window.

    Summary file. The file ERHScons1999 contains summary variables calculated from various other data files. It is at the household level. Open the file by entering use C:\training\ERHScons1999.dta, clear in the Command window and pressing Return. Open do and log files to save the commands and output. Use the log file to copy and paste some of the output tables into Excel and Word files.

    1. How many variables and how many records are in ERHScons1999?
    2. What percentage of households have female heads?
    3. Is there a statistically significant difference between the percentage of female-headed households among poor and non-poor households?
    4. What percentage of Amhara households are considered poor?
    5. What percentage of households are in the SNNP region?
    6. How does the percentage of female-headed households vary by region?
    7. What is the average size of a household?
    8. What is the average size of a household in the Oromia region?
    9. How does household size vary with poverty status? (use the poor variable)

    Household members. The file p1sec1_rv1 contains information about each member of the household. It is at the individual level (each record is a person). You can answer the following questions using this file:

    1. What percentage of individuals are female?
    2. What percentage of individuals over 45 years old are female?
    3. What percentage of individuals under 5 are female?
    4. What percentage of women are married?
    5. What percentage of women over the age of 18 are married?
    6. Does this percentage vary among regions?
    7. What is the status of individuals as compared to round 4?
    8. What are the reasons for household members leaving since round 4?
    9. What was the major occupation of household heads?
    10. What was the major occupation of household members aged 7 to 15?

    Food and cash crops. The file p2s1b_rv1 contains information on production of food and cash crops. The data are at the crop level, meaning that each record represents one crop for one household. Only crops that are grown by each household are included in the file. The crop codes and labels are given in the variable crop. You can answer the following questions with this file.

    1. How many households in the sample grow maize and wheat?
    2. Among maize growers, what was the average area with maize?
    3. Among maize growers, what was the average amount of maize harvested?
    4. Among wheat growers, what was the average amount of wheat harvested?
    5. Does the average amount of maize harvested vary among regions?
    6. Does the average amount of wheat harvested vary among regions?
    7. Among farmers with more than 1 hectare of maize, what was the average amount of maize harvested?
    8. What is the average amount harvested for the major cereal crops (teff, barley, wheat, maize, and sorghum)?
    9. Farmers were asked "Was any of the land cultivated under new extension program?" What was the average response?
    10. Farmers were also asked "Was any of the land cultivated irrigated?" and the percentage of the land irrigated. Explore these variables.

    SECTION 5: CREATING NEW VARIABLES

    In the previous sections, we described how to explore the data using existing variables. In this section, we discuss how to create new variables. When new variables are created, they are in memory and they will appear in the Data Browser, but they will not be saved on the hard disk unless you use the save command.

    In this section, we will cover the following commands and options:

        generate
        replace
        tab, generate
        operators
        functions
        recode
        xtile

    generate

    This command is used to create a new variable. It is similar to compute in SPSS. The syntax is:

    generate newvar = exp [if exp]

    where exp is an expression like price*quant or 1000*kg. Several points about this command:

        Unlike compute in SPSS, generate cannot be used to change the definition of an existing variable. If you want to change an existing variable, you need to use replace.
        You can use gen or g as an abbreviation for generate.
        If the expression is an equality or inequality, the variable will take the value 0 if the expression is false and 1 if it is true.
        If you use if, the new variable will have missing values when the if statement is false.

    For example,

    generate age2 = age*age                  creates an age squared variable
    gen yield = outputkg/area if area>0      creates a new yield variable if area is positive
    gen price = value/quant if quant>0       creates a new price variable if quant is positive
    gen highprice = (price>1000)             creates a dummy variable equal to 1 for high prices
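    One point is easy to miss with dummy variables built from inequalities: Stata treats a missing value as larger than any number, so (price>1000) is 1 when price is missing. The sketch below uses made-up data (outputkg, area, and price are hypothetical values) and shows one way to guard against this with the missing() function.

```stata
* Toy data with a zero area and a missing price (hypothetical values)
clear
input outputkg area price
500 2 800
300 0 1200
.   1 .
end

generate yield = outputkg/area if area > 0                  // missing where area is 0
generate highprice = (price > 1000)                         // also 1 where price is missing
generate highprice2 = (price > 1000) if !missing(price)     // missing where price is missing
list
```

    Whether a missing price should count as 0, 1, or missing depends on the analysis, so it is worth deciding this explicitly rather than accepting the default.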

    Example 8: Using tab, gen to create dummy variables

    . tab q1a, gen(region)

         Region |      Freq.     Percent        Cum.
    ------------+-----------------------------------
         Tigray |        150       10.33       10.33
         Amhara |        466       32.09       42.42
         Oromia |        396       27.27       69.70
              7 |        139        9.57       79.27
              8 |        134        9.23       88.50
              9 |        167       11.50      100.00
    ------------+-----------------------------------
          Total |      1,452      100.00

    . tab region3

    q1a==Oromia |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |      1,056       72.73       72.73
              1 |        396       27.27      100.00
    ------------+-----------------------------------
          Total |      1,452      100.00
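    The numbering of the dummies created by tab, gen() follows the order of the categories, not their codes. The sketch below illustrates this on made-up data (the variable region and the stub reg are hypothetical): the code 3 is the second category, so its dummy is reg2.

```stata
* Toy data: region codes 1, 3, 4 (hypothetical values)
clear
input region
1
3
3
4
end

tabulate region, generate(reg)   // creates reg1, reg2, reg3, one per category
list region reg1 reg2 reg3
```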

    egen

    This is an extended version of generate (extended generate) that creates a new variable by aggregating the existing data. It is a powerful and useful command that does not exist in SPSS. It adds summary statistics to each observation. To do the same thing in SPSS, you would need to create a new file with aggregate and merge it with the original file using match files. The syntax is:

    egen newvar = fcn(arguments) [if exp] [in range] , by(var)

    where newvar is the new variable to be created and fcn is one of numerous functions such as:

    count()     number of non-missing values
    diff()      compares variables, 1 if different, 0 otherwise
    fill()      fill with a pattern
    group()     creates a group id from a list of variables
    iqr()       interquartile range
    ma()        moving average
    max()       maximum value
    mean()      mean
    median()    median
    min()       minimum value
    pctile()    percentile
    rank()      rank
    rmean()     mean across variables
    sd()        standard deviation
    std()       standardize variables
    sum()       sums

    argument is normally just a variable
    var in the by() option must be a categorical variable

    Here are some other examples:

    egen avg = mean(yield)                    creates a variable of average yield over the entire sample
    egen avg2 = median(income), by(sex)       creates a variable of median income for each sex
    egen regprod = sum(prod), by(region)      creates a variable of total production for each region

    Example 9: Using egen to calculate averages

    . egen avecon=mean(cons), by(q1c)
    . gen highavecon=(cons>avecon)
    . list hhid q1c cons avecon highavecon in 650/675

         +------------------------------------------------------------------+
         |         hhid              q1c       cons     avecon     highav~n |
         |------------------------------------------------------------------|
    650. | 407070000039   Sirbana Godeti    673.582   940.6532            0 |
    651. | 407070000040   Sirbana Godeti     793.05   940.6532            0 |
    652. | 407070000041   Sirbana Godeti    985.257   940.6532            1 |
    653. | 407070000042   Sirbana Godeti    844.477   940.6532            0 |
    654. | 407070000043   Sirbana Godeti    946.014   940.6532            1 |
         |------------------------------------------------------------------|
    655. | 407070000044   Sirbana Godeti   2206.057   940.6532            1 |
    656. | 407070000045   Sirbana Godeti   570.0535   940.6532            0 |
    657. | 407070000046   Sirbana Godeti   1340.926   940.6532            1 |
    658. | 407070000047   Sirbana Godeti    901.222   940.6532            0 |
    659. | 407070000048   Sirbana Godeti    887.775   940.6532            0 |
         |------------------------------------------------------------------|
    660. | 407070000049   Sirbana Godeti   1026.795   940.6532            1 |
    661. | 407070000051   Sirbana Godeti   1392.845   940.6532            1 |
    662. | 407070000052   Sirbana Godeti    574.218   940.6532            0 |
    663. | 407070000053   Sirbana Godeti     363.63   940.6532            0 |
    664. | 407070000054   Sirbana Godeti    926.551   940.6532            0 |
         |------------------------------------------------------------------|
    665. | 407070000055   Sirbana Godeti   1256.021   940.6532            1 |
    666. | 407070000057   Sirbana Godeti    753.478   940.6532            0 |
    667. | 407070000058   Sirbana Godeti   1378.575   940.6532            1 |
    668. | 407070000059   Sirbana Godeti   1640.834   940.6532            1 |
    669. | 407070000060   Sirbana Godeti    472.841   940.6532            0 |
         |------------------------------------------------------------------|
    670. | 407070000062   Sirbana Godeti    721.425   940.6532            0 |
    671. | 407070000063   Sirbana Godeti   1341.702   940.6532            1 |
    672. | 407070000064   Sirbana Godeti     781.82   940.6532            0 |
    673. | 407070000065   Sirbana Godeti   1962.697   940.6532            1 |
    674. | 407070000070   Sirbana Godeti    945.045   940.6532            1 |
         |------------------------------------------------------------------|
    675. | 407070000071   Sirbana Godeti   1742.247   940.6532            1 |
         +------------------------------------------------------------------+

    In Example 9, we want to know which households have expenditure (cons) above the village average. First, we calculate the average expenditure for each village with the egen command. Then we create a dummy variable based on the expression (cons > avecon). The list output shows how the village average is repeated for every household in the village and confirms that the dummy variable is correctly calculated.
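    The same two steps can be tried on a small made-up data set, where the group means are easy to check by hand. The variables village and cons and their values are hypothetical. As a side note, recent versions of Stata document the egen functions sum() and rmean() under the names total() and rowmean(); it is worth checking help egen on your installation.

```stata
* Toy data: two villages (hypothetical values)
clear
input village cons
1 100
1 300
2 50
2 150
2 250
end

egen avecon = mean(cons), by(village)                      // village mean on every row
generate highavecon = (cons > avecon) if !missing(cons)    // 1 if above the village mean
list, sepby(village)
```

    Here the village means are 200 and 150, so only the second and fifth households are flagged as above average.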

    operators

    This is not a Stata command, but a topic related to creating new variables. Most of the operators are obvious, but some are not. Unlike SPSS, you cannot use words like or, and, eq, or gt.

    Arithmetic

    +     addition
    -     subtraction
    *     multiplication
    /     division
    ^     power

    Relational

    >     greater than
    <     less than
    >=    greater than or equal

    gen DDfemale = 0

    replace DDfemale = 1 if q1b==9 & sexh==0

    or an easier way to do this would be:

    gen DDfemale = (q1b==9 & sexh==0)

    Or suppose you wanted to create a dummy variable for households in the two regions (Amhara and Oromia). This variable can be created with:

    gen amaoro = 0

    replace amaoro = 1 if q1a==3 | q1a==4

    or by one command:

    gen amaoro = (q1a==3 | q1a==4)

    You can also combine conditions using parentheses. Suppose you wanted a dummy variable that indicates if a household is a poor farmer in either the Tigray or the Amhara region. We will define poor as being in the bottom 20 percent and use the variable poor.

    gen PDF = ((q1a==1 | q1a==3) & poor==1)
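    The parentheses matter because Stata evaluates & before |. Without them, q1a==1 | q1a==3 & poor==1 would be read as q1a==1 | (q1a==3 & poor==1). The sketch below, on made-up data with hypothetical variables region, female, and poor, shows the relational and logical operators together.

```stata
* Toy data (hypothetical values); the last household has missing poverty status
clear
input region female poor
1 1 1
3 0 1
4 1 0
3 1 .
end

* & is "and", | is "or"; == tests equality and != tests inequality
generate amhfem     = (region == 3 & female == 1)
generate tigamh     = (region == 1 | region == 3)
generate poortigamh = ((region == 1 | region == 3) & poor == 1)
generate notoromia  = (region != 4)
list
```

    Note that the last household gets poortigamh equal to 0, not missing, because a missing value of poor is simply not equal to 1.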

    Note: Here is a list of some of the more commonly used functions for creating new variables in Stata. Other functions can be found by typing help functions in the Stata Command window.

    abs(x)          computes the absolute value of x.
    exp(x)          calculates e to the x power.
    ln(x)           computes the natural logarithm of x.
    log(x)          is a synonym for ln(x), the natural logarithm.
    log10(x)        computes the log base 10 of x.
    sqrt(x)         computes the square root of x.
    invnorm(p)      provides the inverse cumulative normal; invnorm(norm(z)) = z.
    normden(z)      provides the standard normal density.
    normden(z,s)    provides the normal density; normden(z,s) = normden(z)/s if s>0 and s is not missing, otherwise the result is missing.
    norm(z)         provides the cumulative standard normal.
    group(x)        creates a categorical variable that divides the data into x subsamples as nearly equal in size as possible, numbering the first group 1, the second group 2, etc. It uses the current order of the data.
    int(x)          gives the integer obtained by truncating x.
    round(x,y)      gives x rounded into units of y.
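    A few of these functions can be tried on made-up numbers, as in the sketch below (the variable x and its values are hypothetical). The example sticks to functions whose names are the same in old and new versions of Stata; the normal-distribution functions are documented in recent versions as normal(), normalden(), and invnormal(), which is worth checking with help functions.

```stata
* Toy data (hypothetical values)
clear
input x
2.718
-4
100
0
end

generate absx  = abs(x)
generate lnx   = ln(x)          // missing when x is zero or negative
generate rootx = sqrt(x)        // missing when x is negative
generate intx  = int(x)         // truncates toward zero
generate r10   = round(x, 10)   // rounds to the nearest multiple of 10
list
```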

    recode

    This command changes the values of a categorical variable according to the rules specified. It is like the recode command in SPSS except that in Stata you do not necessarily use parentheses. The syntax is:

    recode varname old=new old=new ... [if exp] [in range]

    Here are some examples:

    recode x 1=2             changes all values of x=1 to x=2
    recode x 1=2 3=4         changes 1 to 2 and 3 to 4
    recode x 1=2 2=1         exchanges the values 1 and 2 in x
    recode x 1=2 *=3         changes 1 in x to 2 and all other values to 3
    recode x 1/5=2           changes 1 through 5 in x to 2
    recode x 1 3 4 5 = 6     changes 1, 3, 4 and 5 to 6
    recode x .=9             changes missing to 9
    recode x 9=.             changes 9 to missing

    Notice that you can use some special symbols in the rules:

    *      means all other values
    .      means missing values
    x/y    means all values from x to y
    x y    means x and y

    For example, to recode region values 8 and 9 to 7:

    Example 10: Using recode to define a new variable

    . tab q1a

         Region |      Freq.     Percent        Cum.
    ------------+-----------------------------------
         Tigray |        150       10.33       10.33
         Amhara |        466       32.09       42.42
         Oromia |        396       27.27       69.70
              7 |        139        9.57       79.27
              8 |        134        9.23       88.50
              9 |        167       11.50      100.00
    ------------+-----------------------------------
          Total |      1,452      100.00

    . recode q1a 8 9=7
    (q1a: 301 changes made)

    . tab q1a

         Region |      Freq.     Percent        Cum.
    ------------+-----------------------------------
         Tigray |        150       10.33       10.33
         Amhara |        466       32.09       42.42
         Oromia |        396       27.27       69.70
              7 |        440       30.30      100.00
    ------------+-----------------------------------
          Total |      1,452      100.00
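    Because recode overwrites the original variable, it can be safer to store the recoded values in a new variable. A minimal sketch, assuming made-up data and a hypothetical new variable name region4, uses the parenthesized form of the rules together with the generate() option:

```stata
* Toy data: one household per region code (hypothetical values)
clear
input q1a
1
3
4
7
8
9
end

recode q1a (8 9 = 7), generate(region4)   // q1a itself is left unchanged
tabulate q1a region4                      // cross-check old against new values
```

    The cross-tabulation is a quick way to confirm that every old code went to the intended new code.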

    xtile

    This command creates a new variable that indicates which category a record falls into when the sample is sorted by an existing variable and divided into n groups of equal size. It is probably easier to explain with examples. xtile can be used to create a variable that indicates which income quintile a household belongs to, which decile in terms of farm size, or which tercile in terms of coffee production. The syntax is:

    xtile newvar = variable [if exp] [in range] , nq(#)

    where newvar is the new categorical variable created; variable is the existing variable used to create the quantile (e.g. income, farm size); and # is the number of different categories (e.g. 5 for quintiles, 3 for terciles).

    For example,

    xtile incquint = income, nq(5)
    xtile farmdec = farmsize, nq(10)

    Suppose we want to create a variable indicating the deciles of expenditure per capita.

    Example 11: Using xtile to generate deciles (using the ERHS99cons data)

    . xtile rconseadec= rconsae, nq(10)

    . tab rconseadec

             10 |
      quantiles |
     of rconsae |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |        145       10.01       10.01
              2 |        145       10.01       20.01
              3 |        145       10.01       30.02
              4 |        145       10.01       40.03
              5 |        145       10.01       50.03
              6 |        145       10.01       60.04
              7 |        145       10.01       70.05
              8 |        145       10.01       80.06
              9 |        145       10.01       90.06
             10 |        144        9.94      100.00
    ------------+-----------------------------------
          Total |      1,449      100.00

    . tab rconseadec sexh, col nofre

             10 |
      quantiles | Sex of household head
     of rconsae |    Female       Male |     Total
    ------------+----------------------+----------
              1 |      7.79      10.85 |     10.01
              2 |     10.30       9.90 |     10.01
              3 |      8.04      10.75 |     10.01
              4 |     10.30       9.90 |     10.01
              5 |      8.79      10.47 |     10.01
              6 |     10.30       9.90 |     10.01
              7 |     10.55       9.80 |     10.01
              8 |     10.05       9.99 |     10.01
              9 |     10.05       9.99 |     10.01
             10 |     13.82       8.47 |      9.94
    ------------+----------------------+----------
          Total |    100.00     100.00 |    100.00
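    To see what xtile does without the survey data, the sketch below builds twenty made-up incomes and divides them into quintiles. The variable names income and incquint are hypothetical.

```stata
* Toy data: 20 households with incomes 10, 20, ..., 200 (hypothetical values)
clear
set obs 20
generate income = _n * 10

xtile incquint = income, nq(5)     // 1 = lowest fifth, 5 = highest fifth
tabulate incquint
bysort incquint: summarize income  // income range within each quintile
```

    With ties or a sample size that is not a multiple of the number of groups, the groups will not be exactly equal in size, as in the last decile of Example 11.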

    Exercise 2

    1. Use the file ERHScons1999. Create a variable called reg4 which indicates whether a household is in the Oromia or other regions. Then do a frequency table of the new variable.

    2. Using the same file, create a variable called hhquint that indicates the quintile of household size. Then do a frequency table on the new variable.

    3. Using the same file, create a dummy variable called enbug that is equal to 1 if the household is in the Enemayi or Bugena wereda and 0 otherwise. Then do a frequency table on the new variable.

    4. Create a new variable avgexp which is equal to the wereda average of food expenditure (food). Then calculate a new variable equal to the difference between the household food expenditure and the wereda average expenditure.

    5. Using the same file, create a new variable splot which is 1 if the person is cultivating a single plot and 0 otherwise.

    6. Use file p1sec1_rv1. Create a set of dummy variables called relatxx based on the relationship of the person to the household head. For example, relat01 is a dummy for being the head, relat02 is a dummy for being the spouse, relat03 for a child, and so on.

    SECTION 6: MODIFYING VARIABLES

    In this section, we introduce some more powerful and flexible commands for generating results from survey data. We begin with an explanation of how to label data in Stata. Then we see how to format variables. These are the topics and commands covered in this section:

        rename variable
        label variable
        label define
        label values
        format variable

    rename variables

    This command gives an existing variable a new name. The syntax is:

    . rename old_variable new_variable

    For instance, generate regional dummy variables and then rename them:

    Example 12: Renaming variables

    . tab q1a, gen(index)

         Region |      Freq.     Percent        Cum.
    ------------+-----------------------------------
         Tigray |        150       10.33       10.33
         Amhara |        466       32.09       42.42
         Oromia |        396       27.27       69.70
           SNNP |        440       30.30      100.00
    ------------+-----------------------------------
          Total |      1,452      100.00

    . tab index1

    q1a==Tigray |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |      1,302       89.67       89.67
              1 |        150       10.33      100.00
    ------------+-----------------------------------
          Total |      1,452      100.00

    . tab index2

    q1a==Amhara |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |        986       67.91       67.91
              1 |        466       32.09      100.00
    ------------+-----------------------------------
          Total |      1,452      100.00

    rename index1 Tigray     rename index1 variable to Tigray
    rename index2 Amhara     rename index2 variable to Amhara
    rename index3 Oromia     rename index3 variable to Oromia
    rename index4 SNNP       rename index4 variable to SNNP

    label variable

    This command is used to attach labels to variables in order to make the output easier to understand. For example, we know that Tigray is region 1 and SNNP is region 7, so we may want to label the variables as follows:

    label variable Tigray "Region 1"
    label variable Amhara "Region 3"
    label variable Oromia "Region 4"
    label variable SNNP "Region 7"

        You can use the abbreviation label var.
        If there are spaces in the label, you must use double quotation marks. If there are no spaces, quotation marks are optional.
        This command is like variable label in SPSS except that you can only label one variable per command and Stata uses double quotation marks, not single.
        The limit is 80 characters for a label, but any labels over 30 characters will probably not look good in a table.

    label define

    This command gives a name to a set of value labels. For example, instead of numbering the regions, we can assign a label to each region. Instead of numbering the different sources of income, we can give them labels. The syntax is:

    label define lblname # "label" # "label" # "label" [, add modify]

    where
        lblname is the name given to the set of value labels
        # are the value numbers
        label are the value labels
        add means that you want to add these value labels to the existing set
        modify means that you want to change these values in the existing set

    Note that:
        You can use the abbreviation label def.
        The double quotation marks are only necessary if there are spaces in the labels.
        Stata will not let you define an existing label unless you say modify or add.
        This command is similar to value label in SPSS except that in Stata you give the labels a name and later attach it to the variable, while in SPSS you attach it to the variable in the same command.

    label values

    This command attaches a named set of value labels to a categorical variable. The syntax is:

    label values varname [lblname] [, nofix]

    where
    varname is the categorical variable which will get the labels
    lblname is a set of labels that has already been defined by label define

    Here is an example of labeling values in Stata.

    label define reg 1 "Tigray" 3 "Amhara" 4 "Oromia" 7 "SNNP", modify
    label values q1a reg

    Some additional commands that may be useful in labeling:

    label dir           to request a list of existing label names
    label list          to request a list of all the existing value labels
    label drop          to delete one or more labels
    label save using    to save label definitions as a do-file
    label data          to give a label to a data file
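The labeling commands above can be tried end to end on a small example. The sketch below is only an illustration: it uses the auto dataset that ships with Stata rather than the ERHS files, and the variable highmpg and the label set yesno are hypothetical names introduced for this example.

```stata
* Illustration on Stata's built-in auto dataset (not the ERHS data)
sysuse auto, clear

* Create a 0/1 indicator; mpg has no missing values in this dataset
generate byte highmpg = mpg > 25

* Label the variable, define a named set of value labels, then attach it
label variable highmpg "Mileage above 25 mpg"
label define yesno 0 "No" 1 "Yes"
label values highmpg yesno

* Extend the existing set; without add or modify Stata reports an error
label define yesno 9 "Unknown", add

* Inspect the result
label dir
label list yesno
tabulate highmpg
```

Defining the label set once and attaching it with label values means the same set (here yesno) can be reused for any other yes/no variable in the file.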

    format

    The format command allows you to specify the display format for variables. The internal precision of the variables is unaffected.

    The syntax for the format command is

    . format varlist %fmt

    where %fmt is listed below:

    %fmt        description                             example
    -------------------------------------------------------------
    Right-justified formats
    %#.#g       general numeric format                  %9.0g
    %#.#f       fixed numeric format                    %9.2f
    %#.#e       exponential numeric format              %10.7e
    %d          default numeric elapsed date format     %d
    %d...       user-specified elapsed date format      %dM/D/Y
    %#s         string format                           %15s

    Right-justified, comma formats
    %#.#gc      general numeric format                  %9.0gc
    %#.#fc      fixed numeric format                    %9.2fc

    Leading-zero formats
    %0#.#f      fixed numeric format                    %09.2f
    %0#s        string format                           %015s

    Left-justified formats
    %-#.#g      general numeric format                  %-9.0g
    %-#.#f      fixed numeric format                    %-9.2f
    %-#.#e      exponential numeric format              %-10.7e
    %-d         default numeric elapsed date format     %-d
    %-d...      user-specified elapsed date format      %-dM/D/Y
    %-#s        string format                           %-15s

    Left-justified, comma formats
    %-#.#gc     general numeric format                  %-9.0gc
    %-#.#fc     fixed numeric format                    %-9.2fc

    Centered formats
    %~#s        string format (special)                 %~15s
    -------------------------------------------------------------
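A short sketch may help show that format changes only how values are displayed. It uses the auto dataset that ships with Stata as a stand-in for the training data; the particular formats chosen are just examples from the table above.

```stata
* Illustration on Stata's built-in auto dataset (not the ERHS data)
sysuse auto, clear

* Fixed format with two decimals and comma separators
format price %9.2fc
list make price in 1/3

* Left-justified general format
format price %-9.0g
list make price in 1/3

* One decimal place for a variable stored with more precision
format gear_ratio %6.1f
list make gear_ratio in 1/3

* The stored values are untouched by any of the format changes
summarize price gear_ratio
```

Because only the display changes, statistics computed after a format command are identical to those computed before it.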

    Exercise 3
    1. Using the variables created in Exercise 2, label the newly created variables and their values.
    2. Label the data file "This data is used for training".
    3. List the existing label names.

    SECTION 7: ADVANCED DESCRIPTIVE STATISTICS

    In Section 3, we looked at preliminary descriptive statistics, mostly used to explore the nature of the data. In this section we explore more advanced statistics.

    tabulate summarize

    This command creates one- and two-way tables that summarize continuous variables. The command tabulate by itself gives frequencies and percentages in each cell (cross-tabulations). With the summarize() option, each cell instead shows the mean and other statistics of a continuous variable. The syntax is:

    tabulate varname1 [varname2] [if exp] [in range], summarize(varname3) [options]

    where
    varname1 is a categorical row variable
    varname2 is a categorical column variable (optional)
    varname3 is the continuous variable summarized in each cell
    options can be used to tell Stata which statistics you want

    Some notes regarding this command:
    The default statistics are the mean, the standard deviation, and the frequency.
    You can specify which statistics to show with the options means, standard, and freq.
    You can use the abbreviation tab ..., sum( ).

    Some examples:

    tab q1a, sum(cons)         gives the mean, standard deviation, and frequency of monthly consumption for each region
    tab q1b, sum(cons) mean    gives the mean consumption for each wereda
    tab q1a sexh, sum(food)    gives the mean, standard deviation, and frequency of food consumption in each cell defined by region and sex of household head

    The first table is a one-way table (just one categorical variable) showing the mean, standard deviation, and frequency of monthly consumption for each region.

    In the second table, we use the mean option so that only mean consumption is shown. In the third table, we add a second categorical variable (sexh), making it a two-way table.

    A two-way table that shows all the default statistics in every cell is difficult to read, so in two-way tables it is usually better to request only the statistic you need.
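The three forms of tabulate with summarize() described above can be sketched on the auto dataset that ships with Stata. This is an illustration only: foreign and rep78 stand in for the region and sex variables, and price stands in for consumption.

```stata
* Illustration on Stata's built-in auto dataset (not the ERHS data)
sysuse auto, clear

* One-way table: mean, standard deviation, and frequency of price by origin
tabulate foreign, summarize(price)

* One-way table showing means only
tabulate foreign, summarize(price) means

* Two-way table; means only keeps each cell readable
tabulate rep78 foreign, summarize(price) means
```

In the two-way case, leaving out the means option prints three numbers per cell, which is what makes the default output hard to scan.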

    Example 13: Use tabulate, sum() to generate tables

    . tab q1a, sum(cons)

                |  Summary of consumption per month
         Region |        Mean   Std. Dev.       Freq.
    ------------+------------------------------------
         Tigray |   413.93552     297.701         149
         Amhara |   545.91653   467.28072         465
         Oromia |   697.09029   478.55749         395
           SNNP |    331.7384   221.15601         440
    ------------+------------------------------------
          Total |   508.51838    420.4014        1449

    . tab q1b, sum(cons) mean

                |  Summary of consumption per month
         Wereda |        Mean
    ------------+------------
          Atsbi |   417.16834
      Sebhassah |      409.87
        Ankober |   301.87563
       Basso na |   777.31823
        Enemayi |     234.392
         Bugena |   542.38657
           Adaa |   940.65322
          Kersa |   567.89355
         Dodota |   526.58473
      Shashemen |   775.34926
          Cheha |   342.54209
      Kedida Ga |   239.09955
           Bule |   379.28676
         Boloso |   266.93705
       Daramalo |   416.28045
    ------------+------------
          Total |   508.51838

    . tab q1a sexh, sum(cons)

    Means, Standard Deviations and Frequencies of consumption per month

               |  Sex of household head
        Region |     Female        Male |       Total
    -----------+------------------------+------------
        Tigray |  342.44136    488.3678 |   413.93552
               |  277.62091   301.46008 |     297.701
               |         76          73 |         149
    -----------+------------------------+------------
        Amhara |  450.61424   582.89951 |   545.91653
               |  368.60452   495.93838 |   467.28072
               |        130         335 |         465
    -----------+------------------------+------------
        Oromia |  610.49528   728.85178 |   697.09029
               |  518.32024   459.98768 |   478.55749
               |        106         289 |         395
    -----------+------------------------+------------
          SNNP |  271.02927   346.48695 |    331.7384
               |  171.91652   229.33158 |   221.15601
               |         86         354 |         440
    -----------+------------------------+------------
         Total |   433.7347   536.83799 |   508.51838
               |  389.69001   428.24021 |    420.4014
               |        398        1051 |        1449

    tabstat

    This command gives summary statistics for a set of continuous variables for each value of a categorical variable. The syntax is:

    tabstat varlist [if exp] [in range], stat(statname [...]) by(varname)

    where
    varlist is a list of continuous variables
    statname is a type of statistic
    varname is a categorical variable

    Some facts about this command:
    The default statistic is the mean.
    Optional statistics include mean, sum, max, min, range, sd (standard deviation), var (variance), skewness, kurtosis, median, and pN (Nth percentile, for example p25).
    Without the by() option, tabstat is like summarize except that it allows you to specify the list of statistics to be displayed.
    With the by() option, tabstat is like tabulate with summarize() except that tabstat is more flexible in the statistics and format.
    It is very similar to the SPSS command means.

    Examples

    tabstat food hhsize, stats(mean max min)    gives the mean, max, and min of food and hhsize
    tabstat food hhsize, by(q1a)                gives the mean of the two variables for each region
    tabstat food, stats(median) by(q1a)         gives the median food consumption for each region

    The tabstat command displays summary statistics for a series of numeric variables in a single table.
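As a sketch of how the statistics() and by() options combine, the commands below use the auto dataset that ships with Stata; price and mpg stand in for the consumption variables and foreign for the grouping variable.

```stata
* Illustration on Stata's built-in auto dataset (not the ERHS data)
sysuse auto, clear

* Several statistics for two variables, over the whole sample
tabstat price mpg, statistics(mean p50 sd cv min max)

* The same statistics within each category of foreign
tabstat price mpg, statistics(mean p50 sd cv min max) by(foreign)

* columns(statistics) places one statistic per column, one group per row
tabstat price, statistics(mean p50 sd cv min max) by(foreign) columns(statistics)
```

The last layout, with groups in rows and statistics in columns, is the one used in Example 14.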


    Example 14: Using tabstat to create a table

    . tabstat rconsae, s(mean p50 sd cv min max) by(rconseadec) missing

    Summary for variables: rconsae
         by categories of: rconseadec (10 quantiles of rconsae)

    rconseadec |      mean       p50        sd        cv       min       max
    -----------+------------------------------------------------------------
             1 |  21.80935   21.9194  5.773654   .264733  4.811201  30.40175
             2 |  36.24088  36.03099  3.400392  .0938275   30.6191  42.70621
             3 |  48.52454  48.31921   3.09388  .0637591  42.74319  53.91997
             4 |  60.38483   60.0903  3.811244  .0631159  54.00354  66.85229
             5 |  73.09496  72.92955   3.61339  .0494342  66.90016  79.38206
             6 |   89.3758  89.33151  5.708862  .0638748  79.39233  99.11871
             7 |   110.407  110.2909  6.692319   .060615  99.12563  122.8186
             8 |  137.7846  137.5525  9.298181  .0674835  123.5698  154.9666
             9 |  179.5007  176.1209  17.33479  .0965723  155.0732  214.4674
            10 |  332.2927  285.4411  135.2309  .4069633  214.4888  1212.256
             . |         .         .         .         .         .         .
    -----------+------------------------------------------------------------
         Total |  108.7874  79.38206  97.27053  .8941343  4.811201  1212.256
    ------------------------------------------------------------------------

    table

    This command creates a wide variety of tables. It is probably the most flexible and useful of all the table commands in Stata. The syntax is:

    table rowvar [colvar] [if exp] [in range], c(clist) [row col]

    where
    rowvar is the categorical row variable
    colvar is the categorical column variable
    clist is a list of statistics and variables
    row is an option to include a summary row
    col is an option to include a summary column

    Some useful facts about this command:
    The default statistic is the frequency.
    Optional statistics are mean, sd, sum, rawsum (unweighted), count, max, min, median, and pN (Nth percentile).
    c( ) is short for contents( ) and specifies the contents of each cell.
    Like tab, it can be used to create one- and two-way frequency tables, but table cannot do percentages.
    Like tab with sum( ), it can be used to calculate basic statistics for each value of a categorical variable. Its advantage is that it offers more statistics and can take more than one continuous variable.
    Like tabstat, it can be used to calculate advanced statistics for each value of a categorical variable. Its advantage over tabstat is that it can produce two-way (and higher) tables, but its disadvantage is that it has fewer statistics.
    It is similar to table in SPSS, but easier to learn and less flexible in formatting.

    Here are some examples:

    table q1a, row                                   table of frequencies by region with a total row
    table q1a, c(mean income)                        table of average income by region
    table q1a, c(mean yield sd yield median yield)   table of yield statistics by region
    table q1a, c(mean yield) format(%9.2f)           table of average yields by region, shown with two decimal places
    table q1a sexh, c(mean yield)                    table of average yield by region and sex
    table q1a sexh, c(mean income mean yield)        table of average yield and income by region and sex

    The table command calculates and displays tables of statistics, including frequency, mean, standard deviation, sum, and the 1st to 99th percentiles. The row and col options add a total row and a total column to the table. Some output from table commands is shown in Example 15.

    Example 15: Tabulate median real per capita consumption by region vs sex of household head

    . table q1a sexh, contents(p50 rconsae) row col missing

              |    Sex of household head
       Region |    Female       Male      Total
    ----------+--------------------------------
       Tigray |  73.05909   74.20448   73.56232
       Amhara |  124.9734   95.00103   104.7363
       Oromia |  98.59296   99.43469   98.75433
         SNNP |  53.73735   50.34177   51.14911
              |
        Total |  90.04483   77.18623   79.38206

    . table rconseadec, c(mean rconsae)

    10        |
    quantiles |
    of        |
    rconsae   |  mean(rconsae)
    ----------+---------------
            1 |       21.80935
            2 |       36.24088
            3 |       48.52454
            4 |       60.38483
            5 |       73.09496
            6 |        89.3758
            7 |        110.407
            8 |       137.7846
            9 |       179.5007
           10 |       332.2927

    Exercise 4
    1. Use ERHScons1999 and tabulate basic summary statistics showing the mean, standard deviation, and frequency of per capita food consumption for each village. Interpret the result.
    2. Repeat the same procedure as in question 1, but report only the median of food consumption.
    3. Tabulate basic summary statistics for food consumption by sex of household head and region (use a single table).
    4. Tabulate the mean, 25th percentile, median, 75th percentile, sd, cv, min, and max of real food consumption per capita by deciles of real consumption per capita.
    5. Tabulate median real food consumption per capita by sex of household head and deciles of real consumption per capita (use a single table).

    SECTION 8: PRESENTING DATA WITH GRAPH (GRAPHING DATA)

    This section provides a brief introduction to creating graphs. In Stata, all graphs are made with the graph command, but there are 8 types of charts and numerous subcommands for controlling the type and format of a graph. In this section, we focus on four types of graph and a few options.

    The commands that draw graphs are
    graph twoway    scatterplots, line plots, etc.
    graph matrix    scatterplot matrices
    graph bar       bar charts
    graph dot       dot charts
    graph box       box-and-whisker plots
    graph pie       pie charts

    Graph commands can also be used to produce histograms, box plots, kernel density plots, P-P plots, and Q-Q plots, but we postpone these until the discussion of normality later. Let us first acquaint ourselves with some twoway graph commands.

    A two-way scatterplot can be drawn with the (graph) twoway scatter command to show the relationship between two variables, here cons (total consumption) and food (food consumption). As we would expect, there is a positive relationship between the two variables.

    . graph twoway scatter cons food

    [Figure: scatterplot of consumption per month (y axis) against food consumption per month (x axis)]

    We can show the regression line predicting cons from food using the lfit plot type.

    . twoway lfit cons food

    [Figure: fitted values (y axis) against food consumption per month (x axis)]

    Two plot types can be overlaid in a single graph. Here a scatterplot and a fitted line show consumption against household size:

    . twoway (scatter cons hhsize) (lfit cons hhsize)

    [Figure: consumption per month and fitted values (y axis) against household size (x axis)]
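The overlay of a scatterplot and a fitted line can be sketched on the auto dataset that ships with Stata, with price and weight standing in for consumption and household size. The axis titles and legend text are choices made for this illustration, and the /// line continuation assumes the commands are run from a do-file.

```stata
* Illustration on Stata's built-in auto dataset (not the ERHS data)
sysuse auto, clear

* Scatterplot with the linear fit of price on weight drawn on top
twoway (scatter price weight) (lfit price weight), ///
    ytitle("Price") xtitle("Weight") ///
    legend(order(1 "Observed" 2 "Fitted values"))

* The fitted line comes from this regression; its slope is _b[weight]
regress price weight
display "slope of the fitted line = " _b[weight]
```

Running regress alongside the graph is a quick way to attach a number to the pattern the eye sees in the plot.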

    Exercise 5: Draw a two-way scatterplot with a fitted line for consumption per capita vs household size and explain its pattern.


    SECTION 9: NORMALITY AND OUTLIER

    Check for Normality

    An outlier is an observation that lies at an abnormal distance from other values in a random sample from a population. We must be extremely mindful of possible outliers and their adverse effects during any attempt to measure the relationship between two continuous variables.

    There are no official rules for identifying outliers. In a sense, this definition leaves it up to the analyst (or a consensus process) to decide what will be considered abnormal. Sometimes it is obvious that an outlier is simply miscoded (for example, age reported as 230) and hence should be set to missing. But most of the time this is not the case.

    Before abnormal observations can be singled out, it is necessary to characterize normal observations.

    Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point. The skewness of a normal distribution is zero, and any symmetric data should have a skewness near zero. Negative values of skewness indicate data that are skewed left, and positive values indicate data that are skewed right. By skewed left, we mean that the left tail is longer than the right tail. Similarly, skewed right means that the right tail is longer than the left tail.

    Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. That is, data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails. Data sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak. A uniform distribution would be the extreme case. The normal distribution has a kurtosis of three, and this is the scale on which Stata reports kurtosis. Values above three indicate a "peaked" distribution and values below three indicate a "flat" distribution. (Some texts subtract three and report excess kurtosis, in which case positive values indicate a peaked distribution and negative values a flat one.) A value of 6 or larger on the true kurtosis indicates a large departure from normality.

    We can obtain skewness and kurtosis values by using the detail option of the summarize command. In the output below, the variable rconspc (real consumption per capita) has a skewness of 3.21 and a kurtosis of 21.70, so it is clearly skewed to the right and has a peaked distribution. Both statistics indicate that the distribution of rconspc is far from normal.

    . sum rconspc

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
         rconspc |      1449    90.36742    81.99623    4.22011   1018.295

    . sum rconspc, detail

              real consumption per capita 1994 prices
    -------------------------------------------------------------
          Percentiles      Smallest
     1%     11.65814        4.22011
     5%     18.67906       6.865227
    10%     25.10425       7.068164       Obs                1449
    25%     39.94022       8.201794       Sum of Wgt.        1449

    50%     65.99258                      Mean           90.36742
                            Largest       Std. Dev.      81.99623
    75%     114.2533       577.1937
    90%     180.8909       624.1437       Variance       6723.382
    95%     236.1537       660.1689       Skewness       3.212314
    99%     405.8775       1018.295       Kurtosis       21.69683
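To make the skewness and kurtosis figures less of a black box, the sketch below recomputes them from the central moments and compares them with what summarize, detail stores in r(). It uses the auto dataset that ships with Stata in place of the ERHS data, and all scalar and variable names (sk_stata, dev2, and so on) are hypothetical names chosen for this illustration.

```stata
* Illustration on Stata's built-in auto dataset (not the ERHS data)
sysuse auto, clear

* summarize, detail leaves its results in r()
quietly summarize price, detail
scalar xbar     = r(mean)
scalar sk_stata = r(skewness)
scalar ku_stata = r(kurtosis)

* Central moments m_r = sum((x - mean)^r) / n for r = 2, 3, 4
generate double dev2 = (price - xbar)^2
generate double dev3 = (price - xbar)^3
generate double dev4 = (price - xbar)^4
quietly summarize dev2
scalar m2 = r(mean)
quietly summarize dev3
scalar m3 = r(mean)
quietly summarize dev4
scalar m4 = r(mean)

* skewness = m3 / m2^(3/2); kurtosis = m4 / m2^2 (3 for a normal distribution)
scalar sk_hand = m3 / m2^1.5
scalar ku_hand = m4 / m2^2
display "skewness: reported " sk_stata "   by hand " sk_hand
display "kurtosis: reported " ku_stata "   by hand " ku_hand
```

Because the kurtosis is not reduced by three, it should be compared with 3, not with 0, when judging departure from normality.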

    Besides commands for descriptive statistics, such as summarize, we can also check the normality of a variable visually by looking at some basic graphs in Stata, including histograms, boxplots, kdensity, pnorm, and qnorm. Let's keep using rconspc from the ERHScons1999.dta file to make some graphs.

    The histogram command is an effective graphical technique for showing both the skewness and kurtosis of rconspc.

    histogram rconspc

    [Figure: histogram of real consumption per capita 1994 prices (x axis) with density on the y axis]

    The normal option can be used to get a normal overlay. This shows the skew to the right in rconspc.

    . histogram rconspc, normal

    [Figure: histogram of real consumption per capita 1994 prices with a normal density overlay]

    The bin() option specifies the number of bins into which the data are aggregated. Increasing the number of bins to 100 illustrates the distribution of rconspc in more detail. Notice that the histogram resembles a bell-shaped curve truncated at 0, with a long tail to the right.

    . histogram rconspc, normal bin(100)

    [Figure: histogram of real consumption per capita 1994 prices with 100 bins and a normal density overlay]

    graph box draws vertical box plots. In a vertical box plot, the y axis is numerical and the x axis is categorical. The lower and upper bounds of the box are the 25th and 75th percentiles of rconspc, and the line within the box is the median. The whiskers extend to the lower and upper adjacent values, that is, the most extreme observations lying within 1.5 times the interquartile range of the box; observations beyond the whiskers are plotted individually as outside values. The graph box command can therefore help us examine the distribution of rconspc. If rconspc were normal, the median would be in the center of the box and the ends of the whiskers would be equidistant from the box.

    The boxplot for rconspc shows positive skew. The median is pulled to the low end of the box, and the upper whisker and the outside values above it stretch far away from the box, for both male- and female-headed households. In fact, it seems worse for male-headed households.

    . graph box rconspc, by(sexh)

    [Figure: box plots of real consumption per capita 1994 prices, by sex of household head (Female, Male)]

    The kdensity command with the normal option displays a kernel density estimate of a variable with a normal density superimposed on the graph. It is particularly useful for checking whether regression residuals are normally distributed, which is a very important assumption for regression; here we apply it to rconspc itself. The plot shows that rconspc is skewed to the right and has a higher peak than the normal distribution.

    . kdensity rconspc, normal

    [Figure: kernel density estimate of real consumption per capita 1994 prices with a normal density overlay]

    Graphical alternatives to the kdensity command are the P-P plot and the Q-Q plot.

    The pnorm command produces a P-P plot, which graphs a standardized normal probability. The plot should be approximately linear if the variable follows a normal distribution. The straighter the line formed by the P-P plot, the more closely the variable's distribution conforms to the normal distribution.

    . pnorm rconspc

    [Figure: P-P plot with Normal F[(rconspc-m)/s] on the y axis and Empirical P[i] = i/(N+1) on the x axis]

    The qnorm command plots the quantiles of a variable against the quantiles of a normal distribution. The closer the Q-Q plot lies to the 45-degree line, the more nearly normal the variable's distribution is.

    . qnorm rconspc

    [Figure: Q-Q plot of real consumption per capita 1994 prices (y axis) against the inverse normal (x axis)]

    Both the P-P and Q-Q plots show that rconspc is not normal, with a long tail to the right. The qnorm plot is more sensitive to deviations from normality in the tails of the distribution, whereas the pnorm plot is more sensitive to deviations near the mean of the distribution.

    From the statistics and graphs we can confidently conclude that there are outliers, especially at the upper end of the distribution.
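The graphical checks in this section can be collected into one short do-file. The sketch below uses the auto dataset that ships with Stata, with price in place of rconspc and foreign in place of sexh. The name() options and the final sktest command are additions for this illustration and are not part of the original examples: name() keeps each graph in memory instead of overwriting the previous one, and sktest is a formal skewness-kurtosis test that complements the plots.

```stata
* Illustration on Stata's built-in auto dataset (not the ERHS data)
sysuse auto, clear

* Histogram with a normal overlay and a chosen number of bins
histogram price, normal bin(20) name(g_hist, replace)

* Box plots by group
graph box price, by(foreign) name(g_box, replace)

* Kernel density with a normal density superimposed
kdensity price, normal name(g_kden, replace)

* P-P plot (sensitive in the middle) and Q-Q plot (sensitive in the tails)
pnorm price, name(g_pp, replace)
qnorm price, name(g_qq, replace)

* Formal complement to the plots: skewness-kurtosis test of normality
sktest price
```

Each named graph can be brought back with graph display, for example graph display g_qq.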

    Dealing with outliers

    There are generally three ways to deal with outliers. The easiest is to delete them from the analysis. The second is to use measures that are not sensitive to them, such as the median instead of the mean, or to transform the data to be more normal. The most complicated is to replace them by imputation.
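The second approach, using robust measures or transforming the data, can be sketched briefly. The example uses the auto dataset that ships with Stata, and the logarithm is only one possible transformation, assumed here because the variable is strictly positive and right-skewed; lnprice is a hypothetical variable name.

```stata
* Illustration on Stata's built-in auto dataset (not the ERHS data)
sysuse auto, clear

* Mean versus median: the median is much less affected by a long right tail
tabstat price, statistics(mean p50 skewness kurtosis)

* A log transformation pulls in the right tail (price is strictly positive)
generate double lnprice = ln(price)
tabstat price lnprice, statistics(mean p50 skewness kurtosis) columns(statistics)

* Compare the Q-Q plots before and after the transformation
qnorm price, name(qq_raw, replace)
qnorm lnprice, name(qq_log, replace)
```

Results from a transformed variable are on the transformed scale, so means and regression coefficients have to be interpreted accordingly.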

    Since our data are heavily right-tailed, we will focus on very large outliers. A customary criterion is to identify an observation as an outlier when it lies more than three standard deviations from the median. Note that we use the median because it is a robust statistic: if there are big outliers the mean will shift a lot, but the median will not.

    Example 16: Using robust statistics to replace outliers
    /* Calculate number of standard deviations from median by sex of hh head */

    . use "C:\..\training\ERHScons1999.dta", clear

    . egen median=median(rconspc), by(sexh)

    . egen sd=sd(rconspc), by(sexh)

    . gen ratio=(rconspc-median)/sd
    *(3 missing values generated)
    . gen outlier=1 if ratio>3 & ratio~=.
    *(1414 missing values generated)
    . replace outlier=0 if outlier==. & ratio~=.
    *(1411 real changes made)

    . tabulate outlier, missing

        outlier |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |      1,411       97.18       97.18
              1 |         38        2.62       99.79
              . |          3        0.21      100.00
    ------------+-----------------------------------
          Total |      1,452      100.00

    Only 38 observations are identified as outliers. When we compare the mean and median values using the table command, excluding the outliers lowers the mean by around 5% and 14% among female- and male-headed households, respectively, while the medians are less sensitive to the outliers.

    Example 17: Comparing mean and median values to replace outliers

    . table sexh outlier, contents(mean rconspc) row col missing

    ----------------------------------------
    Sex of    |
    household |           outlier
    head      |        0         1     Total
    ----------+-----------------------------