Lecture2-Stata - Pennsylvania State University · 7/7/16


Lecture 2: Programming Statistics in Stata
Christopher S. Hollenbeak, PhD
Jane R. Schubart, PhD
The Outcomes Research Toolbox

Review

•  Questions from last lecture?
   –  What is probability?
      •  What is a probability distribution?
   –  Types of data: continuous, categorical, binary
      •  Examples of each
   –  What is a dependent variable? Independent variable?
   –  What is the appropriate statistical test for…
      •  Comparing two means? Three means? Two proportions? Three proportions?
   –  What does a p-value tell you?

Objectives

•  Introduce you to Stata Software and get you started summarizing your data
•  Give you basic code to start working with data
•  Give you code to create new variables
•  Use Stata to compute descriptive statistics
•  Use Stata to perform univariate statistical tests (t test, chi-square test, ANOVA)
•  Use Stata to create basic graphs


Overview

•  Stata is software for performing data analysis
•  Stata interface
•  Common tasks in Stata
   –  Import data
   –  Create new variables
   –  Summarize data (means, standard deviations, histograms)
   –  Perform statistical tests
   –  Graphics

Objective

•  Most studies start with “Table 1”, which includes
   –  Summary statistics for the sample
      •  Sample size
      •  Demographics
      •  Disease characteristics
      •  Treatment characteristics
   –  Stratification by key outcome
   –  Statistical tests comparing characteristics by stratification
•  Objective is to use Stata to create Table 1


Stata Interface

[Screenshot of the Stata interface with labeled panes: Command History, Variables, Results, Commands]

Stata Interface

•  There are two ways to interact with Stata
   –  Issue commands directly in the command window
   –  Write commands in a text file (called a “do” file in Stata parlance) and send them to Stata from there; the output appears in the results window
•  Always use “do” files
   –  Creates a permanent record of your work
   –  Can easily re-use large chunks of code
•  Occasionally use the command window

Stata Workflow

•  Import data into Stata
•  Create new variables for analysis
•  Deal with missing values
•  Perform analyses
   –  Tables, graphs
•  Move tabular results into Excel for formatting
•  Save graphs as graphic files
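As a rough sketch, a do-file that follows this workflow might look like the one below. The folder, file, and variable names are the ones used in this lecture's examples and are assumptions about your own project; the individual commands are covered on the slides that follow.

```stata
* Sketch of a do-file following the workflow (names are illustrative)
clear
cd "~/projects/ltd/"
* 1. Import data
insheet using "ltd_data.csv"
* 2. Create new variables
generate male = 0
replace male = 1 if sex == "M"
* 3. Deal with missing values
drop if age == .
* 4. Perform analyses
summarize age male
* 5. Save a native Stata copy of the cleaned data
save ltd, replace
```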


Stata Commands

•  Stata syntax is usually a command, followed by the variable names it applies to, then any restriction on observations (if), and then a comma followed by other options

    command varlist if var==x, options

•  To execute a command, highlight the entire line of code in the do-file editor and press:
   –  Windows: Ctrl+D
   –  Mac: Shift+Cmd+D
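Two concrete instances of this pattern are sketched below. They assume the variables age, male, female, and ssi from this lecture's liver transplant data set already exist.

```stata
* command   varlist     if condition,  options
summarize   age         if male == 1,  detail
tabulate    female ssi  if age >= 40,  row col
```

The extra spacing is only there to line up the parts; Stata ignores it.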


Importing Data

•  Stata can handle several types of raw data files
•  Comma separated value (.csv) text files seem to work best
•  Can save Excel files as .csv files
•  Start by pointing Stata to the folder where your data file is stored using cd command (cd means change directory)

Importing Data

•  Command for importing data is: insheet

    Mac/Unix: cd "~/projects/ltd/"
    Windows:  cd "c:\projects\ltd\"

    insheet using "ltd_data.csv"
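In Stata 13 and later, insheet still runs but has been superseded by import delimited. A minimal equivalent, assuming the same file name as above:

```stata
* Alternative to insheet in Stata 13 and later
import delimited using "ltd_data.csv", clear
* List the variable names and storage types that were read in
describe
```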


Stata Interface


Create New Variables

•  Command to create new variables is generate
•  Use in conjunction with replace
   –  generate creates the new variable, set here to 0 for every observation
   –  replace then sets the 1s
•  Use this to create binary dummy variables
•  For example, if sex is coded “M” and “F”, we need to create a male dummy and a female dummy

    generate male=0
    replace male=1 if sex=="M"
    generate female=1-male
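One caveat with the generate/replace pattern: a patient whose sex is blank ends up with male=0 and female=1. A possible variant (used instead of, not after, the commands above) leaves both dummies missing in that case. It assumes sex is a string variable and that a blank string means "not recorded".

```stata
* Dummies that stay missing when sex is not recorded
generate male = (sex == "M") if sex != ""
generate female = 1 - male
* Check the new coding against the original variable
tabulate sex male, missing
```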

Example Data


Binary Data

•  Binary data should be coded as “dummy variables”
•  Zeros and ones ONLY

Dummy Variables

•  We mentioned in the last lecture that we always use 0 and 1 to represent binary variables
•  These are called “dummy variables”
   –  Sometimes “binary indicators” for formality
•  Use the 1 to indicate the presence of the variable name
•  For example
   –  A “male” dummy variable would equal 1 for men and 0 for women
   –  A “died” dummy variable would equal 1 for patients who died and 0 for patients who did not

Dummy Variables

•  For example, if sex is coded “Male” and “Female”, we need to create a male dummy and a female dummy

    generate male=0
    replace male=1 if sex=="Male"
    generate female=1-male


Create New Variables

•  Notice that Stata differentiates between:
   –  Equals as assignment (male = 1)
   –  Equals as logical (if sex == "Male")
•  This is the most common error you will make (besides misspelling)
   –  Get used to looking for this

Categorical Variables

•  Categorical data should be coded as dummy variables

•  One dummy variable for each category
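When a variable has many categories, tabulate can create the whole set of dummies in one step. In this sketch, race is a hypothetical categorical variable and race_ is an arbitrary prefix for the new variable names.

```stata
* One dummy per category: race_1, race_2, ...
tabulate race, generate(race_)
* Show which dummy corresponds to which category
describe race_*
```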

Continuous Variables

•  Sometimes we make categorical variables out of continuous variables

•  Select cutpoints based on quartiles, then create 4 categories
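A quartile split can be sketched with xtile, which picks the cutpoints from the data. The variable age is from this lecture's data set; agequart and the ageq prefix are made-up names.

```stata
* Assign each patient to an age quartile (1 to 4)
xtile agequart = age, nquantiles(4)
* Turn the quartile variable into four dummies: ageq1 ... ageq4
tabulate agequart, generate(ageq)
```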


Create New Variables

•  Example: Turn age from a continuous variable into four categories: 0-39, 40-49, 50-59, 60+

    generate age039=0
    replace age039=1 if age < 40
    generate age4049=0
    replace age4049=1 if age >= 40 & age < 50
    generate age5059=0
    replace age5059=1 if age >= 50 & age < 60
    generate age60=0
    * Stata treats missing (.) as larger than any number, so exclude it here
    replace age60=1 if age >= 60 & age < .
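A quick way to check the coding is to confirm that every patient with a known age falls into exactly one category. This sketch assumes the four dummies above have been created; agesum is a throwaway name.

```stata
* Each patient with a non-missing age should have exactly one dummy equal to 1
generate agesum = age039 + age4049 + age5059 + age60
tabulate agesum, missing
* assert stops the do-file with an error if the condition is ever false
assert agesum == 1 if age < .
drop agesum
```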


Dealing with Missing Values

•  Most Stata procedures cannot be performed on observations with missing values
•  Missing numeric values are stored as a dot (“.”)
•  Can refer to missing values in code by referring to the dot
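Before deciding how to handle missing values, it helps to count them. The sketch below assumes the age variable from this lecture's data set. It also illustrates a common trap: Stata treats a numeric missing value as larger than any number.

```stata
* How many observations have a missing age?
count if missing(age)
* Missing-value counts for every variable that has any
misstable summarize
* Caution: this also counts patients whose age is missing
count if age >= 60
* This counts only patients known to be 60 or older
count if age >= 60 & age < .
```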


Dealing with Missing Values

•  There are two options for dealing with missing values
   1.  Drop the observation altogether
   2.  Create a category for missing values
•  Use option (1) if only a small proportion of observations are missing
•  Use option (2) if a relatively large proportion of observations are missing


Dropping Observations

•  Use the drop command to delete an observation with a missing value
•  For example, to drop patients where male is missing
   –  drop if male == .

Create a Missing Category

•  To create a missing category, generate a new category
•  For example, if age has a missing value:

    generate age039=0
    replace age039=1 if age < 40
    generate age4049=0
    replace age4049=1 if age >= 40 & age < 50
    generate age5059=0
    replace age5059=1 if age >= 50 & age < 60
    generate age60=0
    * Missing (.) counts as larger than any number, so exclude it here
    replace age60=1 if age >= 60 & age < .
    generate age_missing=0
    replace age_missing=1 if age == .

Create a Missing Category


Errors with generate

•  Stata will not let you overwrite a data set or variable unintentionally
•  If you need to load a new data set, or reload your data set from text, run the clear command
•  If you make a mistake in your code after you generate a new variable
   –  Drop that variable
   –  Then run your generate command again
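In a do-file that gets re-run many times, one common pattern is to drop the variable just before generating it. The capture prefix suppresses the error that drop would otherwise give the first time, when the variable does not exist yet. The variable names follow the earlier example.

```stata
* Safe to re-run: remove any earlier version of male, then rebuild it
capture drop male
generate male = 0
replace male = 1 if sex == "M"
```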


Creating Subsets

•  Stata holds a single data set in RAM at a time
•  To create a subset of observations
   –  All men, all adults, patients with complete data, etc.
•  Use drop if command to drop all other observations
•  Use keep if command to keep only observations of interest

Creating Subsets

•  To look only at male patients, the following commands produce equivalent results (provided male has no missing values):

    keep if male==1

    drop if male==0
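Because keep and drop change the data in memory, a temporary subset can be sketched with preserve and restore. Run all four lines together from a do-file; the variables are from this lecture's data set.

```stata
* Take a snapshot of the full data set
preserve
keep if male == 1
summarize age
* Bring back the full data set
restore
```

For a single command, an if restriction (summarize age if male == 1) gives the same numbers without dropping anything.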


Dropping Variables

•  To remove unwanted variables, use the drop command (without the if)

•  For example, if we wanted to remove the sex (M or F) variable after creating male and female dummy variables, use

drop sex


Saving Data Sets

•  After you have completed all data manipulations
•  Save the data set as a native Stata data set
•  Command is save

    save ltd, replace

•  Must use the replace option if you save more than once (which is always!)

Loading Saved Data Sets

•  To load a Stata data set that you have previously saved, the command is use

clear

use ltd

•  Always start with clear

–  Stata will not let you overwrite data unintentionally
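The two steps can also be written as one line, since use accepts clear as an option:

```stata
* Same effect as running clear and then use ltd
use ltd, clear
```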


Summarizing Data: Tables

•  Use tabulate to produce a simple summary of counts of elements in the variable
   –  For example, tabulate female

     female |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        419       53.93       53.93
          1 |        358       46.07      100.00
------------+-----------------------------------
      Total |        777      100.00

Summarizing Data: Tables

•  Can also get a cross-tabulation by listing two variables: tabulate female ssi, row col

           |          ssi
    female |         0          1 |     Total
-----------+----------------------+----------
         0 |       246        173 |       419
           |     58.71      41.29 |    100.00
           |     50.72      59.25 |     53.93
-----------+----------------------+----------
         1 |       239        119 |       358
           |     66.76      33.24 |    100.00
           |     49.28      40.75 |     46.07
-----------+----------------------+----------
     Total |       485        292 |       777
           |     62.42      37.58 |    100.00
           |    100.00     100.00 |    100.00

Summarizing Data: Tables

•  To obtain a table of basic summary statistics, use the command summarize
•  For example
   –  summarize age male female black nonblack

    Variable |       Obs        Mean    Std. Dev.        Min        Max
-------------+--------------------------------------------------------
         age |       777    44.91645    17.29986   .2575342   77.53151
        male |       777    .5392535    .4987778          0          1
      female |       777    .4607465    .4987778          0          1
       black |       777     .043758    .2046881          0          1
    nonblack |       777     .956242    .2046881          0          1
-------------+--------------------------------------------------------
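For a Table 1 stratified by a key outcome, the same statistics are needed within each group. One possible sketch uses tabstat with ssi as the stratifying variable; the choice of variables and statistics here is only illustrative.

```stata
* N, mean, and SD of each variable, separately for ssi = 0 and ssi = 1
tabstat age male black, by(ssi) statistics(n mean sd)
```

The mean of a 0/1 dummy is the proportion of patients with that characteristic, so the mean column gives the percentages for Table 1 directly.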


Exporting Summaries

•  Stata output is generally not aesthetically pleasing enough to place directly into papers

•  Best to move summary data into Excel for formatting, then copy to paper

•  To move summary data, highlight desired table, right click, and select Copy Table

•  This will paste into Excel cells


Analyst Pro Tip!!

•  To create tables for publications, copy the raw data from Stata into Excel, but DO NOT FORMAT IT
•  Instead, create a formatted table next to the raw table and use formulas to create a clean, publication-quality table
•  This tip will save you mounds of time
   –  It will probably get you tenure

Copy Table in Stata


Statistical Tests

•  Suppose you want to know whether patients with SSI were older than those without an SSI

•  To perform a t test: ttest depvar, by(indepvar)
   –  Example: ttest age, by(ssi)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |     485    45.82019    .7510179    16.53945    44.34453    47.29585
       1 |     292    43.41537    1.078256    18.42524     41.2932    45.53754
---------+--------------------------------------------------------------------
combined |     777    44.91645    .6206292    17.29986    43.69814    46.13476
---------+--------------------------------------------------------------------
    diff |            2.404821    1.279332               -.1065454    4.916186
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =   1.8797
Ho: diff = 0                                     degrees of freedom =      775

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9697         Pr(|T| > |t|) = 0.0605          Pr(T > t) = 0.0303
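The output above assumes equal variances in the two groups, and the two-sided p-value is the middle one, Pr(|T| > |t|) = 0.0605. If that assumption is in doubt, a possible follow-up is sketched below using the same variables.

```stata
* Compare the two group variances
sdtest age, by(ssi)
* t test that does not assume equal variances
ttest age, by(ssi) unequal
```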

Statistical Tests

•  Suppose you want to know whether patients with SSI have a higher mortality rate
•  To perform a chi-square test:

    tabulate depvar indepvar, row col chi2

    Example: tabulate died ssi, row col chi2

+-------------------+
| Key               |
|-------------------|
|     frequency     |
|  row percentage   |
| column percentage |
+-------------------+

           |          ssi
      died |         0          1 |     Total
-----------+----------------------+----------
         0 |       407        227 |       634
           |     64.20      35.80 |    100.00
           |     83.92      77.74 |     81.60
-----------+----------------------+----------
         1 |        78         65 |       143
           |     54.55      45.45 |    100.00
           |     16.08      22.26 |     18.40
-----------+----------------------+----------
     Total |       485        292 |       777
           |     62.42      37.58 |    100.00
           |    100.00     100.00 |    100.00

          Pearson chi2(1) =   4.6322   Pr = 0.031
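When some cells of the table have small counts, Fisher's exact test is a common alternative to the chi-square test. It is not needed for the table above, where every cell is large, but it can be requested on the same command:

```stata
* Adds Fisher's exact test to the chi-square output
tabulate died ssi, row col chi2 exact
```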

Statistical Tests

•  Assume you want to know whether length of stay (LOS) differs across patients according to how well their organs were matched
•  To perform an ANOVA: anova depvar indepvar

    Example: anova los abmm

                  Number of obs =     777     R-squared     = 0.0168
                  Root MSE      = 29.2585     Adj R-squared = 0.0118

    Source |  Partial SS    df       MS           F     Prob > F
-----------+----------------------------------------------------
     Model |   11325.344     4  2831.33599       3.31     0.0106
           |
      abmm |   11325.344     4  2831.33599       3.31     0.0106
           |
  Residual |  660879.608   772  856.061669
-----------+----------------------------------------------------
     Total |  672204.952   776  866.243495
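The ANOVA table reports only the overall F test. To see the group means behind it, and which groups differ, one option is the oneway command, sketched here with the same variables:

```stata
* One-way ANOVA with a table of group means and Bonferroni pairwise comparisons
oneway los abmm, tabulate bonferroni
```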


Stata Graphics

•  Stata has excellent facilities for graphics
•  The overall look and feel of a Stata graph is determined by a “scheme”
   –  Schemes are predefined graphics parameters that determine all aspects of the graph
•  To see what schemes are available: graph query, schemes
•  To set a scheme: set scheme economist

Histogram

•  To obtain a basic histogram of varname, type:

    histogram varname

•  For example, a histogram of age

Schemes Again

•  Change scheme to Economist: set scheme economist


Schemes

•  sj scheme


Graphics Options

•  There are options for
   –  Adding a title (title)
   –  Altering the scale of the axes (xscale, yscale)
   –  Specifying what axis titles to use (xtitle, ytitle)
   –  Changing the markers used (msymbol)
•  For example, to finish our histogram

    histogram age, title("Distribution of Age") xtitle("Age at Transplant") ytitle("Density")
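A variation on the same command is sketched below: 5-year bins and counts instead of densities on the vertical axis. The bin width and the y-axis title are arbitrary choices for illustration. The /// at the end of a line continues the command on the next line and only works in a do-file.

```stata
* Histogram of age with 5-year bins, showing counts rather than density
histogram age, width(5) frequency ///
    title("Distribution of Age") ///
    xtitle("Age at Transplant") ytitle("Number of Patients")
```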


Final Histogram


Scatterplot

•  To display a scatterplot of two (or more) variables, type:

    scatter costs los

•  The graphics options (title, xtitle, ytitle, msymbol) apply here as well
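As a sketch, a fitted regression line can be overlaid on the scatterplot with twoway. The titles are placeholders, since the units of costs and los are not stated in these slides. Run from a do-file because of the /// line continuation.

```stata
* Scatterplot of costs against length of stay with a linear fit overlaid
twoway (scatter costs los) (lfit costs los), ///
    title("Costs by Length of Stay") ///
    xtitle("Length of stay") ytitle("Costs")
```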


Exporting Graphs

•  To export your graph use
   –  graph export filename.ext, as(type)
•  Use pdf or eps for your graphs:
   –  graph export graph1.pdf, as(pdf)
   –  graph export graph1.eps, as(eps)
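In a do-file that is run repeatedly, graph export gives an error if the file already exists unless replace is added. A short sketch, with graph1 as an arbitrary file name:

```stata
* Draw the graph, then export it, overwriting any earlier copy
histogram age
graph export graph1.pdf, as(pdf) replace
* Also keep an editable copy in Stata's own .gph format
graph save graph1, replace
```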


Homework

•  Get the liver transplant data set from the website
•  Reproduce the table on the next slide
   –  Generate the numbers using the summarize command
   –  Move the data into Excel
   –  Recreate table in Excel
   –  Perform statistical tests and insert p-values
•  Reproduce the graph on the following slide
