lecture2-stata - Pennsylvania State University


7/7/16


Lecture 2: Programming Statistics in Stata

Christopher S. Hollenbeak, PhD
Jane R. Schubart, PhD

The Outcomes Research Toolbox

Review

•  Questions from last lecture?
    –  What is probability?
        •  What is a probability distribution?
    –  Types of data: continuous, categorical, binary
        •  Examples of each
    –  What is a dependent variable? Independent variable?
    –  What is the appropriate statistical test for…
        •  Comparing two means? Three means? Two proportions? Three proportions?
    –  What does a p-value tell you?

Objectives

•  Introduce you to Stata software and get you started summarizing your data
•  Give you basic code to start working with data
•  Give you code to create new variables
•  Use Stata to compute descriptive statistics
•  Use Stata to perform univariate statistical tests (t test, chi-square test, ANOVA)
•  Use Stata to create basic graphs

Overview

•  Stata is software for performing data analysis
•  Stata interface
•  Common tasks in Stata
    –  Import data
    –  Create new variables
    –  Summarize data (means, standard deviations, histograms)
    –  Perform statistical tests
    –  Graphics

Objective

•  Most studies start with “Table 1”, which includes
    –  Summary statistics for the sample
        •  Sample size
        •  Demographics
        •  Disease characteristics
        •  Treatment characteristics
    –  Stratification by key outcome
    –  Statistical tests comparing characteristics by stratification
•  Objective is to use Stata to create Table 1

Stata Interface

[Screenshot of the Stata interface; labeled panes: Commands, Variables, Command History, Results]

Stata Interface

•  There are two ways to interact with Stata
    –  Issue commands directly in the command window
    –  Write commands in a text file (called a “do” file in Stata parlance) and send commands to the results window
•  Always use “do” files
    –  Creates permanent record of your work
    –  Can easily re-use large chunks of code
•  Occasionally use the command window

Stata Workflow

•  Import data into Stata
•  Create new variables for analysis
•  Deal with missing values
•  Perform analyses
    –  Tables, graphs
•  Move tabular results into Excel for formatting
•  Save graphs as graphic files
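The workflow above might be laid out in a single do file along these lines. This is only a sketch: the do-file name is hypothetical, and it assumes the raw file contains variables named sex and age, as in the examples used elsewhere in this lecture.

```stata
* ltd_analysis.do  (hypothetical do-file name)
clear                          // start from an empty data set
cd "~/projects/ltd/"           // folder that holds the raw data
insheet using "ltd_data.csv"   // import the raw .csv file

generate male = 0              // create new variables for analysis
replace male = 1 if sex == "M"

drop if age == .               // deal with missing values

summarize age male             // perform analyses
histogram age

save ltd, replace              // save the working data set
```

The // comments work inside a do file but not when a line is typed into the command window.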


Stata Commands

•  Stata syntax is usually a command, followed by the variable names it applies to, then any restrictions on observations, and then a comma followed by other options

command varlist if var==x, options

•  To execute a command, highlight the entire line of code and press:
    –  Windows: Ctrl+D
    –  Mac: Shift+Cmd+D
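As an illustration of the general pattern, the two lines below fill in each slot of the syntax. They assume the variables age, male, female, and ssi used elsewhere in this lecture already exist in the data set.

```stata
* command   varlist     if restriction,           options
summarize   age         if male == 1,             detail
tabulate    female ssi  if age >= 40 & age < .,   row col
```

The extra spacing is only there to line up the slots; Stata ignores repeated spaces between the parts of a command.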


Importing Data

•  Stata can handle several types of raw data files
•  Comma separated value (.csv) text files seem to work best
•  Can save Excel files as .csv files
•  Start by pointing Stata to the folder where your data file is stored using cd command (cd means change directory)

Importing Data

•  Command for importing data is: insheet

Mac/Unix: cd "~/projects/ltd/"
Windows: cd "c:\projects\ltd\"

insheet using "ltd_data.csv"
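In Stata 13 and later, insheet still runs but has been superseded by import delimited. A minimal sketch of the equivalent import, assuming the first row of ltd_data.csv holds the variable names:

```stata
clear
import delimited using "ltd_data.csv", varnames(1)
describe    // check that the variables came in with the expected names and types
```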


Create New Variables

•  Command to create new variables is generate
•  Use in conjunction with replace
    –  generate creates the new variable, here set to 0 for every observation
    –  replace then sets the 1s
•  Use this to create binary dummy variables
•  For example, if sex is coded "M" and "F", we need to create a male dummy and a female dummy

generate male=0
replace male=1 if sex=="M"

generate female=1-male

Example Data


Binary Data

•  Binary data should be coded as “dummy variables”
•  Zeros and ones ONLY

Dummy Variables

•  We mentioned in the last lecture that we always use 0 and 1 to represent binary variables
•  These are called “dummy variables”
    –  Sometimes “binary indicators” for formality
•  Use the 1 to indicate the presence of the variable name
•  For example
    –  A “male” dummy variable would equal 1 for men and 0 for women
    –  A “died” dummy variable would equal 1 for patients who died and 0 for patients who did not

Dummy Variables

•  For example, if sex is coded "Male" and "Female", we need to create a male dummy and a female dummy

generate male=0
replace male=1 if sex=="Male"

generate female=1-male
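One thing to watch with the pattern above is that a patient whose sex is blank ends up with male=0 and female=1. The sketch below is an alternative to those three lines, not an addition to them; it assumes sex is a string variable coded "Male"/"Female" and leaves both dummies missing when sex is blank.

```stata
* Logical expressions evaluate to 1 (true) or 0 (false)
generate male = (sex == "Male") if !missing(sex)
generate female = 1 - male

* Cross-check the new dummy against the original coding
tabulate sex male, missing
```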


Create New Variables

•  Notice that Stata differentiates between:
    –  Equals as assignment (male = 1)
    –  Equals as logical (if sex == "Male")
•  This is the most common error you will make (besides misspelling)
    –  Get used to looking for this

Categorical Variables

•  Categorical data should be coded as dummy variables

•  One dummy variable for each category
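When a categorical variable is already stored as a single variable, tabulate can build one dummy per category in one step. A sketch, assuming the abmm variable used later in this lecture; the stub abmm_ is a hypothetical name:

```stata
* Creates abmm_1, abmm_2, ... with one 0/1 dummy per category of abmm
tabulate abmm, generate(abmm_)
summarize abmm_*
```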

Continuous Variables

•  Sometimes we make categorical variables out of continuous variables

•  Select cutpoints based on quartiles, then create 4 categories
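If the cutpoints are meant to be the actual sample quartiles, xtile can compute them. A sketch using the age variable; age_q is a hypothetical variable name:

```stata
* age_q takes the values 1-4, one per quartile of age
xtile age_q = age, nquantiles(4)

* One dummy per quartile: age_q1, age_q2, age_q3, age_q4
tabulate age_q, generate(age_q)

* Show the age range that falls in each quartile
tabstat age, by(age_q) statistics(min max n)
```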


Create New Variables

•  Example: Turn age from a continuous variable into four categories: 0-39, 40-49, 50-59, 60+

generate age039=0
replace age039=1 if age < 40

generate age4049=0
replace age4049=1 if age >= 40 & age < 50

generate age5059=0
replace age5059=1 if age >= 50 & age < 60

* Stata treats a missing age (.) as larger than any number, so exclude it here
generate age60=0
replace age60=1 if age >= 60 & age < .

Dealing with Missing Values

•  Most Stata procedures leave out observations that have missing values in the variables being analyzed
•  Missing numeric values are stored as a dot (“.”)
•  Can refer to missing values in code by referring to the dot
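Before deciding how to handle missing values, it helps to count them. A short sketch using the age variable:

```stata
count if age == .       // number of observations with age missing
count if missing(age)   // same count when "." is the only missing code in use
misstable summarize     // report on every variable that has missing values
```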


Dealing with Missing Values

•  There are two options for dealing with missing values
    1.  Drop the observation altogether
    2.  Create a category for missing values
•  Use option (1) if only a small proportion of observations are missing
•  Use option (2) if a relatively large proportion of observations are missing

Dropping Observations

•  Use the drop command to delete observations with a missing value
•  For example, to drop patients where male is missing
    –  drop if male == .

Create a Missing Category

•  To create a missing category, generate a new category
•  For example, if age has a missing value:

generate age039=0
replace age039=1 if age < 40

generate age4049=0
replace age4049=1 if age >= 40 & age < 50

generate age5059=0
replace age5059=1 if age >= 50 & age < 60

* A missing age (.) compares as larger than any number, so exclude it from 60+
generate age60=0
replace age60=1 if age >= 60 & age < .

* Start at 0 so that age_missing is a 0/1 dummy like the others
generate age_missing=0
replace age_missing=1 if age == .


Errors with generate

•  Stata will not let you overwrite a data set or variable unintentionally

•  If you need to load a new data set, or reload your data set from text, run the clear command

•  If you make a mistake in your code after you generate a new variable
    –  Drop that variable
    –  Then run your generate command again
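The drop-and-regenerate cycle might look like this, assuming sex is coded "M" and "F". Running generate male a second time without the drop would stop with an "already defined" error.

```stata
generate male = 1 if sex == "M"   // mistake: women are left missing, not 0
drop male                         // remove the faulty variable
generate male = 0                 // then generate it again, correctly
replace male = 1 if sex == "M"
```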


Creating Subsets

•  Stata holds a single data set in RAM at a time
•  To create a subset of observations
    –  All men, all adults, patients with complete data, etc.
•  Use drop if command to drop all other observations
•  Use keep if command to keep only observations of interest

Creating Subsets

•  To look only at male patients the following commands produce equivalent results:

keep if male==1

drop if male==0
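Because keep and drop change the data set in memory, one option when the subset is only needed briefly is to wrap it in preserve and restore. A sketch, assuming the male and age variables:

```stata
preserve             // take a temporary copy of the full data set
keep if male == 1    // work with men only
summarize age
restore              // bring back all observations
```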


Dropping Variables

•  To remove unwanted variables, use the drop command (without the if)

•  For example, if we wanted to remove the sex (M or F) variable after creating male and female dummy variables, use

drop sex


Saving Data Sets

•  After you have completed all data manipulations, save the data set as a native Stata data set
•  Command is save

save ltd, replace

•  Must use the replace option if you save more than once (which is always!)

Loading Saved Data Sets

•  To load a Stata data set that you have previously saved, the command is use

clear

use ltd

•  Always start with clear

–  Stata will not let you overwrite data unintentionally
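The two steps can also be combined by passing clear as an option to use:

```stata
use ltd, clear    // same effect as running clear and then use ltd
describe          // confirm the variables loaded as expected
```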


Summarizing Data: Tables

•  Use tabulate to produce a simple summary of counts of elements in the variable
    –  For example, tabulate female

     female |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        419       53.93       53.93
          1 |        358       46.07      100.00
------------+-----------------------------------
      Total |        777      100.00

Summarizing Data: Tables

•  Can also get a cross-tabulation by listing two variables: tabulate female ssi, row col

           |          ssi
    female |         0          1 |     Total
-----------+----------------------+----------
         0 |       246        173 |       419
           |     58.71      41.29 |    100.00
           |     50.72      59.25 |     53.93
-----------+----------------------+----------
         1 |       239        119 |       358
           |     66.76      33.24 |    100.00
           |     49.28      40.75 |     46.07
-----------+----------------------+----------
     Total |       485        292 |       777
           |     62.42      37.58 |    100.00
           |    100.00     100.00 |    100.00

Summarizing Data: Tables

•  To obtain a table of basic summary statistics, use the command summarize
•  For example
    –  summarize age female male black nonblack

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         age |       777    44.91645    17.29986   .2575342   77.53151
        male |       777    .5392535    .4987778          0          1
      female |       777    .4607465    .4987778          0          1
       black |       777     .043758    .2046881          0          1
    nonblack |       777     .956242    .2046881          0          1
-------------+--------------------------------------------------------

Exporting Summaries

•  Stata output is generally not aesthetically pleasing enough to place directly into papers

•  Best to move summary data into Excel for formatting, then copy to paper

•  To move summary data, highlight desired table, right click, and select Copy Table

•  This will paste into Excel cells


Analyst Pro Tip!!

•  To create tables for publications, copy the raw data from Stata into Excel, but DO NOT FORMAT IT
•  Instead, create a formatted table next to the raw table and use formulas to create a clean, publication quality table
•  This tip will save you mounds of time
    –  It will probably get you tenure

Copy Table in Stata


Statistical Tests

•  Suppose you want to know whether patients with SSI were older than those without an SSI

•  To perform a t test: ttest depvar, by(indepvar)
    –  Example: ttest age, by(ssi)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |     485    45.82019    .7510179    16.53945    44.34453    47.29585
       1 |     292    43.41537    1.078256    18.42524     41.2932    45.53754
---------+--------------------------------------------------------------------
combined |     777    44.91645    .6206292    17.29986    43.69814    46.13476
---------+--------------------------------------------------------------------
    diff |            2.404821    1.279332               -.1065454    4.916186
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =   1.8797
Ho: diff = 0                                     degrees of freedom =      775

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9697         Pr(|T| > |t|) = 0.0605          Pr(T > t) = 0.0303
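The output above assumes the two groups have equal variances. If that assumption looks doubtful, one possible check and alternative, using the same age and ssi variables, is:

```stata
sdtest age, by(ssi)           // compare the two group variances
ttest age, by(ssi) unequal    // t test without the equal-variance assumption
```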

Statistical Tests

•  Suppose you want to know whether patients with SSI have a higher mortality rate
•  To perform a chi-square test:

tabulate depvar indepvar, row col chi2

Example: tabulate died ssi, row col chi2

+-------------------+
| Key               |
|-------------------|
|     frequency     |
|  row percentage   |
| column percentage |
+-------------------+

           |          ssi
      died |         0          1 |     Total
-----------+----------------------+----------
         0 |       407        227 |       634
           |     64.20      35.80 |    100.00
           |     83.92      77.74 |     81.60
-----------+----------------------+----------
         1 |        78         65 |       143
           |     54.55      45.45 |    100.00
           |     16.08      22.26 |     18.40
-----------+----------------------+----------
     Total |       485        292 |       777
           |     62.42      37.58 |    100.00
           |    100.00     100.00 |    100.00

          Pearson chi2(1) =   4.6322   Pr = 0.031
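When some cells of the table are small, the chi-square approximation can be unreliable. A sketch of two related options on the same command, using the died and ssi variables:

```stata
tabulate died ssi, chi2 expected   // add the expected count to each cell
tabulate died ssi, col chi2 exact  // add Fisher's exact test to the chi-square test
```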

Statistical Tests

•  Assume you want to know whether LOS differs across patients with better matched organs
•  To perform an ANOVA: anova depvar indepvar

Example: anova los abmm

                  Number of obs =     777     R-squared     =  0.0168
                  Root MSE      = 29.2585     Adj R-squared =  0.0118

    Source |  Partial SS    df       MS           F     Prob > F
-----------+----------------------------------------------------
     Model |   11325.344     4  2831.33599       3.31     0.0106
           |
      abmm |   11325.344     4  2831.33599       3.31     0.0106
           |
  Residual |  660879.608   772  856.061669
-----------+----------------------------------------------------
     Total |  672204.952   776  866.243495
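The anova output reports the F test but not the group means. For a one-way comparison like this one, the oneway command is an alternative that can print both, assuming the same los and abmm variables:

```stata
* One-way ANOVA of los across abmm, with a table of group means and SDs
oneway los abmm, tabulate
```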


Stata Graphics

•  Stata has excellent facilities for graphics
•  The overall look and feel of a Stata graph is determined by a “scheme”
    –  Schemes are predefined graphics parameters that determine all aspects of the graph
•  To see what schemes are available: graph query, schemes
•  To set a scheme: set scheme economist

Histogram

•  To obtain a basic histogram of varname, type: histogram varname

•  For example, a histogram of age


Schemes Again

•  Change scheme to Economist: set scheme economist


Schemes

•  sj scheme


Graphics Options

•  There are options for
    –  Adding a title (title)
    –  Altering the scale of the axes (xscale, yscale)
    –  Specifying what axis titles to use (xtitle, ytitle)
    –  Changing the markers used (msymbol)
•  For example, to finish our histogram

histogram age, title("Distribution of Age") xtitle("Age at Transplant") ytitle("Density")

Final Histogram


Scatterplot

•  To display a scatterplot of two (or more) variables, type: scatter costs los

•  Other options apply
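The same title and marker options carry over to scatterplots. A sketch using the costs and los variables; the title and axis text are placeholders, not taken from the original slides:

```stata
* msymbol(Oh) draws large hollow circles; /// continues a command in a do file
scatter costs los, msymbol(Oh) ///
    title("Costs by Length of Stay") ///
    xtitle("Length of stay") ytitle("Costs")
```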


Exporting Graphs

•  To export your graph use
    –  graph export filename.ext, as(type)
•  Use pdf or eps for your graphs:
    –  graph export graph1.pdf, as(pdf)
    –  graph export graph1.eps, as(eps)
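graph export writes whichever graph was drawn most recently, and it refuses to overwrite an existing file unless told to. A sketch with a hypothetical file name, using png as one more supported type:

```stata
histogram age
graph export "age_hist.png", as(png) replace   // replace overwrites an existing file
```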


Homework

•  Get the liver transplant data set from the website
•  Reproduce the table on the next slide
    –  Generate the numbers using the summarize command
    –  Move the data into Excel
    –  Recreate table in Excel
    –  Perform statistical tests and insert p-values
•  Reproduce the graph on the following slide
