Computing for Research I Spring 2013 Primary Instructor: Elizabeth Garrett-Mayer Introduction to Stata February 7


Page 1

Computing for Research I
Spring 2013

Primary Instructor: Elizabeth Garrett-Mayer

Introduction to Stata
February 7

Page 2

Stata

• Stata is a powerful statistical package with
  – smart data-management facilities
  – a wide array of up-to-date statistical techniques
  – an excellent system for producing publication-quality graphs
• Stata is fast and easy to use
• Current version is Stata 12
• Stata vs. Stata SE
  – “standard” Stata can handle up to 2047 variables
  – SE can handle 32767 variables
  – Number of observations is limited by your computer (up to 2 billion!)

Page 3

Stata Interface

• Multiple windows
  – Results
  – Review
  – Variables
  – Command
• Other windows
  – Data editor
  – Data viewer
  – Log
  – ‘do’
  – Graph

Page 4

Stata Interface

• Customizable windows
• Can be resized
• Edits to preferences are ‘remembered’
• You can save (then load) different preferences
• Command-line driven
• But more recently, drop-down menus

Page 5

Important Details

• Stata is case sensitive!
• Return means ‘run’. There is no little running man to click.
• You cannot run commands if your data editor is open
• You need to ‘clear’ data before you bring in more data
• You can only have one dataset active at a time
• Save yourself some typing (and errors)
  – Utilize the Variables window
  – Utilize the Review window
• Abbreviations work for commands and variable names!
  – d instead of describe
  – case instead of caseid
  – NOT always, but if they uniquely identify a variable name or command, they should work
  – Also true for some options
  – See the Stata help files for how short you can go on abbreviations
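
The abbreviation rules above can be tried out with a short sketch. It assumes the auto example dataset that ships with Stata (loaded with sysuse), which is not part of the class data; the variable names price, trunk and turn come from that dataset.

```stata
* Load the example dataset shipped with Stata (an assumption for illustration)
sysuse auto, clear

* "d" is the short form of "describe"
d
describe

* "pr" uniquely identifies the variable price, so these two are equivalent
summarize price
sum pr

* "t" matches both trunk and turn, so Stata reports an ambiguous abbreviation;
* capture noisily shows the error without stopping a do file
capture noisily sum t
```

If an abbreviation stops being unique after you add variables, old commands that relied on it will start to fail, so full names are the safer choice in do files you plan to keep.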

Page 6

Help!!

• The most important part
• Two interactive options:
  – help ‘command’
  – help ‘search’
• Also LARGE pdfs that link from help files
• Plus:
  – advice
  – link to Stata
  – command line help
  – findit

Page 7

No data?

• There are lots of things you can do without data in Stata!
• “Immediate” commands
  – An immediate command is a command that obtains data not from the data stored in memory but from numbers typed as arguments.
  – Immediate commands, in effect, turn Stata into a glorified hand calculator.

Page 8

Some immediate commands

bitesti                Binomial probability test
cci, csi, iri, mcci    Tables for epidemiologists; see [ST] epitab
cii                    Confidence intervals for means, proportions, counts
prtesti                One- and two-sample tests of proportions
sampsi                 Sample size and power determination
sdtesti                Variance comparison tests
symmi                  Symmetry and marginal homogeneity tests
tabi                   One- and two-way tables of frequencies
ttesti                 Mean comparison tests
display                Displays simple calculations

See ‘help immediate’ for more information

Page 9

Some examples

display 4.1-1.96*0.3

tabi 100 34 \ 17 294
tabi 100 34 \ 17 294, col
tabi 100 34 \ 17 294, col row cell chi

cci 100 34 17 294
cci 100 34 17 294, exact

sampsi 0.2 0.5

Page 10

Some examples

bitesti 100 40 0.40

ttesti 100 4 4 100 6 7
ttesti 100 4 4 100 6 7, uneq
ttesti 100 4 8 7
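
The numbers typed after an immediate command are positional, so it helps to see what each one stands for. The sketch below repeats the same commands with comments on the argument order; the comments follow the syntax shown in the Stata help files for bitesti and ttesti.

```stata
* bitesti #N #succ #p: 40 successes in 100 trials, tested against p = 0.40
bitesti 100 40 0.40

* Two-sample ttesti: #obs1 #mean1 #sd1 #obs2 #mean2 #sd2
ttesti 100 4 4 100 6 7

* Same comparison without assuming equal variances
ttesti 100 4 4 100 6 7, unequal

* One-sample ttesti: #obs #mean #sd #val
* (100 observations, mean 4, SD 8, tested against a hypothesized mean of 7)
ttesti 100 4 8 7
```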

Page 11

But most of the time, we have datasets

• *.dta files are Stata datasets
• To open:
  – Option 1: use the “use” command:
    • use "I:\MUSC Oncology\Cunningham, Joan\June2007\SCbcdata.dta"
  – Option 2: menu-driven open
    • File > Open…
• If you use Option 2, the associated command will appear in your Results window AND in your Review window
• If you use Option 2, consider copying and pasting the command into your ‘do’ file for next time

Page 12

Other types of data?

• Stata can import
  – ASCII files
  – SAS export files
  – and a few others (that I have never heard of)
• Two options:
  – menu-driven: File > Import…
  – insheet command can be used for ASCII files
    • insheet using sampledata.csv, comma
    • insheet using sampledata.csv, tab
  – for insheet, you can use any separator (use the delimiter("char") option)

Page 13

Two notes on opening files

• If you use the command line, you will have to either add clear at the end of the line to clear the current dataset, or type clear as a command prior to opening the new dataset
  – insheet using sampledata.csv, comma clear
  OR
  – clear
  – insheet using sampledata.csv, comma
• You can use the cd command to tell Stata where to look for your file(s), instead of giving long path names. This is particularly helpful if you are merging files from the same directory
  – cd "I:\Classes\StatComputingI"

Page 14

Example: SC breast cancer registry data from 2004

• All diagnoses of breast cancer in SC are recorded
• Small version for class: N = 2633; 55 variables
• Demographic and clinical information recorded
• Let’s read it in and explore it
  – use cd
  – use insheet
  – use ‘use’

Page 15

Exploring your dataset

• describe (can be abbreviated ‘d’)
  – a very good idea, to make sure things look right
  – tells you about types of variables, number of observations and number of variables
• codebook
  – summary per variable
  – useful for seeing the number of unique and missing values
• sum
  – statistical summary (N, mean, SD, etc.)
  – only works on numerically coded variables
  – sum, detail
• inspect
  – similar to codebook
  – provides a rough histogram and counts of negative, positive and missing values
• Note:
  – all of these can be used with or without a varlist (e.g. sum race age)
  – to ‘quit’ a long command, type ‘q’ and it will stop sending output to the Results window
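
For readers without the class data, the four exploration commands can be compared side by side on a built-in dataset. This is only a sketch: it assumes the auto dataset shipped with Stata, and the variables mpg and rep78 belong to that dataset rather than to the registry data.

```stata
* Built-in example data, used here only for illustration
sysuse auto, clear

* Overview of variable names, storage types and labels
describe

* Per-variable summary, including unique values and missing counts
codebook rep78

* N, mean, SD, min and max; detail adds percentiles, skewness and kurtosis
summarize mpg
summarize mpg, detail

* Rough histogram plus counts of negative, zero, positive and missing values
inspect rep78
```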

Page 16

Exploring your dataset

• Open dataset in editor or browser
• Difference? Edit capabilities
• Allows you to sort
• Variables manager (can access from viewer or main toolbar)
  – allows you to add labels simply
  – includes coding

Page 17

Exploring

• Categorical variables can be summarized using tabulate (tab) or table
  – tab race
  – table race
• list can help with a small dataset, or to look at a subset of the dataset
  – list race age if age<30
• Can also sort at the command line
  – sort age

Page 18

Interactive command line driven?

• Well, there is a little running man, after all!
• GOOD PROGRAMMING PRACTICE:
  – open a ‘do’ file
  – enter all of your commands in the do file
  – you can select one or more to run at a time
  – SAVE your do file!!!
• Window > Do-file Editor
• How to include comments? * or /*…*/

* this is how we can make a table of race and ER
tab race ercat
/* our table looks very nice.
we should really make pretty tables all the time */

Page 19

Do file of our commands so far

* slide 14: reading in data
cd "I:\Classes\StatComputingI"
insheet using "SCBC2004.csv", comma
clear
use SCBC2004.dta

* slide 15: exploring our dataset
* use d or describe
d
d ercat
codebook
codebook dodyr
sum
sum ercat
codebook ercat

* slide 17: more exploration
tab race
table race
list race age if age<30
sort age

Page 20

What about the output?

• Sometimes you want to have a file that shows the results
• Useful to share with investigators(?)
• Nice to have output saved
• My preference? Keep a really good ‘do’ file and rerun it.
• Log file setup steps:
  – File > Log > Begin
  – analyze data, etc.
  – File > Log > Suspend (or End)
• Options for text (.log) or formatted (.smcl) files
  – *.log can be opened in a text editor
  – *.smcl can only be opened in Stata but looks nicer (and can be printed from Stata)
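
The menu steps have command-line counterparts that fit naturally into a do file. The following is a minimal sketch, assuming a hypothetical log name session1.log and the built-in auto dataset as stand-in analysis.

```stata
* Close any log left open from an earlier run (no error if none is open)
capture log close

* Start a plain-text log; "session1.log" is a hypothetical file name
log using "session1.log", text replace

sysuse auto, clear
summarize mpg

* Pause and resume logging
log off
log on

* Close the log when you are done
log close
```

Leaving out the text option (and the .log extension) would produce the formatted .smcl version instead.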

Page 21

Getting stuff out of Stata

• Stata can be good for data management
• I prefer it to R
  – step 1: data management in Stata
  – step 2: write ‘clean’ file from Stata to csv
  – step 3: read clean file into R
• Exporting:
  – menu-driven: File > Export
  – command line:
    outsheet [varlist] using "file.csv", comma

Note: for the command line, you may need “replace” as an option if a file of the same name already exists and you want to overwrite it.
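
A quick way to check an export is to read the file straight back in. The sketch below does that round trip; it assumes the built-in auto dataset and a hypothetical file name auto_subset.csv written to the current working directory.

```stata
sysuse auto, clear

* Export three variables to a comma-separated file;
* "auto_subset.csv" is a hypothetical file name
outsheet make price mpg using "auto_subset.csv", comma replace

* Read the file back in, clearing the data currently in memory
insheet using "auto_subset.csv", comma clear

* Confirm that the variables and observations survived the round trip
describe
```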

Page 22

Saving Stata Data

• File > Save or Save As
• Command line:
  – save "filename", replace
  – save filename
  – save filename.dta
  – .dta will be added if you leave it off
  – replace is needed only when a file of that name already exists

Page 23

What if you don’t want to save or export everything?

• You can use the keep and drop commands to keep or drop observations or variables before exporting/saving
• Want to analyze ER, PR status, stage, age and grade in African American women.
  – drop if race==1
  – keep ercat prcat stagen age grade
• These observations and variables are GONE from Stata’s memory
• If you want them back, you need to reload the original data
• BE CAREFUL: do NOT drop variables or observations and then overwrite the original data!
• You can also include a ‘varlist’ with the outsheet command
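
One way to follow the warning above is to always save a subset under a new name. This is a sketch under assumptions: it uses the built-in auto dataset instead of the registry data, and auto_domestic is a hypothetical file name.

```stata
sysuse auto, clear

* Drop observations, then keep only the variables of interest
drop if foreign == 1
keep make price mpg

* Save under a NEW name so the original dataset is never overwritten;
* "auto_domestic" is a hypothetical file name
save "auto_domestic", replace

* The dropped rows and columns are still available from the original
sysuse auto, clear
```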

Page 24

Other options for subsetting

• by: performs command by categories
  – by race, sort: sum age
  – bysort ercat prcat: sum age
• if: performs command in a category/range
  – tab ercat if stagen>1
  – tab ercat if graden~=.
• Combine them:
  – bysort ercat prcat: sum age if ercat<9 & prcat<9
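
The same by and if patterns can be run on data that ships with Stata. The sketch below assumes the built-in auto dataset; foreign, mpg, rep78 and price are variables from that dataset, and the price cutoff is arbitrary.

```stata
sysuse auto, clear

* by requires sorted data; bysort sorts and then runs the command per group
bysort foreign: summarize mpg

* if restricts a command to observations meeting a condition;
* rep78 != . excludes observations with a missing repair record
tabulate foreign if rep78 != .

* Combine them: per-group summaries over a restricted set of observations
bysort foreign: summarize mpg if rep78 != . & price < 10000
```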

Page 25

Working with variables

• New variables can be created with the ‘generate’ command (or just ‘gen’)
• Example: grade has 4 levels.

tab graden

     graden |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        468       19.45       19.45
          2 |        916       38.07       57.52
          3 |        941       39.11       96.63
          4 |         81        3.37      100.00
------------+-----------------------------------
      Total |      2,406      100.00

• We want to create a high vs. low grade variable

Page 26

Several approaches

• gen highgrade = 1 if graden>2 & graden<.
• replace highgrade = 0 if graden<3

• gen highgrade=cond(graden>2,1,0)
• replace highgrade = . if graden==.

• Note well: Check coding of missing values!! Stata treats missing (.) as larger than any number, so graden>2 is also true when graden is missing.
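
The missing-value pitfall is easy to reproduce. Here is a small sketch, assuming the built-in auto dataset, where rep78 is missing for 5 of the 74 cars; the variable names high_naive and high are made up for illustration.

```stata
sysuse auto, clear

* Pitfall: missing (.) counts as larger than any number,
* so the 5 cars with missing rep78 are coded 1 here
gen byte high_naive = rep78 > 3

* Safer: only assign a value when rep78 is observed
gen byte high = rep78 > 3 if rep78 < .

* The missing option shows where the two versions disagree
tabulate high_naive high, missing
```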

Page 27

Extensions to generate

• ‘egen’
• Same example: egen has a function ‘cut’ that can cut a continuous variable at a list of breakpoints
• Each category runs from one breakpoint up to (but not including) the next
• Without icodes, each category is coded by its lower breakpoint; with icodes, categories are coded 0, 1, 2, ...
• The two commands below are alternatives: both create highgrade, so run one or the other

egen highgrade=cut(graden), at(-1,3,5)
egen highgrade=cut(graden), at(-1,3,5) icodes
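
To see how the two codings differ, both can be created side by side under different names. This sketch assumes the built-in auto dataset; the breakpoints 0, 20, 30, 50 and the names mpgcat and mpgcode are arbitrary choices for illustration.

```stata
sysuse auto, clear

* Intervals are [0,20), [20,30), [30,50); values are the lower breakpoints
egen mpgcat = cut(mpg), at(0,20,30,50)

* Same intervals, coded 0, 1, 2 instead
egen mpgcode = cut(mpg), at(0,20,30,50) icodes

* Cross-tabulate to see the one-to-one mapping between the two codings
tabulate mpgcat mpgcode
```

Observations falling outside the range covered by at() are set to missing, so it is worth checking the minimum and maximum before choosing breakpoints.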

Page 28

generate

• Use it for transformations
  – gen y = log(x)
  – gen y = x^2
• Generate random variables
  – gen z1 = runiform()          (uniform between 0 and 1)
  – gen z2 = 2 + 2*runiform()    (uniform between 2 and 4)
• Generate an ascending observation id, overall or by county
  – gen id = _n
  – bysort county: gen countyid = _n

Page 29

Example of using these commands together

• We want to randomly select 10 women from each of 46 counties in SC
• Step 1: generate random numbers
  – gen z1=runiform()
• Step 2: sort and number women within counties
  – sort county z1
  – by county: gen countyid=_n
• Step 3: keep only 10 women per county
  – drop if countyid>10
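
The three steps can be rehearsed on a small built-in dataset before running them on the registry data. This is a sketch under assumptions: it uses the auto dataset with foreign standing in for county, draws 5 per group instead of 10, and adds set seed (with an arbitrary seed value) so the sample is reproducible.

```stata
sysuse auto, clear

* Fix the seed so the same random sample is drawn on every run
set seed 20130207

* Step 1: one random number per observation
gen z1 = runiform()

* Step 2: put each group in random order and number within group
sort foreign z1
by foreign: gen groupid = _n

* Step 3: keep the first 5 observations per group
drop if groupid > 5

tabulate foreign
```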

Page 30

Formatting Dates

• Dates do not always maintain formatting, especially when reading data from csv files
• Two steps: generate and format
• Example Stata syntax
  – gen newdate=date(datevar, "MDY")
  – format newdate %td
• Stata treats dates as integers (formatting is like labels) so they can be manipulated
• Month, day and year can be extracted
• Also, see clock
• There are a lot of details that can be found in the help file
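
A tiny typed-in dataset is enough to try the two steps and the extraction functions. In this sketch the two date strings are made up for illustration, and the variable names yr, mo and dy are arbitrary.

```stata
clear

* Two made-up date strings, as they might arrive from a csv file
input str10 datevar
"02/07/2013"
"12/18/2015"
end

* Step 1: convert the string to a Stata date (an integer count of days)
gen newdate = date(datevar, "MDY")

* Step 2: attach a display format so the integer prints as a date
format newdate %td

* Extract the components
gen yr = year(newdate)
gen mo = month(newdate)
gen dy = day(newdate)

list
```

Because the underlying values are integers, subtracting two date variables gives the number of days between them.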

Page 31

Reshaping Data

• In Stata there is one command to reshape, IF your data are in the right format.
• From long to wide:
  – i indexes the observation (e.g., patient, hospital)
  – j indexes the repeats (e.g., year, cycle, visit)
  – Also need to list which variables vary by j

Page 32

Example: ceramide data

• Clinical trial in cancer patients
• Ceramide (et al.) were measured every two cycles in patients
• Of interest: do changes in ceramide correlate with outcome (e.g., response, survival)?
• Data provided in long format
  – i is patient_id
  – j is cycle
  – Ceramide, etc. vary per patient
  – Some variables are constant (and Stata can figure it out!)

Page 33

Reshaping ceramide data

• reshape wide collecteddate - frombaselines1p, i(patient) j(cycle)
• reshape long: once Stata has reshaped the data in the current session, it remembers the layout and can reshape back without any options
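
The long-to-wide step, and the option-free return trip, can be seen on a toy dataset. The values below are made up and only mimic the structure of the ceramide data (one row per patient per cycle).

```stata
clear

* Made-up long-format data: one row per patient per cycle
input patient cycle ceramide
1 1 10
1 3 12
2 1 8
2 3 9
end

* Long to wide: one row per patient, columns ceramide1 and ceramide3
reshape wide ceramide, i(patient) j(cycle)
list

* Stata remembers the previous reshape, so no options are needed to go back
reshape long
list
```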

Page 34

Reshaping wide to long

• Much more common
• Many researchers “grow” their datasets by columns instead of rows
• Formatting needs to be specific
  – Variable names must have a numeric suffix
  – Could require a fair amount of editing
  – Depends on how many repeats and variables there are

Page 35

Reshaping wide to long

clear
insheet using "ceramide2.csv"
rename cycle1totalceramidelevels totalceramidelevels1
rename cycle1diseasestatus diseasestatus1
rename cycle1c18ceramide c18ceramide1
rename cycle3totalceramidelevels totalceramidelevels3
rename cycle3diseasestatus diseasestatus3
rename cycle3c18ceramide c18ceramide3
rename cycle5totalceramidelevels totalceramidelevels5
rename cycle5diseasestatus diseasestatus5
rename cycle5c18ceramide c18ceramide5
rename cycle3daysfromstart daysfromstart3
rename cycle5daysfromstart daysfromstart5

reshape long daysfromstart diseasestatus totalceramidelevels c18ceramide, i(patient) j(cycle)

drop if totalcerami==.
replace daysfromstart=0 if cycle==1
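
The same wide-to-long pattern, stripped down to one measured variable, looks like this. The data are made up for illustration; the key requirement it shows is that each repeated variable shares a stub (ceramide) followed by a numeric suffix that becomes the j variable.

```stata
clear

* Made-up wide-format data: one row per patient, one column per cycle
input patient ceramide1 ceramide3 ceramide5
1 10 12 .
2 8 9 11
end

* Wide to long: the numeric suffixes 1, 3, 5 become values of cycle
reshape long ceramide, i(patient) j(cycle)

* Remove patient-cycle rows where nothing was measured
drop if ceramide == .

list
```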