sidm3: combining and restructuring datasets; creating summary … co… · sidm3: combining and...

14
SIDM 3: Combining and Restructuring Datasets SDM=Stata Data Management.doc Hilary Watt SIDM=Stata Introduction and Data Management.doc workshops SCCS=Stata Commands Crib Sheet.xls 3.1 SIDM3: Combining and restructuring datasets; creating summary data across repeated measures or across groups You might find that your data is in a very different structure to that needed for analysis. If you don’t know the commands taught here exist, you might feel very lost. I’ve known people spend years with Google & Stata help and not learn these. Stata manuals (rather than help) or this course are needed. Recap on preceding workshops, SIDM1 and SIDM2. These cover finding your way around Stata, useful basic commands, good house-keeping (such as importance of keeping a do file of commands), and getting your data into an appropriate format for analysis. This includes the need to check what you are doing, as you go along, and to label data appropriately. Learning objectives of this Session SIDM3 Merging to combine datasets. Reshaping is very useful with repeated measures data, and many different types of clustered data, to change its format, to allow for different types of analysis. The collapse command is also very useful in creating summary measures (across individuals or for groups of individuals). It is also useful in tidying up data, when occasionally information on one person is contained in 2 separate lines, and you want to combine this into one line. Learning objectives of further workshop SIDM4. Produce publication quality tables efficiently. Loops in Stata. Extracting analysis results. Contents 1. Combining/ merging datasets ......................................................................................................... 2 1.1 Terminology defining master and using datasets for merging: .............................................. 2 1.2 Merging when the merging variable(s) (person id?) are unique within each dataset ........... 3 1.3 Merging when master dataset has >one row of same values of merging var(s) (such as person id) ............................................................................................................................................ 3 1.4 Merging when using dataset has >one row with same values of merging vars. .................... 3 1.5 Combining datasets by adding one below the other. ............................................................. 4 1.6 Exercises on merging data ...................................................................................................... 4 1.7 Repeating merge command; choosing whether to keep non-merged records ..................... 4

Upload: others

Post on 25-May-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SIDM3: Combining and restructuring datasets; creating summary … co… · SIDM3: Combining and restructuring datasets; creating summary data across repeated measures or across groups

SIDM 3: Combining and Restructuring Datasets

SDM=Stata Data Management.doc Hilary Watt SIDM=Stata Introduction and Data Management.doc workshops

SCCS=Stata Commands Crib Sheet.xls 3.1

SIDM3: Combining and restructuring datasets; creating

summary data across repeated measures or across groups

You might find that your data is in a very different structure to that needed

for analysis. If you don’t know the commands taught here exist, you might

feel very lost. I’ve known people spend years with Google & Stata help and

not learn these. Stata manuals (rather than help) or this course are needed.

Recap on preceding workshops, SIDM1 and SIDM2.

These cover finding your way around Stata, useful basic commands, good house-keeping

(such as importance of keeping a do file of commands), and getting your data into an

appropriate format for analysis. This includes the need to check what you are doing, as you

go along, and to label data appropriately.

Learning objectives of this Session SIDM3

Merging to combine datasets. Reshaping is very useful with repeated measures data,

and many different types of clustered data, to change its format, to allow for

different types of analysis. The collapse command is also very useful in creating

summary measures (across individuals or for groups of individuals). It is also useful in

tidying up data, when occasionally information on one person is contained in 2

separate lines, and you want to combine this into one line.

Learning objectives of further workshop SIDM4.

Produce publication quality tables efficiently. Loops in Stata. Extracting analysis results.

Contents 1. Combining/ merging datasets ......................................................................................................... 2

1.1 Terminology defining master and using datasets for merging: .............................................. 2

1.2 Merging when the merging variable(s) (person id?) are unique within each dataset ........... 3

1.3 Merging when master dataset has >one row of same values of merging var(s) (such as

person id) ............................................................................................................................................ 3

1.4 Merging when using dataset has >one row with same values of merging vars. .................... 3

1.5 Combining datasets by adding one below the other. ............................................................. 4

1.6 Exercises on merging data ...................................................................................................... 4

1.7 Repeating merge command; choosing whether to keep non-merged records ..................... 4

Page 2: SIDM3: Combining and restructuring datasets; creating summary … co… · SIDM3: Combining and restructuring datasets; creating summary data across repeated measures or across groups

SIDM 3: Combining and Restructuring Datasets

SDM=Stata Data Management.doc Hilary Watt SIDM=Stata Introduction and Data Management.doc workshops

SCCS=Stata Commands Crib Sheet.xls 3.2

1.8 Error messages on the merge command/ tidying up prior to merging to avoid too many

that do not merge successfully: .......................................................................................................... 4

1.9 Merging options with the same variable names in both datasets ......................................... 5

1.10 Very rare choice: merging by forming all pairwise combinations of two datasets with the

same patientid, ................................................................................................................................... 6

2. Reshaping repeated measures/ clustered datasets ....................................................................... 6

2.1 Needs with reshape to long format ........................................................................................ 7

2.2 Needs and error messages with reshape wide ....................................................................... 7

2.3 Creating a count within cluster (within famid, or of repeated measures within patientid) to

prepare for reshape wide ................................................................................................................... 8

2.4 Exercises giving examples of reshape command of increasing complexity ............................ 8

3. Creating variables containing summary statistics........................................................................... 9

3.1 Reducing data (collapse) from several lines per cluster to one line of summary statistics .... 9

3.2 Generating new summary statistics variables without changing data structure ................. 10

3.3 Different functions on the generate command: ................................................................... 11

3.4 Exercises on collapse and egen commands .......................................................................... 11

3.5 Tidying up data with the occasional erroneous duplication of patientid with exercise ....... 11

3.6 Extra features with repeated measures data (creating counts of repeats & explicit

subscripting) ...................................................................................................................................... 12

4. When each line of data represents different numbers of people/different importance ............. 13

4.1 When one line of data represents several people/ items of data ........................................ 14

4.2 When different lines have different levels of precision/ importance .................................. 14

1. Combining/ merging datasets

1.1 Terminology defining master and using datasets for merging:

Master dataset is the one currently open within Stata. Note that only one dataset at a time can be

open within Stata. Opening a second data set can only be done when clearing the first one from

memory – unless we choose to combine the datasets with merge or append commands.

Using dataset which I am calling extradataset is the Stata dataset which is not currently open, which

we want to combine with the master dataset. Substitute your own dataset name for extradataset.

Page 3: SIDM3: Combining and restructuring datasets; creating summary … co… · SIDM3: Combining and restructuring datasets; creating summary data across repeated measures or across groups

SIDM 3: Combining and Restructuring Datasets

SDM=Stata Data Management.doc Hilary Watt SIDM=Stata Introduction and Data Management.doc workshops

SCCS=Stata Commands Crib Sheet.xls 3.3

Datasets are merged by a variable, patientid (or list of variables) which are present in both datasets

(these can be string or numeric variables). Substitute the name of the variable or variable list that

you want to merge by for patientid.

The following commands assume that cd command is used to change the default directory to the

one where the Stata extradataset is stored.

1.2 Merging when the merging variable(s) (person id?) are unique within each

dataset

For one to one matching, use this version of the command:

merge 1:1 patientid using extradataset //with 1 id “patientid” unique within each dataset

merge 1:1 county GPlist patientid using extradataset /* with a combination of id’s “county

GPlist patientid”, where the combination is unique within each dataset */

This merges the master and the using datasets, by matching values of the patientid variable(s). This

gives an error message, if there are two lines of data in either dataset with the same patientid (since

1:1 merging is specified – this is a useful warning when patientid is expected to be unique).

1.3 Merging when master dataset has >one row of same values of merging

var(s) (such as person id)

Dataset already open in stata might contain repeated measures data, or data on cluster members

when matching by cluster. The using data, extradataset, must have only one line of data per

patientid for this version of the command. Merging in this way puts the same information against all

lines of data for the same value of patientid/ cluster members when matching on cluster variable.

merge m:1 patientid using extradataset /* as above, can replace “patientid” with one or

more variables, according to how you want to match */

1.4 Merging when using dataset has >one row with same values of merging

vars.

So this is as above, when master and using datasets are switched around, so now patientid is unique

in the master dataset (and not in the using dataset).

merge 1:m patientid using extradataset

Page 4: SIDM3: Combining and restructuring datasets; creating summary … co… · SIDM3: Combining and restructuring datasets; creating summary data across repeated measures or across groups

SIDM 3: Combining and Restructuring Datasets

SDM=Stata Data Management.doc Hilary Watt SIDM=Stata Introduction and Data Management.doc workshops

SCCS=Stata Commands Crib Sheet.xls 3.4

Merge m:m is dangerously unpredictable. What matches with what will depend on the ordering of

your data. Similarly match 1:1 _n is to be avoided – this merges 1st observation in one dataset with

first in the other dataset, but it is impossible to be certain if the correct observations really are

matched with each other without extensive further checking.

1.5 Combining datasets by adding one below the other.

Use append command, which never matches on any variables, rather each dataset is kept on

separate rows, with one below the other. This is useful when all or many of the variables are the

same in both datasets.

append using extradataset

1.6 Exercises on merging data

1) Combine ihd_pain_first.dta and ihd_pain_more.dta, and save the result as ihd_pain.dta. Do you

need to use the merge or the append command?

2) Combine ihd_pain.dta, ihddata.dta and ihd_household.dta. Do you need to use the merge or the

append command?

When using this guide as a learning tool, in workshop format, you may want

to skip from here to the reshape section 2. The rest of this section on merging

is potentially more useful for reference.

1.7 Repeating merge command; choosing whether to keep non-merged

records

Do you want to keep only data for patientid’s that have merged? A variable _merge is created and

tabulated when “merge” is run, to show how many lines of data have merged successfully. You can

use _merge variable to drop some observations (e.g. dropping any which do not have merged results

keep if _merge==3, or dropping any, where you do not have data from the master dataset drop if

_merge==2).

Repeating the merge command: You will need to drop or rename the _merge variable before doing

another merge command. Sometimes it is useful to use the merge command many times in

succession to merge several datasets.

1.8 Error messages on the merge command/ tidying up prior to merging to

avoid too many that do not merge successfully:

An error will be obtained when patientid is not unique when it is expected/ meant to be. The

following are helpful to explore this issue:

Page 5: SIDM3: Combining and restructuring datasets; creating summary … co… · SIDM3: Combining and restructuring datasets; creating summary data across repeated measures or across groups

SIDM 3: Combining and Restructuring Datasets

SDM=Stata Data Management.doc Hilary Watt SIDM=Stata Introduction and Data Management.doc workshops

SCCS=Stata Commands Crib Sheet.xls 3.5

isid patientid // gives error message if patientid is not unique & no response when unique

help duplicates /* to explore duplicates further to decide how to deal with them –

duplicates drop drops all but the first instance of the duplicate, which might not be

appropriate – see below for better control */

help collapse /* when you want to extract information from both lines of the duplicate

patientid and want to specify how this will be done, prior to merging */

help string function /* when patientid/ matching variable is a string variable and you need

to tidy it up before you can match on it – for instance, you can omit leading and trailing

blanks (trim function) and convert all letters to lower case: gen

patientid2=trim(lower(patientid)) – this helps with consistency between patientid’s (do this

on both master and using datasets prior to merging). If this still fails, you can extract

substrings to merge on, specifying length of substring and starting place (help function

substr) although you might then need to do further checks to see if they look like valid

matches. You can read list of string functions to find other ways of extracting useful

information.

1.9 Merging options with the same variable names in both datasets

The merged dataset only contains one version of each variable, so if the same variable names are in

both datasets, with different contents, then some values may be lost.

Retaining values from the master dataset (the one currently open in Stata). Match by one key

variable, and don’t mention other variables that are the same in both datasets. Any discrepancies

between values in master and the using dataset are ignored.

Matching where all values of common variables are consistent between the datasets. Match by all

the variables which are present in both datasets (or by a chosen subset of them). Then if they do

match, you know that values on all variables agreed. You may need to additionally explore the

possibility that there are some observations that should match, even though not all these variables

are identical. You can additionally try matching by subsets of the original variables, to see how many

more matches you get, and aim to determine whether or not they look correct.

merge 1:1 patientid var1 var2 var3 using extradataset // matches by patientid, var1, var2, var3

Updating missing values in the master dataset with non-missing values from the using dataset:

merge 1:1 patientid using extradataset, update

Updating values in the master dataset with values from the using dataset (for any where there are

discrepancies), unless the value is missing in the using dataset. Use the replace option:

merge 1:1 patientid using extradataset, replace

Page 6: SIDM3: Combining and restructuring datasets; creating summary … co… · SIDM3: Combining and restructuring datasets; creating summary data across repeated measures or across groups

SIDM 3: Combining and Restructuring Datasets

SDM=Stata Data Management.doc Hilary Watt SIDM=Stata Introduction and Data Management.doc workshops

SCCS=Stata Commands Crib Sheet.xls 3.6

Put both versions of the common variables into the merged dataset, so that you can directly

compare them. Prior to merging, rename the common variables in one dataset, for instance by

adding 2 onto the end of the names (except for one or a few id variables on which you will merge).

Then merge using the one or few chosen id variables.

Compare datasets before merging using the cf command, to find discrepancies between the

variables. They will both need to be sorted in the same way and I think with the same observations

(usually same patients) for this to work. The following syntax compares all variables between the

master and the using datasets, and give results for all variables, including those that match.

cf _all using extradataset, verbose all // can list vars of interest instead of “_all”

1.10 Very rare choice: merging by forming all pairwise combinations of two

datasets with the same patientid,

According to values of patientid/ other specified merging variables: joinby patientid using

extradataset

Rarely used variation on the above: Suppose we want to hugely expand our dataset by forming

every pairwise combination possible, perhaps at least one of our initial datasets is tiny: cross using

extradataset

2. Reshaping repeated measures/ clustered datasets

People who don’t know this command can really struggle to do things which can be done quickly,

and might not have any idea what to search for to help. This is useful when you have repeated

measurements of some things, taken on the same individual, or perhaps different measurements

taken on different members of the household, or other type of cluster. This will move between the

two formats illustrated here:

help reshape provides an excellent summary of this command

Here we have families, with the ages of children, and a birth order variable (for order within

families). Long format:

famid birthorder age

1. 1 1 9

2. 1 2 6

3. 1 3 3

4. 2 1 8

5. 2 2 6

6. 2 3 2

7. 3 1 6

8. 3 2 4

9. 3 3 2

Page 7: SIDM3: Combining and restructuring datasets; creating summary … co… · SIDM3: Combining and restructuring datasets; creating summary data across repeated measures or across groups

SIDM 3: Combining and Restructuring Datasets

SDM=Stata Data Management.doc Hilary Watt SIDM=Stata Introduction and Data Management.doc workshops

SCCS=Stata Commands Crib Sheet.xls 3.7

Now, we have the same data, but with one line of data per family, with 3

age variables, representing age of first born (age1), age of second born

(age2), and age of third born (age3). Wide format:

famid age1 age2 age3

1. 1 9 6 3

2. 2 8 6 2

3. 3 6 4 2

Reshape will move between these two formats:

reshape wide age, id(famid) j(birthorder) /* from long to wide format – all 3 in names written in

brown are pre-existing variables */

reshape long age, id(famid) j(birthorder) /* from wide to long format (birthorder variable is

created by this command, it did not exist – age is not one variable but the start of a few variable

names (with numbers at the end – here age1, age2, age3) */

Both formats are really helpful; for any given analysis/ summary you will prefer a specific one. For

a whole analysis project, you will probably value both formats. Different analyses require or are

made easier by have data in a specific format.

2.1 Needs with reshape to long format

Note that to go from the second format illustrated to the first, we have variables called age1, age2

and age3. All names are the same, except for a number at the end. If they were named agefirstchild,

agesecondborn, agethirdborn, then we need to rename to age1, age2, age3 before using the

reshape command. If they are named age1_, age2_, age3_, then we need to rename to remove the

_ from the end of their names. You also need a family id variable or equivalent (personid for

repeated measures), which is uniquely (so different on each row) when going from wide to long

format. The “j” variable (named birthorder here) will be created by Stata in the process of reshaping,

from the numbers at the end of the variables names (so take values 1,2, 3 from age1, age2 and age3

vars)

2.2 Needs and error messages with reshape wide

You need a variable which uniquely defines family (here where we have children clustered within

family), or which uniquely defines person (for repeated measures clustered within person).

We need another variable (here “birthorder”), which takes a relatively small number of integer

values. The number of separate values it takes (here 1, 2, 3) corresponds to the number of new

variables created, which are here called age1, age2, age3.

All variables that vary within cluster (famid here or patientid for repeated measures data) need to be

either:

Dropped prior to reshaping

Page 8: SIDM3: Combining and restructuring datasets; creating summary … co… · SIDM3: Combining and restructuring datasets; creating summary data across repeated measures or across groups

SIDM 3: Combining and Restructuring Datasets

SDM=Stata Data Management.doc Hilary Watt SIDM=Stata Introduction and Data Management.doc workshops

SCCS=Stata Commands Crib Sheet.xls 3.8

OR specified in the reshape command

Errors messages of form: “Variable weight not constant within famid” indicate that this has not

been done – i.e. need to either drop weight or specify in the reshape command.

reshape wide age weight var1 var2 var3 var4 , id(famid) j(birthorder) // can include a long var list

This can dramatically increase the number of variables in your dataset, hence it is wise to drop ones

that are not needed instead of specifying them in this command.

2.3 Creating a count within cluster (within famid, or of repeated measures

within patientid) to prepare for reshape wide

Suppose you have patientid, then a date variable to represent different visits, but no visit number

variables. You need to initially create a visitno variable, before using the reshape command:

sort patientid visitdate // important to sort the data appropriately first

by patientid: gen visitno=_n // generate a count within patientid of the visit number

The above syntax will define new variable visitno to equal 1, 2, 3, 4, 5…, starting at 1 again for each

new patientid. Note that gen nnn=_n creates a count going from 1 (on first row), 2 (on 2nd row), 3

(on 3th row),.. We create this count within each patientid (in above syntax) to start again at 1 for

each new person, so counting repeated measures within patientid.

Reshape – further support in case of difficulties: http://www.stata.com/support/faqs/data-management/problems-with-reshape/

Examples using reshape command twice in succession:

http://www.ats.ucla.edu/stat/stata/faq/doublewide.htm

2.4 Exercises giving examples of reshape command of increasing complexity

3) Open puppies.dta, drop the names. Now reshape so that each puppy has its own line of data

showing its birth-weight.

4) Reopen the dataset puppies.dta, and this time keep the names. Now reshape so that each puppy

has its own line of data, showing its birth-weight and name.

5) Reshape kittens.dta, so that each kitten has its own line of data with its own birth-weight. Hint:

you need to rename some variables first. Now use the reshape command again (changing “long”

to “wide” in the command) to get back to the original data structure. Notice how the coding

does and does not relate to variable names, in each version of the command.

6) Reshape kittens_puppies.dta, so that each baby (kitten or puppy) has its birth-weight on its own

line.

Page 9: SIDM3: Combining and restructuring datasets; creating summary … co… · SIDM3: Combining and restructuring datasets; creating summary data across repeated measures or across groups

SIDM 3: Combining and Restructuring Datasets

SDM=Stata Data Management.doc Hilary Watt SIDM=Stata Introduction and Data Management.doc workshops

SCCS=Stata Commands Crib Sheet.xls 3.9

7) Reshape husband_anx_deprn.dta so that the baseline and follow-up PHQ scores both appear on

the same line.

8) Reshape wife_deprn.dta so that the baseline and follow-up PHQ scores both appear on the same

line. Hint: you will need to firstly create a variable that contains the visit number, using _n.

9) Reshape couple_deprn_anxiety2.dta dataset, so that the baseline and the follow-up visits occur

on separate lines. Reshape again, so that data for husbands and wives also occur on separate

lines. (wife/baseline, wife/follow-up, husband/baseline and husband/follow-up, all on separate

rows for each family).

10) Reshape the ihddata.data into wide format.

3. Creating variables containing summary statistics

3.1 Reducing data (collapse) from several lines per cluster to one line of

summary statistics

Use this where you have one line of data per repeated measurement, when you want one line of

data giving summary measures per person. Or when you have one line of data per person, when you

want summary measures per cluster of people.

collapse var1 var2 var3, by(patientid) // gives patient-specific means of var1, var2, var3

The default statistic given is the mean, hence the above gives the mean of the specified variables for

each patientid. Specifying by patientid results in one line of data per patientid (this implies that the

data contained repeated lines for at least one/ some of the patients). Replace patientid by your

clustering variable (which could be gp_practice_no or familyno or similar).

We can specify some other summary measurement if we prefer:

collapse (min) var1 var2 var3, by(patientid) // gives patient-specific minimums of var1, var2, var3

We might want different summary measures for different variables:

collapse (min) var1 (mean) var2 var3, by(patientid) /* gives patient-specific minimum for var1,

. and patient-specific means of var2 and var3 */

By default, the variables in the transformed dataset (which are summary statistics) have the same

name as the original variables. So in the transformed (collapsed) dataset has var1=patient-specific

minimums of values of the original var1. Var2 = patient-specific mean of values of the original var2. If

we want both the mean and the minimum of the same variable, then we need to rename one of

them:

collapse (min) minvar1=var1 minvar2=var2 (mean) var2 var3, by(patientid) /* gives patient-

. specific minimum for var1 (renamed minvar1), for var2 (renamed minvar2)

. and patient-specific means of var2 and var3 (var2 and var3 now represent these means */

Page 10: SIDM3: Combining and restructuring datasets; creating summary … co… · SIDM3: Combining and restructuring datasets; creating summary data across repeated measures or across groups

SIDM 3: Combining and Restructuring Datasets

SDM=Stata Data Management.doc Hilary Watt SIDM=Stata Introduction and Data Management.doc workshops

SCCS=Stata Commands Crib Sheet.xls 3.10

See help collapse for list of different summary statistics available. Any variables not specified in the

collapse command will be dropped. If you want to keep them, then consider using a long list of

variables in the by( ) part of the command (this is useful for any variables which do not vary by the

cluster variable, so in this example for variables which are constant for each patientid).

3.2 Generating new summary statistics variables without changing data

structure

Generating new variables which are summary statistics of one variable, or summary statistics for

each variable across a number of variables: egen command

egen weightmean= mean(weight)

creates a new variable (here called weightmean) which is a constant = mean of whatever is in

brackets (the single variable, weight in this example, or could be an expression).

by gender: egen bmimean=mean(weight/height^2)

creates a new variable (here called bmimean) which is a constant for each gender, equal to mean of

the variable/ expression in brackets. Many other summary statistics are available, as well as counts

of how many observations in that variable are equal to a single specified value, or equal to any one

of a number of integer values. See help egen.

There are also many egen options with “row” in them, which give the summary statistics for each

observation (row of data), across the variables listed. These generally work even if the value on

some variables are missing, they give summary statistics across non-missing values.

egen medianbp= rowmedian(bp1 bp2 bp3 bp4 bp5 bpp6 bp7 bp8)

This gives the median across the variables bp1, bp2, bp3, bp4, bp5, bp6, bp7, bp8. It calculates this

separately for each row of data. If there are missing values then it still gives an answer (the median

of the non-missing values). Help egen gives you a list of all the possible things you can do with this

command. You might think that you can easily calculate the mean of 5 variables without the egen

command, but you can’t when you have missing data.

gen sbp_mean = (sbp1+sbp2+sbp3+sbp4+sbp5)/5

replace sbp_mean=(sbp1+sbp2+sbp3+sbp4)/4 if sbp5=.

replace sbp_mean=(sbp1+sbp2+sbp3+sbp5)/4 if sbp4=.

There are 32 different possible combinations of missing data, so we might need 30 replace

statements. This following does all this in just one command:

egen sbp_mean=rowmean(sbp1 sbp2 sbp3 sbp4 sbp5)

Page 11: SIDM3: Combining and restructuring datasets; creating summary … co… · SIDM3: Combining and restructuring datasets; creating summary data across repeated measures or across groups

SIDM 3: Combining and Restructuring Datasets

SDM=Stata Data Management.doc Hilary Watt SIDM=Stata Introduction and Data Management.doc workshops

SCCS=Stata Commands Crib Sheet.xls 3.11

3.3 Different functions on the generate command:

To find out what functions you can use, type help function. I use this more than any other help.

Maths functions – including logs, exponential, min, max, total, many trigonometry functions,

rounding numbers, absolute value, sign of number, logit.

Density function – for probability distributions and density functions

Random number functions – in case you want to generate random numbers

Date and time functions – may be useful

String functions – for dealing with string variables, including options for turning them into numeric

data, and also searching strings for particular characters, and selecting part of a string.

Also remember help egen, to give options available for this related command.

3.4 Exercises on collapse and egen commands

11) With the puppies.dta dataset, find the mean birth weight of each litter, using the egen

command, with the option rowmean. Use help egen, and read down the list to find what other

row summary statistics are available (e.g. median, percentiles, SD, iqr). What are the advantages

of using this command over the following?

gen mean_bwt=(birthweight1+ birthweight2+ birthweight3+ birthweight4+ birthweight5)/5.

12) Read the husband_anx_deprn.dta dataset in and (without reshaping) use the egen command

with the option mean to find mean depression score in each man.

13) Read the husband_anx_deprn.dta dataset in again and use the collapse command to find the

mean depression score for each woman, and also record the number of scores used to create

the average. Use help collapse to see what other summary statistics are available. Notice that

you now have only one line of data per man, and probably fewer variables.

14) With ihddata.dta, you want variables which say whether or not any member of the household

has had CVD added to this dataset. How would you go about producing this?

15) Suppose you want to create a new variable which says whether or not each household member

has had either diagnosed high BP, or CVD. How do you create this?

3.5 Tidying up data with the occasional erroneous duplication of patientid

with exercise

SIDM2 section 5 gives an example of the duplicates command being used in practice.

Page 12: SIDM3: Combining and restructuring datasets; creating summary … co… · SIDM3: Combining and restructuring datasets; creating summary data across repeated measures or across groups

SIDM 3: Combining and Restructuring Datasets

SDM=Stata Data Management.doc Hilary Watt SIDM=Stata Introduction and Data Management.doc workshops

SCCS=Stata Commands Crib Sheet.xls 3.12

16) Look at the couple_deprn_anxiety dataset. Note that there are a few lines of data with one

familyid (=255), yet we only expect one line per family. Tidy up the dataset, so that there is just

one line for this familyid – the collapse command might be useful here, although then we lose

the variable labels (though it is possible to get the original ones back again). We might prefer the

use of egen commands and then dropping duplicate lines of data.

3.6 Extra features with repeated measures data (creating counts of repeats &

explicit subscripting)

These are useful in some situations, and give great flexibility. There are sometimes simpler ways of

doing these things.

Explicit subscripting is a versatile tool, to check for duplicates, to carry values forward from one line

to the next (perhaps creating changes from last time point, with data sorted by time).

To count number of unique value of patientkey:

sort patientkey

count if patientkey!= patientkey [_n-1] // after sorting, will number of unique values of patientkey

This can be written “count if patientkey[_n]!=patientkey[_n-1]” – [_n] denotes the nth line of data, so

this counts when patientkey on the nth line is not equal to patientkey on the (n-1)th line of data, i.e.

on the row before. This comparison is made for all n, from 1 to N, where N=number of observations

in your data set. [Compare this to a more common command which uses “implicit subscripting”,

which means we don’t need to give the subscripting. e.g. “count if sex=gender”, which is equivalent

to “count if sex[_n]==gender[_n]”, which compares the nth line of data for sex with the nth line of

data for gender, for n=1,2,3… N. In other words, for each line of data, it looks to see if sex and

gender are the same.]

Checking for duplicates:

count if patientkey== patientkey [_n-1] // after sorting, will count duplicate values of patientkey

list if patientkey==patientkey[_n-1] | patientkey==patientkey[_n+1] // after sorting, will list all

variables in lines of data which have the same patientkey as another line of data.

“help duplicates” does the above 2 duplicates commands with slightly less fiddly syntax – though the

price you pay is needing to learn more different commands/ formats of commands, rather than

using explicit subscripting, which is highly flexible once learnt.

Creating change variables: for example create time intervals, with repeated measures at different

times within patientkey:

Page 13: SIDM3: Combining and restructuring datasets; creating summary … co… · SIDM3: Combining and restructuring datasets; creating summary data across repeated measures or across groups

SIDM 3: Combining and Restructuring Datasets

SDM=Stata Data Management.doc Hilary Watt SIDM=Stata Introduction and Data Management.doc workshops

SCCS=Stata Commands Crib Sheet.xls 3.13

sort patientkey time // crucial to sort the data first

gen time_int=time-time[_n-1] if patientkey==patientkey[_n-1] // creates time intervals

The above command results in time_int=., so missing at the earliest time for each patientkey. Then

for later time, it results in the time interval since the previous record on that same person.

Creating count variables

gen nnn=_n – this creates a variable which has the same value as the observation number

printed down LHS of data when browsed. You might want to sort appropriately first.

gen cnt=_N – this creates a variable which gives the total number of observations when it is

created, taking the same value for each observation.

sort patientkey time

by patientkey: gen timept=_n – creates a variable called timept which counts the

observation in each person (defined by variable patientkey) from the earliest time (timept=1) to the

latest time for that person.

by patientkey: gen notimept=_N – creates a variable called notimept which counts the

total number of time points (i.e. lines of data) within each person

browse patientkey timept notimept // check what is created

tab notimept if timept==1 // table of number of time points for each person – avoiding duplicate

counting of people

Keeping first and last values in our dataset, and putting both onto the same row of data:

keep if timept==1 | timept==notimept // keeps just first and last timepoints in each person

gen time_end=time if timept==notimept // create a new var=final time point, only created on the

lines of data representing the final time points

by patientkey: egen time_end2=median(time_end) // this puts the time_end data onto all lines of

data (now named time_end2)

keep if timept==1 // keeps just one line of data per person, now with first and last values

4. When each line of data represents different numbers of

people/different importance

Page 14: SIDM3: Combining and restructuring datasets; creating summary … co… · SIDM3: Combining and restructuring datasets; creating summary data across repeated measures or across groups

SIDM 3: Combining and Restructuring Datasets

SDM=Stata Data Management.doc Hilary Watt SIDM=Stata Introduction and Data Management.doc workshops

SCCS=Stata Commands Crib Sheet.xls 3.14

4.1 When one line of data represents several people/ items of data

Use the expand command to create extra lines of data, so you can end up with one line of data per

person.

expand 2 // creates a duplicate line of data for each existing line of data

expand frq /* creates news lines of data for each existing line of data when frq>=2, so when frq is a

variable indicating how many duplicates of each line of data there are */

Alternatively you can use weights after each command. This can be used after very many different

commands, by adding [fweight=frq] to the end of them as follows:

tab state [fweight=frq]

If you are interested in such commands because your data is not in Stata at all, then note that there

are immediate versions of some commands, where the data is written into the command line:

tabi 12 3 \ 43 5, chi2 // creates a table and perform a chi-squared test on data in this command line

See also help ttesti for immediate version of t-test commands.

4.2 When different lines have different levels of precision/ importance

Sampling weights denote the inverse of the probability that the observation is included, based on

sampling design. They can be added to the end of virtually any command:

regress yvar xvar [pweights=invprob] /* weighted linear regression, where the var named invprob

represents the inverse of probabilities, i.e. 1/probability for each observation */

There is a suite of survey commands to deal with such data, which deals with more sophisticated

data too – see Stata manual [SVY]

Analytic weights, are weights that are inversely proportional to the variance of an observation;

that is, the variance of the jth observation is assumed to be sigma^2/w_j, where w_j are the weights.

Typically, the observations represent averages and the weights are the number of elements that

gave rise to the average. For most Stata commands, the recorded scale of aweights is irrelevant;

Stata internally rescales them to sum to N, the number of observations in your data, when it uses

them.

ttest yvar, by(var2) [aweights=inv_var] /* weighted t-test, where the var named inv_var represents

the inverse of the variance of the data, yvar */