Introduction to Data Handling

Post on 15-Jan-2016


Page 1: Introduction to Data Handling. A Fast Hour Review of data types  Scalar, ordinal, nominal Decisions regarding encoding data  Turning information into

Introduction to Data Handling

A Fast Hour

• Review of data types: scalar, ordinal, nominal
• Decisions regarding encoding data: turning information into analyzable data; dealing with missing data
• The structure of experimental data: getting things into 2-dimensional (or a few dimensional) tables
• Deciding on which software to use: Excel, spreadsheet-style analysis packages, scripted analysis

Review of Data Types

• Scalar: continuous or discrete
• Ordinal
• Nominal

Scalar Data

• Continuous data: real numbers used to measure magnitude; unbounded in at least one direction
  Ex: average Dilantin level, (3.1 + 4.4)/2 = 3.75
• Discrete data: data that can take on only a countable set of distinct values (typically whole numbers); unbounded in at least one direction
  Ex: average number of fingers, (5 + 4)/2 = 4.5, which is not a possible count but is ‘in between’ 4 and 5

• Truly continuous data are theoretical; you don’t run into them in the real world
• Because of limitations of measurement (e.g., significant figures), recorded scalar data are actually discrete
• In most real-life applications, discrete data can be handled as if continuous. Just beware of the ‘2.3 kids’ problem
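The ‘2.3 kids’ problem can be illustrated with a minimal sketch. The household counts below are hypothetical values made up for illustration; they are not from the talk.

```python
from statistics import mean, median

# HYPOTHETICAL counts of children per household (illustrative values only).
kids_per_household = [2, 3, 2, 1, 4, 2, 2]

# Treating the counts as if continuous gives a value no household can have.
average = mean(kids_per_household)

# With an odd number of observations, the median is one of the observed counts.
typical = median(kids_per_household)

print(f"mean = {average:.1f} kids, median = {typical} kids")
```

Handling the counts as continuous is fine for most summaries; the caution only applies when the result is read as a value a single observation could take.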

Ordinal Data

• Data whose attributes are ordered, but for which the numerical differences between adjacent attributes are not necessarily interpreted as equal
• Bounded: the scale has some upper and lower limit
• Classic example: Glasgow Coma Scale (GCS)
  • A GCS of 4 intuitively ranks lower than a GCS of 5
  • The difference between a GCS of 14 and 15 is not the same as the difference between a GCS of 3 and 4
  • GCS of 4 + GCS of 5 ≠ GCS of 9

Nominal Data

• Categories may have an assigned numerical value for analytical reasons, but there is no numerical underpinning for those values
• Example: Race
  African American = 1, Hispanic = 2, Asian = 3
  1 + 2 ≠ 3 (the codes are only labels, so arithmetic on them is meaningless)
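A short sketch of what is and is not meaningful for nominal codes. The code-to-label mapping follows the slide; the list of coded responses is hypothetical.

```python
from collections import Counter

# Codes as assigned on the slide; they are labels, not quantities.
RACE_LABELS = {1: "African American", 2: "Hispanic", 3: "Asian"}

# HYPOTHETICAL coded responses, made up for illustration.
coded_responses = [1, 3, 2, 1, 1, 3]

# Counting how often each category occurs is meaningful for nominal data.
counts = Counter(RACE_LABELS[code] for code in coded_responses)

# Averaging or summing the codes would run without error, but the result
# would have no interpretation, so it is deliberately not computed here.
for label, n in counts.most_common():
    print(label, n)
```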

Turning information into analyzable data

• Discrete data are usually easy: age, vital signs, one-dimensional measures (e.g., Hgb, time-to-relapse)
• Ordinal and nominal data get tricky
  • If you’re only going to do descriptive statistics, it doesn’t matter much
  • If you’re going to model (e.g., do regression), it gets involved

Real Life Example from the Camp Survey

Question 3. On a usual camp day, the person on site with the highest level of health care training is a:
• Physician
• Registered nurse
• Licensed practical nurse
• Licensed paramedic
• Licensed EMT
• Licensed first responder
• First aid provider

What type of variable would you use?

One choice: a continuous variable, Var_Years

On a usual camp day, how many years of training has the senior-most caregiver completed?

Another, more likely choice: an ordinal variable, Var_Caregiver

  Physician       = 1
  RN              = 2
  LPN             = 3
  Paramedic       = 4
  EMT             = 5
  First responder = 6
  First aid       = 7

A third choice: seven nominal ‘dummy variables’

  Var_MD       = 1 or 0 (yes or no)
  Var_RN       = 1 or 0
  Var_LPN      = 1 or 0
  Var_Para     = 1 or 0
  Var_EMT      = 1 or 0
  Var_Respond  = 1 or 0
  Var_FirstAid = 1 or 0
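The ordinal and dummy-variable codings can be generated from the same answer. This is a minimal sketch: the level names follow the slide’s Var_* naming, while the helper functions `ordinal_code` and `dummy_codes` are hypothetical names introduced here.

```python
# Level names follow the slide's Var_* variables, ordered from most to least training.
LEVELS = ["MD", "RN", "LPN", "Para", "EMT", "Respond", "FirstAid"]


def ordinal_code(level):
    """Return the Var_Caregiver code (1 to 7) for a caregiver level."""
    return LEVELS.index(level) + 1


def dummy_codes(level):
    """Return the seven 0/1 dummy variables for a caregiver level."""
    if level not in LEVELS:
        raise ValueError(f"unknown caregiver level: {level}")
    return {f"Var_{name}": int(name == level) for name in LEVELS}


print(ordinal_code("LPN"))
print(dummy_codes("LPN"))
```

Exactly one dummy variable is 1 for any given answer, which is why the seven columns carry the same information as the single ordinal code minus its ordering.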

Who cares? The same answer, coded as Var_Caregiver or as 7 dummy variables:

  Var_Caregiver   Var_MD  Var_RN  Var_LPN  Var_Para  Var_EMT  Var_Respond  Var_FirstAid
  1               1       0       0        0         0        0            0
  2               0       1       0        0         0        0            0
  3               0       0       1        0         0        0            0
  4               0       0       0        1         0        0            0
  5               0       0       0        0         1        0            0
  6               0       0       0        0         0        1            0
  7               0       0       0        0         0        0            1

Var_Years: real numbers

A Basic Modeling Problem

Is there a relationship between the level of on-site caregiver training and the number of deaths per year at camp?

Deaths = f (Caregiver Level)

[Figure: number of deaths (y-axis) plotted against Var_Caregiver, 1 to 7 (x-axis)]

$\text{Deaths} = b_1 x_1 + b_0$

where $x_1$ = Var_Caregiver (1-7), $b_1$ = a coefficient, and $b_0$ = the y-intercept

[Figure: number of deaths (y-axis) plotted for each dummy variable on the x-axis: Var_MD, Var_RN, Var_LPN, Var_Para, Var_EMT, Var_Respond, Var_First_Aid]

$\text{Deaths} = b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_4 + b_5 x_5 + b_6 x_6 + b_7 x_7 + b_0$

where $x_1$ = Var_MD, $x_2$ = Var_RN, etc., $b_1$ to $b_7$ are the coefficients for each $x$, and $b_0$ = the y-intercept

Single variable (number of deaths vs. Var_Caregiver, 1 to 7)
• Pros: easy to compute; easy to understand
• Cons: forces a ‘continuous’ structure onto Var_Caregiver that may not really exist

Dummy variables (number of deaths vs. Var_MD through Var_First_Aid)
• Pros: agrees more closely with experimental results; doesn’t impose any relationship between different provider levels
• Cons: less easy to understand; ‘discards’ the knowledge that some caregivers have more training than others

Decisions regarding how to encode data

• Stick as close to the ‘raw measurement’ as you can

Stick close to original measurement

When you call an ambulance in an emergency, how long does it take for the ambulance to get to your camp?
• < 5 minutes (Time = 1)
• 5-10 minutes (Time = 2)
• 10-15 minutes (Time = 3)
• 15-20 minutes (Time = 4)
• > 20 minutes (Time = 5)
• Don’t know (Time = 6)

Good, bad, or indifferent?

The same question, asked closer to the original measurement:
• Do you know how long it would take an ambulance to respond to a call from your camp? (y/n)
• If so, how many minutes? (some discrete number)

• Stick as close to the ‘raw measurement’ as you can
  • Abstraction seems useful, but distances you from what you were originally looking at
  • Keep continuous data continuous if at all possible
  • Likewise, preserve ordinal and nominal data
  • Later on, you can ‘digest’ the raw data into categories, etc., as necessary

Remember: data can always be made more general during analysis. They cannot be made more specific.
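Generalizing during analysis can be sketched as a one-way function from raw minutes to the survey’s time bands. The function name `time_category` is hypothetical, and because the bands as written share their endpoints (e.g., 10 appears in both ‘5-10’ and ‘10-15’), the boundary handling below is an assumption: a shared endpoint goes to the higher band, except 20, which stays in ‘15-20’ because the last band is strictly ‘> 20’.

```python
def time_category(minutes):
    """Collapse a raw response time in minutes into the survey's Time codes 1-5.

    Returns None when the time is unknown, so that 'don't know' stays
    separate from the measurement instead of becoming a sixth time band.
    """
    if minutes is None:
        return None
    if minutes < 5:
        return 1
    if minutes < 10:
        return 2
    if minutes < 15:
        return 3
    if minutes <= 20:
        return 4
    return 5


# HYPOTHETICAL raw answers in minutes; None means the respondent did not know.
raw_minutes = [3, 12, 20, 25, None]
print([time_category(m) for m in raw_minutes])
```

Going from minutes to bands is a single function call; there is no function that recovers 12 minutes from ‘Time = 3’.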

• Avoid bundling more than one idea into a single variable
  Ex.: <5, 5-10, 10-15, 15-20, >20, ‘Don’t Know’ (mixes the response time with whether the respondent knows it)

• Use a specific plan for missing data!

Missing Data

• Blank data cells are ambiguous
  • Data not provided/collected?
  • Data erroneously omitted?
  • Data provided but nonsensical?
• Note: many statistical packages will ignore an entire ‘observation’ if a data point is missing!

• Pick something (other than nothing) to denote a missing data point; ‘.’ or ‘Null’ are commonly used
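A minimal sketch of a missing-data plan, assuming ‘.’ is the chosen marker. The small CSV below is hypothetical, and the field names `sbp` and `dbp` are introduced only for the example. An explicit ‘.’ is treated as a known-missing value, while a truly blank cell is flagged for follow-up because its meaning is ambiguous.

```python
import csv
import io

MISSING = "."  # the agreed marker for a data point known to be missing
FIELDS = ("sbp", "dbp")

# HYPOTHETICAL data file contents: row 2 uses the marker, row 3 has a blank cell.
raw = "obs,sbp,dbp\n1,114,54\n2,93,.\n3,,49\n4,58,38\n"


def classify(cell):
    """Label a cell as a real value, an explicit missing marker, or an ambiguous blank."""
    if cell == MISSING:
        return "missing"
    if cell.strip() == "":
        return "ambiguous blank"
    return "value"


rows = list(csv.DictReader(io.StringIO(raw)))

# Blank cells need a human decision before analysis.
to_review = [(row["obs"], field) for row in rows for field in FIELDS
             if classify(row[field]) == "ambiguous blank"]

# Complete-case view: what remains if a package drops every observation
# that has any missing data point.
complete_cases = [row for row in rows
                  if all(classify(row[field]) == "value" for field in FIELDS)]

print("cells to review:", to_review)
print("complete observations:", [row["obs"] for row in complete_cases])
```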

The Structure of Data

• Statistical analysis is based on the idea of ‘observations’
  • An observation is often a patient (and all of the data you collect about that patient)
  • Really, it is just an experimental ‘unit’ or ‘trial’, such as one summer camp or one hospital day
• Any analysis of many observations requires that you establish a ‘structure’ for your observations

• You’ll need to think about the ‘shape’ of your experimental data early in your study, preferably during planning
• Fortunately, very many data sets can be structured into a tabular form
  • For better or worse, Excel is used really often

  Obs #   Last Name   Systolic BP   Diastolic BP
  1       Fawcett     114           54
  2       Smith       93            42
  3       Jackson     78            49
  4       Ladd        58            38

Columns are fields; rows are observations.

Don’t confuse a 2-dimensional data table with 2-dimensional data!

Ultimately, every observation is a mathematical ‘vector’ that completely describes that event in an n-dimensional space

[Figure: Fawcett, Smith, Jackson, and Ladd plotted as points, with SBP and DBP as the axes]

Your data have as many dimensions as they have data fields!

(Unavoidable) Shortcomings of Tabular Data

• Large number of fields or observations: difficult to ‘look’ at all of the data
• Troubles with repeated measures

Handling Repeated Measure Data in a Tabular Data Structure

[Table: one row per patient, with columns Patient ID, Weight Day 1, BUN Day 1, Urine Day 1, Weight Day 2, BUN Day 2, Urine Day 2, and so on through Weight Day 5, BUN Day 5, Urine Day 5]

Handling Repeated Measures in Tabular Data Structures

• The ‘Day in the Life’ strategy
• A patient day becomes the observation
• Can be a more compact way of saving data

  Obs #   Last Name   Hospital Day   Systolic BP
  1       Fawcett     1              84
  2       Fawcett     2              72
  3       Fawcett     3              84
  4       Smith       1              94
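Moving from one wide row per patient to one row per patient day can be sketched as below. The rows, the column naming pattern `<measure>_day<N>`, and the function name `to_patient_days` are all hypothetical, chosen only to illustrate the reshaping.

```python
# HYPOTHETICAL wide-format rows: one row per patient, one column per measure per day.
wide_rows = [
    {"patient_id": "P01", "weight_day1": 70.2, "bun_day1": 14,
     "weight_day2": 69.8, "bun_day2": 18},
    {"patient_id": "P02", "weight_day1": 82.5, "bun_day1": 22,
     "weight_day2": 82.9, "bun_day2": 20},
]


def to_patient_days(rows, measures, days):
    """Reshape wide rows into one record per patient per day."""
    long_rows = []
    for row in rows:
        for day in days:
            record = {"patient_id": row["patient_id"], "day": day}
            for measure in measures:
                # .get() yields None when a patient has no column for that day.
                record[measure] = row.get(f"{measure}_day{day}")
            long_rows.append(record)
    return long_rows


long_rows = to_patient_days(wide_rows, measures=["weight", "bun"], days=[1, 2])
for record in long_rows:
    print(record)
```

In the long form, adding a sixth study day means adding rows rather than adding three new columns to every patient.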

Using Relational Databases for More Complex Data

[Diagram: linked tables for Demographic Data, Daily Data (for each of 7 study days), Bacterial Isolate Data, and Outcome Data]
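A relational layout can be sketched with Python’s built-in `sqlite3` module. This is only an illustration: the table and column names are hypothetical, only two of the tables are shown, and the rows reuse the names and systolic pressures from the patient-day table in this talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the sketch

# One table per kind of data; daily rows point back to the patient they belong to.
conn.executescript("""
    CREATE TABLE patient (
        patient_id INTEGER PRIMARY KEY,
        last_name  TEXT NOT NULL
    );
    CREATE TABLE daily (
        patient_id   INTEGER NOT NULL REFERENCES patient(patient_id),
        hospital_day INTEGER NOT NULL,
        systolic_bp  INTEGER,
        PRIMARY KEY (patient_id, hospital_day)
    );
""")

conn.executemany("INSERT INTO patient VALUES (?, ?)",
                 [(1, "Fawcett"), (2, "Smith")])
conn.executemany("INSERT INTO daily VALUES (?, ?, ?)",
                 [(1, 1, 84), (1, 2, 72), (1, 3, 84), (2, 1, 94)])

# Joining the tables rebuilds the flat 'patient day' view on demand.
patient_days = conn.execute("""
    SELECT p.last_name, d.hospital_day, d.systolic_bp
    FROM patient AS p
    JOIN daily AS d ON d.patient_id = p.patient_id
    ORDER BY p.patient_id, d.hospital_day
""").fetchall()

for row in patient_days:
    print(row)
```

Demographic details are stored once per patient, however many daily rows that patient accumulates.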

Deciding Which Software to Use

Some useful ground rules:

1. Use software with all of the tools you need

2. Don’t make things unnecessarily complicated

3. Know in advance what your statistical collaborators are going to use, and how they like the data to appear

Data-entry level tools

• Input method other than just entering fields in an Excel spreadsheet
  • ‘Forms’-type page
• Interface with other data types
  • Interface with Scantron
  • Interface with analytical instruments

• Entry error control
  • Double entry
  • Restricted data fields that must fit a particular format or be rejected
• Merging data sets
  • Doing this by hand is fine for 15 patients, but not for 1,500
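Double entry and restricted fields can both be sketched in a few lines. The function names, field names, and example records are hypothetical; the date pattern assumes the MM/DD/YYYY layout and checks only the shape of the text, not whether the date exists.

```python
import re

# Restricted field: exactly two digits, slash, two digits, slash, four digits.
DATE_PATTERN = re.compile(r"\d{2}/\d{2}/\d{4}")


def valid_date_field(text):
    """Accept a date only if it fits the MM/DD/YYYY layout."""
    return DATE_PATTERN.fullmatch(text) is not None


def double_entry_mismatches(first_pass, second_pass):
    """Compare two independent entries of the same records, field by field.

    Returns (record number, field, first value, second value) for each disagreement.
    """
    mismatches = []
    for number, (a, b) in enumerate(zip(first_pass, second_pass), start=1):
        for field in a:
            if a[field] != b[field]:
                mismatches.append((number, field, a[field], b[field]))
    return mismatches


# HYPOTHETICAL records typed in twice; the second pass has a transposed value.
first_pass = [{"patient_id": "P01", "sbp": "114"}, {"patient_id": "P02", "sbp": "93"}]
second_pass = [{"patient_id": "P01", "sbp": "114"}, {"patient_id": "P02", "sbp": "39"}]

print(double_entry_mismatches(first_pass, second_pass))
print(valid_date_field("01/15/2016"), valid_date_field("1/15/16"))
```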

Data manipulation needs

• Do your data need some post-collection modification prior to analysis?
  • Transformation (e.g., log-transforming to achieve normal statistical distributions)
  • Relabeling missing data fields
  • Text or numerical string modification, e.g., changing all dates to MM/DD/YYYY
  • Internal data consistency checks, e.g., is the number of ICU days < the number of hospital days?
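Three of these manipulations can be sketched with the standard library. The function names are hypothetical. Two assumptions are made: incoming dates are taken to be in YYYY-MM-DD form unless another format is given, and the consistency check allows ICU days to equal hospital days (an entire stay spent in the ICU), which is slightly looser than the strict ‘<’ on the slide.

```python
import math
from datetime import datetime


def log_transform(values):
    """Base-10 log transform; every value must be positive."""
    return [math.log10(v) for v in values]


def to_mm_dd_yyyy(text, input_format="%Y-%m-%d"):
    """Rewrite a date string as MM/DD/YYYY (input layout is an assumption)."""
    return datetime.strptime(text, input_format).strftime("%m/%d/%Y")


def icu_days_consistent(icu_days, hospital_days):
    """Internal consistency: ICU days cannot be negative or exceed hospital days."""
    return 0 <= icu_days <= hospital_days


print(log_transform([10, 100, 1000]))
print(to_mm_dd_yyyy("2016-01-15"))
print(icu_days_consistent(3, 10), icu_days_consistent(12, 10))
```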

What analyses are you going to perform?

• Summary statistics: frequencies, means, etc.
• Simple x by y regressions
• Contingency tables (and χ²)
• ANOVA
• Multivariate modeling
• Logistic modeling
• Nonlinear modeling

The list runs from ‘Easy in Excel’ at the top, through ‘Not Easy in Excel’, to ‘Best Handled in Dedicated Stats Packages or Elsewhere’ at the bottom.

Output needs

• Tabular data that can be dumped into a word processor
  • Text files
  • Cut-and-paste
• Graph preparation and dumping
  • Cut-and-paste
  • Specialized output formats: .tif, .jpg, .svg, MS metafiles
  • Colors (RGB v. CMYK)

Other needs you might not have thought about but that are really important

• Interim “noodling”-type analysis
• Needing to repeat the analysis on multiple data sets, or to ‘update’ the analysis if new data become available

Software options, listed in order of increasing level of organization and increasing front-end time:

• Spreadsheets: Excel
• Spreadsheet and ‘pull-down’ stats packages: SPSS, Prism (GraphPad), JMP
• Database managers: Access, FoxPro
• Scripted statistical languages: SAS, R, MATLAB

Handling Your Data in Excel

• Few up-front requirements
  • Load your data and you’re ready to go
  • Many simple stats can be done as ‘one off’ analyses
• VERY inflexible
  • You pay for your choice later on in debugging, rerunning analysis, editing the data set, etc.

Using Spreadsheet-‘Pulldown’ Stats Packages

• Is the most power most users will ever need
  • Slightly more up-front time
  • Forced data structures are like eating oatmeal
  • Most have integrated graphics utilities
• Some unusual applications are tough to manage
  • Nonlinear analysis

Using Scripted Statistical Packages

• When you anticipate running relatively complicated analyses on a series of data sets
• When you can design the analysis plan without having all of the data available
• When you must document exactly how you did your analysis and be able to exactly duplicate it at will
  • Which is arguably every time (!)
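The appeal of a scripted analysis can be sketched as one function applied unchanged to every data set. The function name `summarize` and the site names are hypothetical; the first list reuses the systolic pressures from the blood-pressure table in this talk, and the second list is made up.

```python
from statistics import mean, stdev


def summarize(values):
    """One fixed, documented analysis step: n, mean, and sample standard deviation."""
    return {"n": len(values), "mean": mean(values), "sd": stdev(values)}


# HYPOTHETICAL data sets keyed by site; rerunning the script after new data
# arrive repeats exactly the same analysis on the updated lists.
datasets = {
    "site_a": [114, 93, 78, 58],
    "site_b": [120, 101, 96, 88, 110],
}

results = {name: summarize(values) for name, values in datasets.items()}
for name, summary in results.items():
    print(name, summary)
```

The script itself is the record of how the analysis was done, which covers the documentation requirement without extra effort.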

library(made4)  # assumption: heatplot() is not base R; it is taken here to come from the Bioconductor package made4
g <- read.csv("expdata2.csv", header = TRUE)  # read the experimental data, first row as column names
gmat <- as.matrix(g)                           # convert the data frame to a matrix
gmati <- gmat * -1                             # flip the sign of every value
heatplot(gmati, Colv = NA)                     # draw the heat map; Colv = NA kept as in the original call

Back-End Utilities

• Graphical output
  • Excel has horrible graphics that can be spotted a mile away in journals
  • Most stats packages will do better
• Consider ‘post-processing’ in dedicated graphics software, e.g., Adobe Illustrator

Research is a Data Business, Use the Tools at Your Disposal

[Diagram with boxes: Data Input System; Dedicated Database Manager; Statistical Package (three boxes); Graphing System; Graphics Polishing for Publication]

Other Very Important Resources

• Google
  • Almost everything you need to know
  • Most of it’s pretty accurate
• Java applets
  • Many stats applications can be found online that will run on any machine
• Open source code is on its way
  • R
  • Linux
• CSCAR
  • Sometimes more helpful than others

Who Will Not Be Helpful

MCIT

Questions? People and their Software

• Sue Stern: JMP (the SAS ‘pull-down’ package); repeated measures analysis of clinical data
• Bonnie Singal: SAS; pretty much any clinical statistical research question
• Matt Trowbridge: stats and GIS packages; merging complex data sets
• John Younger: SAS, Prism, R; kinetics, logistic and nonlinear models of complex behaviors