
Data: Presentation and Description

Peter T. Donnan

Professor of Epidemiology and Biostatistics

Statistics for Health Research

Overview

• What is Data?
• Summarising data
• Displaying data
• SPSS

Why have you collected data?

• Most important question!
• Related to testing hypotheses
• If you have not got any hypotheses – Get some!
• Return to this later

DATA – Where from?

• All data is a sample – a subset of the population
• How was it collected?
• Potential for bias?

Extrapolating from the sample to the population

Illustrations: Ian Christie, Orthopaedic & Trauma Surgery, Copyright 2002 University of Dundee

Quantitative Data?

• Observation or measurement of one or more variables
• A variable is any quantity measured on a scale
• Unit of analysis can be person, group (e.g. practice), specimen, time
• Multilevel – patient and practice

Cross-classified 3 level multilevel model

[Figure: diagram with patient level (i), practice level (j) and hospital (k)]

Statistics

Statistics encompasses -

1. Design of study;
2. Methods of collecting and summarising data;
3. Analysing and drawing appropriate conclusions from data

Variable types

• Categorical (qualitative)
  – E.g. type of drug, eye colour, smoker
• Numerical (quantitative)
  – E.g. age, birth weight, BP

Categorical

• Nominal: categories are mutually exclusive and unordered
  – E.g. blood group type (A/B/AB/O)
• Ordinal: categories are mutually exclusive and ordered
  – E.g. disease stage (mild/moderate/severe)
• Binary: two categories (yes, no)

Numerical

• Discrete: integer values, often counts
  – E.g. number of cigarettes smoked
• Continuous: takes any value in a range of values
  – E.g. height in cm, cholesterol

Organisation of data

Generally each variable is in a separate column, with one row per subject

Subject   Age   Gender   Score
1         28    1        15
2         56    2        11
3         43    1        22
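The layout above (one row per subject, one column per variable) can be sketched in code. This is only an illustration: the CSV text is typed in from the slide's three example subjects, and the lower-case column names are assumptions made for the example.

```python
import csv
import io

# The slide's three example subjects, one row per subject.
# Column names are lower-cased here purely for convenience (an assumption).
raw = """subject,age,gender,score
1,28,1,15
2,56,2,11
3,43,1,22
"""

# Each row becomes a dict keyed by variable name, with values converted to int.
rows = [{name: int(value) for name, value in row.items()}
        for row in csv.DictReader(io.StringIO(raw))]

ages = [row["age"] for row in rows]   # one column = one variable
first_subject = rows[0]               # one row = one subject

print(ages)
print(first_subject)
```

Keeping one row per subject means any single variable can be pulled out as a column for summarising, which is what most statistics packages expect.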

1st step in analysis?

Look at the data!

Display and summarise data

• To get a feel for the data
• To spot errors and missing data
• To assess the range of values
• Also ..

Categorical data

 1. Campylobacter             21. Giardia
 2. Campylobacter             22. Cryptosporidium
 3. Escherichia coli O157     23. Cryptosporidium
 4. Shigella sonnei           24. Campylobacter
 5. Cryptosporidium           25. Shigella sonnei
 6. Giardia                   26. SRSV
 7. Cryptosporidium           27. Cryptosporidium
 8. Campylobacter             28. Campylobacter
 9. Campylobacter             29. Giardia
10. Cryptosporidium           30. Giardia
11. Giardia                   31. Escherichia coli O157
12. Shigella sonnei           32. Shigella sonnei
13. SRSV                      33. Cryptosporidium
14. Giardia                   34. SRSV
15. Escherichia coli O157     35. Campylobacter
16. Campylobacter             36. Campylobacter
17. Giardia                   37. Campylobacter
18. SRSV                      38. Giardia
19. Campylobacter             39. Escherichia coli O157
20. Cryptosporidium           40. Campylobacter

Infection                N (%)
Campylobacter            12 (30.0)
Cryptosporidium          9 (22.5)
Giardia                  8 (20.0)
SRSV                     5 (12.5)
Escherichia coli O157    3 (7.5)
Shigella                 3 (7.5)
Total                    40 (100)

Summarised by frequencies or percentages
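A minimal sketch of how such an N (%) summary can be produced from raw categorical values. To keep it short it uses only the first ten cases from the listing (with spellings normalised); `frequency_table` is a hypothetical helper name, not part of any package.

```python
from collections import Counter

# First ten cases from the listing above (spellings normalised).
cases = [
    "Campylobacter", "Campylobacter", "Escherichia coli O157",
    "Shigella sonnei", "Cryptosporidium", "Giardia",
    "Cryptosporidium", "Campylobacter", "Campylobacter",
    "Cryptosporidium",
]

def frequency_table(values):
    """Return (category, count, percent) rows, most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(category, n, round(100 * n / total, 1))
            for category, n in counts.most_common()]

table = frequency_table(cases)
for category, n, pct in table:
    print(f"{category:<24}{n:>3} ({pct})")
```

Sorting by frequency, as the slide's table does, makes the most common category easy to spot.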

Numerical data

• Frequency distributions for a continuous variable can be unfeasibly large
• Grouping may be necessary for presentation

Age group (years)   Frequency   Relative frequency (%)   Cumulative relative frequency (%)
0-4                 59          12.2                     12.2
5-9                 83          17.1                     29.3
10-14               94          19.4                     48.7
15-19               72          14.8                     63.5
20-24               61          12.6                     76.1
25-29               48          9.9                      86.0
30-34               36          7.4                      93.4
35-49               32          6.6                      100
Total               485         100
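The relative and cumulative relative frequencies in this table can be reproduced from the counts alone. The sketch below uses the frequencies from the table; the variable names are illustrative.

```python
from itertools import accumulate

age_groups = ["0-4", "5-9", "10-14", "15-19", "20-24", "25-29", "30-34", "35-49"]
frequencies = [59, 83, 94, 72, 61, 48, 36, 32]

total = sum(frequencies)

# Relative frequency: each count as a percentage of the total.
relative = [round(100 * f / total, 1) for f in frequencies]

# Cumulative relative frequency: running total of counts as a percentage.
# Accumulating the counts (not the rounded percentages) avoids rounding drift.
cumulative = [round(100 * c / total, 1) for c in accumulate(frequencies)]

for group, f, rel, cum in zip(age_groups, frequencies, relative, cumulative):
    print(f"{group:<8}{f:>5}{rel:>8}{cum:>8}")
```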

Frequency distribution for continuous variable

Baseline measure   N (%)
4.0                52 (3.1)
4.1                51 (3.0)
4.2                49 (2.9)
4.3                65 (3.9)
4.4                60 (3.6)
4.5                80 (4.8)
4.6                88 (5.2)
4.7                99 (5.9)
4.8                94 (5.6)
4.9                84 (5.0)
5.0                68 (4.1)
5.1                66 (3.9)
5.2                79 (4.7)
5.3                74 (4.4)
5.4                75 (4.5)
5.5                75 (4.5)
5.6                70 (4.2)
5.7                60 (3.6)

Baseline group   N (%)
4.0 to 4.4       277 (16.5)
4.5 to 4.9       445 (26.5)
5.0 to 5.4       362 (21.6)
5.5 to 5.9       340 (20.3)
6.0 to 6.9       253 (15.1)
Total            1677
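As a rough sketch of the grouping step, the counts for the individual values 4.0 to 5.4 can be collapsed into intervals of width 0.5. Only those values are used because the ungrouped listing stops at 5.7; holding the values as integer tenths is an implementation choice made here to avoid floating-point edge cases.

```python
# Counts for baseline values 4.0, 4.1, ..., 5.4, copied from the ungrouped table.
counts = [52, 51, 49, 65, 60, 80, 88, 99, 94, 84, 68, 66, 79, 74, 75]

grouped = {}
for tenths, n in zip(range(40, 55), counts):   # 40 means 4.0, 54 means 5.4
    lower = (tenths // 5) * 5                  # start of the 0.5-wide interval
    label = f"{lower / 10:.1f} to {(lower + 4) / 10:.1f}"
    grouped[label] = grouped.get(label, 0) + n

for label, n in grouped.items():
    print(f"{label:<12}{n:>5}")
```

The three totals agree with the first three rows of the grouped table.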

Guide for grouping data

• Obtain min and max values
• Choose between 5 and 15 intervals
• Summarise but do not obscure the data, especially continuous data
• Intervals of equal width
  – Good but not essential
  – Remember to label tables!

Take care with missing values

• SPSS gives % missing in output if missing values are left blank in the data
• Be careful to report whether a % is a percentage of observed values or a percentage of all subjects
• These will differ!
• Can use a missing code (often 9) to make missing explicit in output
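To see how the two percentages differ, here is a small sketch with made-up data: ten hypothetical yes/no responses, two of them missing (`None`). None of these values come from the course datasets.

```python
# Hypothetical responses; None marks a missing value.
responses = ["yes", "no", None, "no", "yes", None, "no", "no", "yes", "no"]

observed = [r for r in responses if r is not None]
n_yes = observed.count("yes")

pct_of_observed = 100 * n_yes / len(observed)    # denominator: non-missing only
pct_of_all = 100 * n_yes / len(responses)        # denominator: every subject
pct_missing = 100 * (len(responses) - len(observed)) / len(responses)

print(f"yes as % of observed values: {pct_of_observed:.1f}")
print(f"yes as % of all subjects:    {pct_of_all:.1f}")
print(f"% missing:                   {pct_missing:.1f}")
```

Here the same three "yes" answers are 37.5% of observed values but 30.0% of all subjects, so the denominator should always be stated.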

Graphs

• Simplicity
• Consistency
• Not duplicating tables or text
• Remember the title
• Remember to label axes

Graphs – Categorical data

• Bar charts
• Pie charts

Bar charts

• Used to display categorical (or discrete numerical) data
• One bar per category
• Height of bar equals its frequency
• Each bar same width and equally spaced
• Space between each bar
• Vertical axis must start at zero

Most common cancer deaths in UK, 2009

Plots and Statistics from CRUK website http://info.cancerresearchuk.org

Pie charts

• Displays one variable only
• Compare 2 groups using 2 charts

[Pie chart: BBC Sports Personality of the Year 2008; legend: Hoy, Hamilton, Adlington, Ainslie, Calzaghe, Murray, Cooke, Ohuruogu, Wiggins, Romero]

But avoid 3-dimensional plots!

Graphs – Numerical data

• Histograms
• Frequency polygon
• Cumulative frequency polygon
• Scatter plots
• Box plots

Histograms

• Like bar charts but no spaces
• y axis always begins at zero
• Area of bar represents the frequency in each group

Check data carefully

Florence Nightingale’s ‘Coxcomb’ diagram of Mortality in the Crimean War

Summary measures – Numerical data

• Central Location (average)
• Spread or variability (distance of each data point from the average)

Central Location

• Mean
• Median
• Mode - most frequent value

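Mean

$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$

Often written as $\bar{x} = \frac{\sum x_i}{n}$, where Sigma or $\sum$ is ‘Sum of’ and $n$ is the number of observations

Example: 2.75 2.86 3.37 2.76 2.62 3.49 3.05 3.12

$\bar{x} = \frac{24.02}{8} \approx 3.00$ litres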
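The worked example can be checked in a few lines. This is just the arithmetic from the slide; the variable names are illustrative.

```python
from math import fsum

# The eight volumes (litres) from the worked example.
volumes = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]

total = fsum(volumes)          # sum of all values
mean = total / len(volumes)    # divide by the number of observations

print(f"sum = {total:.2f}, n = {len(volumes)}, mean = {mean:.2f} litres")
```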

Mean

• Advantages
  – Uses all data values
  – Very amenable to statistical analysis; most models deal with the mean
• Disadvantages (advantages to politicians and estate agents!)
  – Distorted by outliers
  – Distorted by skewed data

Median

Arrange values in increasing order

Median is the middle value

2.62 2.75 2.76 [2.86 3.05] 3.12 3.37 3.49

Median = (2.86 + 3.05) / 2 = 2.955 ≈ 2.96 litres
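A short sketch of the same calculation: sort the values, then average the two middle ones because the number of observations is even. The standard library's `statistics.median` is shown alongside as a cross-check.

```python
from statistics import median

volumes = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]

ordered = sorted(volumes)
mid = len(ordered) // 2

# Even number of values: average the two middle values.
manual_median = (ordered[mid - 1] + ordered[mid]) / 2

print(ordered)
print(manual_median, median(volumes))
```

The result is 2.955 litres up to floating-point precision, which the slide reports as 2.96.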

Median

• Advantages
  – Not distorted by outliers
  – Not distorted by skewed data
• Disadvantages
  – Ignores most of the information
  – Less amenable to statistical modelling

Measures of spread

17 24 29 36 [47 52] 66 67 81 94

Mean = 51.3 Median = 49.5

50 51 51 51 [51 51] 51 51 51 55

Mean = 51.3 Median = 51

Range

17 24 29 36 [47 52] 66 67 81 94

Range 17-94 or 77

50 51 51 51 [51 51] 51 51 51 55

Range 50-55 or 5
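The two example datasets share the same mean but differ greatly in spread. A minimal sketch that summarises both; `summarise` is a hypothetical helper name.

```python
from statistics import mean, median

wide = [17, 24, 29, 36, 47, 52, 66, 67, 81, 94]
narrow = [50, 51, 51, 51, 51, 51, 51, 51, 51, 55]

def summarise(values):
    """Return the mean, median and range (max minus min) of the values."""
    return {
        "mean": mean(values),
        "median": median(values),
        "range": max(values) - min(values),
    }

wide_summary = summarise(wide)
narrow_summary = summarise(narrow)
print(wide_summary)
print(narrow_summary)
```

Both means are 51.3, yet the ranges are 77 and 5, which is why a measure of location should be reported together with a measure of spread.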

Range from percentiles

• Data ordered from smallest to largest value
• Percentiles
• Deciles = data in equal 10ths
• Quartiles = data in equal 4ths

Interquartile range (IQR)

4 5 7 | 9 10 12 | 14 19 26 | 39 40 42

Lower quartile = 8, upper quartile = 32.5

Interquartile range (IQR) = 32.5 - 8 = 24.5
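The quartiles above can be reproduced by taking the median of the lower half and of the upper half of the ordered data. This is one convention among several; statistical packages use a range of quantile definitions and may return slightly different quartiles for the same data. `quartiles_by_halves` is a hypothetical helper name.

```python
from statistics import median

values = [4, 5, 7, 9, 10, 12, 14, 19, 26, 39, 40, 42]

def quartiles_by_halves(data):
    """Lower and upper quartile as the medians of the two halves.

    Assumes at least two values; with an odd count the middle
    value is left out of both halves.
    """
    ordered = sorted(data)
    n = len(ordered)
    lower_half = ordered[: n // 2]
    upper_half = ordered[(n + 1) // 2:]
    return median(lower_half), median(upper_half)

q1, q3 = quartiles_by_halves(values)
iqr = q3 - q1
print(q1, q3, iqr)
```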

Multiple Box-plots

[Figure: box plots of age in years (y axis, 10 to 80) by job satisfaction (Highly satisfied, Somewhat satisfied, Neutral, Somewhat dissatisfied, Highly dissatisfied), annotated with median, range, IQR, upper quartile, lower quartile and outliers]

Distribution of data values around the mean

17 24 29 36 47 52 66 67 81 94 (mean = 51.3)

50 51 51 51 51 51 51 51 51 55 (mean = 51.3)

Variance

17 24 29 36 47 52, mean = 34.17 years

x      $x - \bar{x}$   $(x - \bar{x})^2$
17     -17.17          294.69
24     -10.17          103.36
29     -5.17           26.69
36     1.83            3.36
47     12.83           164.69
52     17.83           318.03
Sum    0               910.83

Variance ($s^2$)

$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

$s^2 = \frac{910.83}{5} = 182.17$

17 24 29 36 47 52

Mean = 34.17 years

Variance = 182.2
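The variance calculation can be followed step by step in code: deviations from the mean, their squares, and the sum of squares divided by n - 1. The standard library's `statistics.variance` (which also divides by n - 1) is used as a cross-check. Values agree with the slide up to rounding.

```python
from statistics import mean, variance

ages = [17, 24, 29, 36, 47, 52]

x_bar = mean(ages)                                # about 34.17
deviations = [x - x_bar for x in ages]            # these sum to zero
squared = [d ** 2 for d in deviations]
sum_of_squares = sum(squared)                     # about 910.83

s2 = sum_of_squares / (len(ages) - 1)             # sample variance, n - 1 denominator

print(f"mean = {x_bar:.2f}")
print(f"sum of squared deviations = {sum_of_squares:.2f}")
print(f"variance = {s2:.2f} (library: {variance(ages):.2f})")
```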

Standard deviation (s)

$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$

$s = \sqrt{182.17} = 13.50$

17 24 29 36 47 52

Mean = 34.17 years, SD = 13.50

Coefficient of Variation (CV) = SD / Mean = 13.50 / 34.17 ≈ 0.395

A measure of variability for comparing data measured on different scales
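A brief sketch of the standard deviation and coefficient of variation for the same six ages. The conversion to months is an added illustration (not from the slides) of why the CV suits comparisons across scales: rescaling the data changes the SD but leaves the CV unchanged.

```python
from statistics import mean, stdev

ages_years = [17, 24, 29, 36, 47, 52]
ages_months = [12 * a for a in ages_years]   # same data on a different scale

sd_years = stdev(ages_years)                 # sample SD (n - 1 denominator)
cv_years = sd_years / mean(ages_years)

sd_months = stdev(ages_months)
cv_months = sd_months / mean(ages_months)

print(f"SD = {sd_years:.2f} years, CV = {cv_years:.3f}")
print(f"SD = {sd_months:.2f} months, CV = {cv_months:.3f}")
```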

What central measure goes with what measure of spread?

• Mean (SD)
• Median (IQR)

Summary

• Do not underestimate the value of looking at the data
• Gives a feel for the data before testing or modelling
• Check for missing data
• Check for outliers

From Jan 2010 IBM acquired copyright for SPSS

http://www.spss.com/software/statistics/stats-standard/

http://www.spss.com/software/statistics/whats-new/

http://en.wikipedia.org/wiki/File:SPSS_An_IBM_Company_logo.svg

Implementing Kaplan-Meier in SPSS

From Colorectal.sav you need to specify:

• Survival time – time from surgery (tfsurg)
• Status – Dead = 1, censored = 0 (dead)
• Factor – e.g. hypertension comorbidity (hyperco)
• Select plot of survival

Implementing Kaplan-Meier plot in SPSS

Select options to obtain plot and median survival
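For readers curious about what the procedure computes, here is a minimal sketch of the Kaplan-Meier (product-limit) estimate. It is not SPSS's implementation, and the six survival times are hypothetical rather than taken from Colorectal.sav. Status uses the same coding as above (1 = dead, 0 = censored); `kaplan_meier` is an illustrative function name.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct death time.

    events: 1 = death observed, 0 = censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather every subject whose time equals t (deaths and censorings).
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            # Multiply by the proportion of those at risk who survive past t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve

# Hypothetical data: survival times and status (1 = dead, 0 = censored).
times = [2, 3, 3, 5, 8, 9]
events = [1, 1, 0, 1, 0, 1]

curve = kaplan_meier(times, events)

# Median survival: first time at which estimated survival is 0.5 or below.
median_survival = next((t for t, s in curve if s <= 0.5), None)

for t, s in curve:
    print(f"time {t}: S(t) = {s:.3f}")
print("median survival:", median_survival)
```

To compare groups by a factor such as hypertension comorbidity, the same function could be applied separately to the subjects in each group.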

Survival curves for women with glioma by diagnosis.

Bland J M , Altman D G BMJ 2004;328:1073

Practical

Read LDL.sav or colorectal.sav into SPSS (19) and explore the different types of data using appropriate tables and graphs

Data available at MyDundee https://my.dundee.ac.uk/webapps/cmsmain/webui/_xy-2283598_4-t_AueCBgz2 or the DEBU website (https://medicine.dundee.ac.uk/dundee-epidemiology-and-biostatistics-unit-debu)
