Data: Presentation and Description
Peter T. Donnan
Professor of Epidemiology and Biostatistics
Statistics for Health Research
Overview
• What is Data?
• Summarising data
• Displaying data
• SPSS
Why have you collected data?
• Most important question!
• Related to testing hypotheses
• If you have not got any hypotheses, get some!
• We will return to this later
DATA - Where from?
• All data is a sample: a subset of the population
• How was it collected?
• Potential for bias?
Extrapolating from the sample to the population
Illustrations Ian Christie, Orthopaedic & Trauma Surgery, Copyright
2002 University of Dundee
Quantitative Data?
• Observation or measurement of one or more variables
• A variable is any quantity measured on a scale
• Unit of analysis can be person, group (e.g. practice), specimen, time
• Multilevel: patient and practice
Cross-classified 3-level multilevel model
[Figure: levels shown are patient (level i), practice (level j) and hospital (k)]
Statistics
Statistics encompasses -
1. Design of study;
2. Methods of collecting and summarising data;
3. Analysing and drawing appropriate conclusions from data
Variable types
• Categorical (qualitative)
  - E.g. type of drug, eye colour, smoker
• Numerical (quantitative)
  - E.g. age, birth weight, BP
Categorical
Nominal: categories are mutually exclusive and unordered
  E.g. blood group type (A/B/AB/O)
Ordinal: categories are mutually exclusive and ordered
  E.g. disease stage (mild/moderate/severe)
Binary: two categories (yes, no)
Numerical
Discrete: integer values, often counts
  E.g. number of cigarettes smoked
Continuous: takes any value in a range of values
  E.g. height in cm, cholesterol
Organisation of data
Generally each variable in a separate column and one row per subject

Subject   Age   Gender   Score
1         28    1        15
2         56    2        11
3         43    1        22
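The layout above can be mirrored directly in code. The following is a minimal sketch in Python (the course itself uses SPSS, so the language and the lower-case field names are assumptions made for illustration); it holds the three subjects from the slide as one record per row and then pulls out a single variable as a column.

```python
# One row per subject, one field per variable (values copied from the slide).
rows = [
    {"subject": 1, "age": 28, "gender": 1, "score": 15},
    {"subject": 2, "age": 56, "gender": 2, "score": 11},
    {"subject": 3, "age": 43, "gender": 1, "score": 22},
]

# A "column" is the same variable read across every subject.
ages = [row["age"] for row in rows]
scores = [row["score"] for row in rows]

print(ages)
print(scores)
```

Keeping one row per subject means every summary later on (frequencies, means, plots) is just an operation on one column.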
1st step in analysis?
Look at the data!
Display and summarise data
• To get a feel for the data
• To spot errors and missing data
• To assess the range of values
• Also ...
Categorical data
 1. Campylobacter            21. Giardia
 2. Campylobacter            22. Cryptosporidium
 3. Escherichia coli O157    23. Cryptosporidium
 4. Shigella sonnei          24. Campylobacter
 5. Cryptosporidium          25. Shigella sonnei
 6. Giardia                  26. SRSV
 7. Cryptosporidium          27. Cryptosporidium
 8. Campylobacter            28. Campylobacter
 9. Campylobacter            29. Giardia
10. Cryptosporidium          30. Giardia
11. Giardia                  31. Escherichia coli O157
12. Shigella sonnei          32. Shigella sonnei
13. SRSV                     33. Cryptosporidium
14. Giardia                  34. SRSV
15. Escherichia coli O157    35. Campylobacter
16. Campylobacter            36. Campylobacter
17. Giardia                  37. Campylobacter
18. SRSV                     38. Giardia
19. Campylobacter            39. Escherichia coli O157
20. Cryptosporidium          40. Campylobacter
Infection                 N (%)
Campylobacter             12 (30.0)
Cryptosporidium           9 (22.5)
Giardia                   8 (20.0)
SRSV                      5 (12.5)
Escherichia coli O157     3 (7.5)
Shigella                  3 (7.5)
Total                     40 (100)
Summarised by frequencies or percentages
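As an illustration of how the percentage column is obtained, here is a small Python sketch (Python is an assumption; the slides use SPSS). It starts from the counts printed in the summary table and divides each by the total. Tallying the raw list of 40 cases by hand appears to give slightly different counts for some organisms, so the table is taken as given here.

```python
from collections import Counter

# Counts copied from the summary table on the slide.
counts = Counter({
    "Campylobacter": 12,
    "Cryptosporidium": 9,
    "Giardia": 8,
    "SRSV": 5,
    "Escherichia coli O157": 3,
    "Shigella": 3,
})
total = sum(counts.values())

# Percentage of all cases in each category, most frequent first.
percentages = {name: round(100 * n / total, 1) for name, n in counts.most_common()}

for name, pct in percentages.items():
    print(f"{name:<24}{counts[name]:>3} ({pct})")
```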
Numerical data
• A frequency distribution for a continuous variable is unfeasibly large
• Grouping may be necessary for presentation
Age group (years)   Frequency   Relative frequency (%)   Cumulative relative frequency (%)
0-4                 59          12.2                     12.2
5-9                 83          17.1                     29.3
10-14               94          19.4                     48.7
15-19               72          14.8                     63.5
20-24               61          12.6                     76.1
25-29               48          9.9                      86.0
30-34               36          7.4                      93.4
35-49               32          6.6                      100
Total               485         100
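The relative and cumulative columns follow mechanically from the frequency column. A minimal Python sketch (the language is an assumption, not part of the course) using the frequencies from the table:

```python
from itertools import accumulate

age_groups = ["0-4", "5-9", "10-14", "15-19", "20-24", "25-29", "30-34", "35-49"]
frequencies = [59, 83, 94, 72, 61, 48, 36, 32]
total = sum(frequencies)

# Relative frequency: each count as a percentage of the total.
relative = [round(100 * f / total, 1) for f in frequencies]

# Cumulative relative frequency: accumulate the raw counts first and convert
# afterwards, so that rounding errors do not build up down the column.
cumulative = [round(100 * c / total, 1) for c in accumulate(frequencies)]

for group, f, rel, cum in zip(age_groups, frequencies, relative, cumulative):
    print(f"{group:<8}{f:>5}{rel:>8}{cum:>8}")
```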
Frequency distribution for continuous variable

Baseline measure   N (%)
4.0                52 (3.1)
4.1                51 (3.0)
4.2                49 (2.9)
4.3                65 (3.9)
4.4                60 (3.6)
4.5                80 (4.8)
4.6                88 (5.2)
4.7                99 (5.9)
4.8                94 (5.6)
4.9                84 (5.0)
5.0                68 (4.1)
5.1                66 (3.9)
5.2                79 (4.7)
5.3                74 (4.4)
5.4                75 (4.5)
5.5                75 (4.5)
5.6                70 (4.2)
5.7                60 (3.6)

Baseline group     N (%)
4.0 to 4.4         277 (16.5)
4.5 to 4.9         445 (26.5)
5.0 to 5.4         362 (21.6)
5.5 to 5.9         340 (20.3)
6.0 to 6.9         253 (15.1)
Total              1677
Guide for grouping data
• Obtain min and max values
• Choose between 5 and 15 intervals
• Summarise but do not obscure the data, especially continuous data
• Intervals of equal width
  - Good but not essential
  - Remember to label tables!
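To make the guide concrete, the sketch below (Python, chosen here as an assumption) collapses the per-value baseline frequencies shown earlier into intervals of width 0.5. Only the values 4.0 to 5.4 are used, because the ungrouped table on the slide stops partway through the next interval.

```python
from collections import defaultdict

# Frequency of each recorded baseline value (4.0 to 5.4), from the ungrouped table.
value_counts = {
    4.0: 52, 4.1: 51, 4.2: 49, 4.3: 65, 4.4: 60,
    4.5: 80, 4.6: 88, 4.7: 99, 4.8: 94, 4.9: 84,
    5.0: 68, 5.1: 66, 5.2: 79, 5.3: 74, 5.4: 75,
}

grouped = defaultdict(int)
for value, count in value_counts.items():
    tenths = round(value * 10)   # integer tenths avoid floating-point edge cases
    lower = tenths // 5 * 5      # start of the 0.5-wide interval, in tenths
    label = f"{lower / 10:.1f} to {(lower + 4) / 10:.1f}"
    grouped[label] += count

for label, count in grouped.items():
    print(label, count)
```

The three group totals reproduce the first three rows of the grouped table (277, 445 and 362).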
Take care with missing values
• SPSS gives % missing in output if missing values are left blank in the data
• Be careful when reporting %: is it a percentage of observed values or a percentage of all subjects?
• These will differ!
• Can use a missing code (often 9) to make missing explicit in output
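The difference between the two denominators is easy to see with numbers. The following Python sketch uses entirely hypothetical data (100 subjects, 20 of them with a missing answer); neither the variable nor the counts come from the course datasets.

```python
# Hypothetical yes/no responses; None marks a missing value.
responses = ["yes"] * 40 + ["no"] * 40 + [None] * 20

observed = [r for r in responses if r is not None]
n_yes = observed.count("yes")

pct_of_observed = 100 * n_yes / len(observed)   # denominator excludes missing
pct_of_all = 100 * n_yes / len(responses)       # denominator includes missing

print(f"{n_yes} 'yes' answers: {pct_of_observed:.1f}% of observed, "
      f"{pct_of_all:.1f}% of all subjects")
```

Here the same 40 answers are 50% of observed values but 40% of all subjects, so a report should always say which denominator was used.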
Graphs
• Simplicity
• Consistency
• Not duplicating tables or text
• Remember title
• Remember to label axes
Graphs - Categorical data
• Bar charts
• Pie charts
Bar charts
• Used to display categorical (or discrete numerical) data
• One bar per category
• Height of bar equals its frequency
• Each bar same width and equally spaced
• Space between each bar
• Vertical axis must start at zero
Most common cancer deaths in UK, 2009
Plots and Statistics from CRUK website http://info.cancerresearchuk.org
Pie charts
• Displays one variable only
• Compare 2 groups using 2 charts
[Pie chart: BBC Sports Personality of the Year 2008. Segments: Hoy, Hamilton, Adlington, Ainslie, Calzaghe, Murray, Cooke, Ohuruogu, Wiggins, Romero]
But avoid 3-dimensional plots!
Graphs - Numerical data
• Histograms
• Frequency polygon
• Cumulative frequency polygon
• Scatter plots
• Box plots
Histograms
• Like bar charts but no spaces
• y axis always begins at zero
• Area of bar represents the frequency in each group
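Because area rather than height represents frequency, a group that is wider than the others needs a proportionally lower bar. A minimal Python sketch (language assumed for illustration) using the age-group table from earlier, where the last group spans 15 years and the rest span 5:

```python
# (lower bound, upper bound, frequency) for each age group, ages in whole years.
age_groups = [
    (0, 4, 59), (5, 9, 83), (10, 14, 94), (15, 19, 72),
    (20, 24, 61), (25, 29, 48), (30, 34, 36), (35, 49, 32),
]

densities = {}
for lower, upper, frequency in age_groups:
    width = upper - lower + 1          # number of years covered by the group
    density = frequency / width        # bar height: frequency per year of age
    densities[f"{lower}-{upper}"] = density
    print(f"{lower}-{upper}: width {width}, height {density:.2f}")
```

Height times width recovers the frequency for every bar. Plotting the raw count of 32 as the height of the 35-49 bar would make that group look three times more common than it is.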
Check data carefully
Florence Nightingale's 'Coxcomb' diagram of Mortality in the Crimean War
Summary measures - Numerical data
• Central location (average)
• Spread or variability (distance of each data point from the average)
Central Location
• Mean
• Median
• Mode - most frequent value
Mean
$$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_N}{N}$$
Often written as $\sum x_i / N$, where Sigma or $\sum$ is 'Sum of'
Example: 2.75 2.86 3.37 2.76 2.62 3.49 3.05 3.12
$$\bar{x} = \frac{24.02}{8} = 3.0025 \approx 3 \text{ litres}$$
Mean
• Advantages
  - Uses all data values
  - Very amenable to statistical analysis; most models deal with the mean
• Disadvantages (advantages to politicians and estate agents!)
  - Distorted by outliers
  - Distorted by skewed data
Median
Arrange values in increasing order
Median is the middle value (with an even number of values, the mean of the middle two)
2.62 2.75 2.76 [2.86 3.05] 3.12 3.37 3.49
$$\text{Median} = \frac{2.86 + 3.05}{2} = 2.955 \approx 2.96 \text{ litres}$$
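Both calculations can be checked with Python's standard `statistics` module (Python is an assumption here; SPSS reports the same quantities). The eight values are those on the slides; the variable name is illustrative.

```python
from statistics import mean, median

values_litres = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]

total = sum(values_litres)
average = mean(values_litres)     # sum of the values divided by how many there are
middle = median(values_litres)    # mean of the two middle values, since n is even

print(f"sum = {total:.2f}, mean = {average:.4f}, median = {middle:.3f}")
```

The mean comes out at 3.0025 litres and the median at 2.955 litres, which the slides round to 3 and 2.96.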
Median
• Advantages
  - Not distorted by outliers
  - Not distorted by skewed data
• Disadvantages
  - Ignores most of the information
  - Less amenable to statistical modelling
Measures of spread
17 24 29 36 [47 52] 66 67 81 94
Mean = 51.3   Median = 49.5
50 51 51 51 [51 51] 51 51 51 55
Mean = 51.3   Median = 51
Range
17 24 29 36 [47 52] 66 67 81 94
Range 17-94 or 77
50 51 51 51 [51 51] 51 51 51 55
Range 50-55 or 5
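The point of the two datasets is that they share a mean but differ greatly in spread. A short Python sketch (language and variable names are assumptions) that computes the mean, median and range for each:

```python
from statistics import mean, median

spread_out = [17, 24, 29, 36, 47, 52, 66, 67, 81, 94]
clustered = [50, 51, 51, 51, 51, 51, 51, 51, 51, 55]

summaries = {}
for name, data in (("spread_out", spread_out), ("clustered", clustered)):
    data_range = max(data) - min(data)   # largest value minus smallest value
    summaries[name] = (mean(data), median(data), data_range)
    print(name, summaries[name])
```

Both means are 51.3, yet the ranges are 77 and 5, so a measure of central location alone does not describe the data.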
Range from percentiles
• Data ordered from smallest to largest value
• Percentiles
• Deciles = data in equal 10ths
• Quartiles = data in equal 4ths
Interquartile range (IQR)
4 5 7 | 9 10 12 | 14 19 26 | 39 40 42
Lower quartile = 8 (midway between 7 and 9)
Upper quartile = 32.5 (midway between 26 and 39)
Interquartile range (IQR) = 32.5 - 8 = 24.5
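The slide finds the quartiles by splitting the ordered data into equal parts and taking the midpoint at each boundary. One way to reproduce that, sketched here in Python as an assumption, is to take the median of the lower half and the median of the upper half. Software packages use several different quartile conventions, so other tools may report slightly different values for the same data.

```python
from statistics import median

values = sorted([4, 5, 7, 9, 10, 12, 14, 19, 26, 39, 40, 42])
half = len(values) // 2

# Median-of-halves convention: for an odd n the middle value is left out of both halves.
lower_quartile = median(values[:half])
upper_quartile = median(values[len(values) - half:])
iqr = upper_quartile - lower_quartile

print(lower_quartile, upper_quartile, iqr)
```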
[Figure: Multiple box-plots of age in years (axis 10 to 80) by job satisfaction (Highly satisfied, Somewhat satisfied, Neutral, Somewhat dissatisfied, Highly dissatisfied), annotated with median, range, IQR, upper quartile, lower quartile and outliers]
Distribution of data values around the mean
17 24 29 36 47 [mean 51.3] 52 66 67 81 94
50 51 51 51 51 [mean 51.3] 51 51 51 51 55
Variance
17 24 29 36 47 52    mean $\bar{x}$ = 34.17 years

x       $x - \bar{x}$    $(x - \bar{x})^2$
17      -17.17           294.69
24      -10.17           103.36
29      -5.17            26.69
36      1.83             3.36
47      12.83            164.69
52      17.83            318.03
Total   0                910.83
Variance ($s^2$)
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
$$s^2 = \frac{910.83}{5} = 182.17$$
17 24 29 36 47 52
Mean = 34.17 years
Variance = 182.2
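The variance calculation can be followed step by step in a few lines of Python (an assumption; SPSS performs the same arithmetic internally). Working from unrounded values gives a mean of 34.1667, a sum of squares of 910.83 and a variance of 182.17, so hand calculations with rounded intermediate figures may differ in the last decimal place.

```python
ages = [17, 24, 29, 36, 47, 52]
n = len(ages)
mean_age = sum(ages) / n

# Step 1: deviation of each value from the mean (these always sum to zero).
deviations = [x - mean_age for x in ages]

# Step 2: square the deviations so negatives and positives do not cancel.
squared = [d ** 2 for d in deviations]

# Step 3: divide the sum of squares by n - 1 (sample variance).
sum_of_squares = sum(squared)
variance = sum_of_squares / (n - 1)

print(f"mean = {mean_age:.2f}, sum of squares = {sum_of_squares:.2f}, "
      f"variance = {variance:.2f}")
```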
Standard deviation (s)
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
$$s = \sqrt{182.17} = 13.50$$
17 24 29 36 47 52
Mean = 34.17 years
SD = 13.50 years
Coefficient of Variation (CV) = SD / Mean = 13.50 / 34.17 = 0.395
A measure of variability that allows comparison across different scales
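To illustrate why the CV suits comparisons across scales, the Python sketch below (language assumed; the conversion to months is a made-up example) computes the SD and CV of the six ages in years and again in months. The SD changes with the unit, while the CV does not.

```python
from math import isclose
from statistics import mean, stdev

ages_years = [17, 24, 29, 36, 47, 52]
ages_months = [12 * a for a in ages_years]   # same data on a different scale

sd_years = stdev(ages_years)                 # sample SD, divisor n - 1
sd_months = stdev(ages_months)

cv_years = sd_years / mean(ages_years)
cv_months = sd_months / mean(ages_months)

print(f"SD: {sd_years:.2f} years vs {sd_months:.2f} months")
print(f"CV: {cv_years:.3f} vs {cv_months:.3f}")
```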
What central measure goes with what measure of spread?
• Mean (SD)
• Median (IQR)
Summary
• Do not underestimate the value of looking at the data
• Gives a feel for the data before testing or modelling
• Check for missing data
• Check for outliers
From Jan 2010 IBM acquired copyright for SPSS
Implementing Kaplan-Meier in SPSS
From Colorectal.sav you need to specify:
• Survival time: time from surgery (tfsurg)
• Status: Dead = 1, censored = 0 (dead)
• Factor: e.g. hypertension comorbidity (hyperco)
• Select plot of survival
Implementing Kaplan-Meier plot in SPSS
Select options to obtain plot and median survival
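For readers who want to see what the procedure estimates, here is a rough Python sketch of the Kaplan-Meier (product-limit) calculation. It is not SPSS's implementation. The variable names `tfsurg` and `dead` are borrowed from the slide, but the six observations are hypothetical and are not taken from Colorectal.sav; the function name is also made up for this example.

```python
def kaplan_meier(times, events):
    """Return (time, survival probability) at each distinct death time."""
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        # Subjects still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
    return curve


# Hypothetical data, NOT from Colorectal.sav.
tfsurg = [2, 3, 3, 5, 8, 9]   # time from surgery
dead = [1, 1, 0, 1, 0, 1]     # 1 = dead, 0 = censored

curve = kaplan_meier(tfsurg, dead)

# Median survival: first time at which estimated survival is 0.5 or lower.
median_survival = next((t for t, s in curve if s <= 0.5), None)

for t, s in curve:
    print(f"time {t}: S(t) = {s:.4f}")
print("median survival:", median_survival)
```

Censored subjects contribute to the number at risk up to the time they leave the study but never count as deaths, which is why the curve only steps down at death times.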
Survival curves for women with glioma by diagnosis.
Bland J M , Altman D G BMJ 2004;328:1073
Practical
Read LDL.sav or colorectal.sav into SPSS (19) and explore the different types of data using appropriate tables and graphs
Data available at MyDundee https://my.dundee.ac.uk/webapps/cmsmain/webui/_xy-2283598_4-t_AueCBgz2 or DEBU website (https://medicine.dundee.ac.uk/dundee-epidemiology-and-biostatistics-unit-debu)