
INTRODUCTORY STATISTICS

March 2008

Francesca Little

•INTRODUCTION : What is Statistics?

•TYPES OF DATA

•GRAPHICAL METHODS FOR DISPLAYING DATA

•SUMMARIZING DATA

•STATISTICAL INFERENCE

•COMPARING GROUPS

Continuous data – Parametric

Non-parametric

Categorical data

•MEASURES OF DISEASE FREQUENCY AND EFFECT

•SAMPLE SIZE ESTIMATION

•ESTIMATION VERSUS HYPOTHESIS TESTING

INTRODUCTION:

Statistics = the art of decision making

Collection of data

Organization of data

Summarizing and displaying data

Analysis of data

Interpretation

References :

1. Pagano, M. & Gauvreau, K. Principles of Biostatistics. 2nd edition, Duxbury, 2000.

2. Connor, J. T. The Value of a p-Valueless Paper. Am J Gastroenterol 2004;99:1638-1640.

3. Armitage, P. & Berry, G. Statistical Methods in Medical Research. 2nd edition, Blackwell, 1987.

4. Fisher, L. D. & Van Belle, G. Biostatistics: A Methodology for the Health Sciences. Wiley, 1993.

5. Beagle et al. Introduction to Epidemiology. WHO, 1993.


EXAMPLE : Randomised Clinical Trial to investigate and compare the efficacy

of two treatment regimes in the treatment of malaria

Data include information on

•Patient demographics

•Baseline diagnostics

•Treatment and Outcome

•Kinetics

Statistical Software

•SPlus

•Stata

•Excel

•Statistica

DEMOGRAPHIC AND BASELINE DIAGNOSTIC DATA

+------------------------------------------------------------------+

| subject site age gender weight feverhis pardens0 |

|------------------------------------------------------------------|

1. | MOC001 SiteA 2 M 10.7 Y 46 |

2. | MOC002 SiteA 2 F 10.7 Y 523 |

3. | MOC003 SiteA 13 F 31.5 Y 174 |

4. | MOC004 SiteA 6 F 20 Y 152 |

5. | MOC006 SiteA 3 F 11.7 Y 372 |

|------------------------------------------------------------------|

6. | MOC009 SiteA 31 M 55 Y 239 |

7. | MOC010 SiteA 2 M 10.6 Y 30595 |

8. | MOC011 SiteA 12 M 33.6 Y 217 |

9. | MOC012 SiteA 2 M 11.6 Y 5338 |

10. | MOC013 SiteA 4 F 15 Y 78 |

|------------------------------------------------------------------|

11. | MOC016 SiteA 14 F 33.1 Y 94 |

12. | MOC017 SiteA 4 F 12.4 Y 108 |

13. | MOC018 SiteA 14 F 47 N 251 |

14. | MOC019 SiteA 6 M 14.5 N 502 |

15. | MOC020 SiteA 12 F 20.3 N 3193 |

|------------------------------------------------------------------|

16. | MOC021 SiteA 12 M 20.5 N 284 |

17. | MOC022 SiteA 11 M 21.2 N 1045 |

18. | MOC023 SiteA 7 F 18.1 N 4821 |

19. | MOC024 SiteA 6 M 13.4 N 347 |

20. | MOC025 SiteA 6 F 19.1 Y 522 |

|------------------------------------------------------------------|

21. | MOC026 SiteA 5 M 15.1 Y 3499 |

22. | MOC027 SiteA 6 F 16.4 Y 614 |

23. | MOC028 SiteA 8 F 25.6 N 153 |

24. | MOC029 SiteA 14 M 36.6 N 325 |

25. | MOC030 SiteA 9 F 28.1 N 536 |

|------------------------------------------------------------------|


TREATMENT AND OUTCOME DATA

+---------------------------------------------------------------+

| subject A1dose Bdose A2dose trt outcome |

|---------------------------------------------------------------|

1. | MOC001 93.45795 0 4.672897 trtA ACPR |

2. | MOC002 46.72897 0 2.336449 trtA ACPR |

3. | MOC003 31.74603 0 1.587302 trtA ACPR |

4. | MOC004 25 0 1.25 trtA ACPR |

5. | MOC006 42.73504 6.410256 2.136752 trtB ACPR |

|---------------------------------------------------------------|

6. | MOC009 27.27273 5.454545 1.363636 trtB LTFU |

7. | MOC010 94.33962 9.433962 4.716981 trtB ACPR |

8. | MOC011 29.76191 4.464286 1.488095 trtB ACPR |

9. | MOC012 43.10345 6.465517 2.155172 trtB ACPR |

10. | MOC013 33.33333 0 1.666667 trtA ACPR |

|---------------------------------------------------------------|

11. | MOC016 45.31722 9.063444 2.265861 trtB ACPR |

12. | MOC017 40.32258 6.048387 2.016129 trtB ACPR |

13. | MOC018 31.91489 0 1.595745 trtA ACPR |

14. | MOC019 34.48276 5.172414 1.724138 trtB ACPR |

15. | MOC020 49.26109 7.389163 2.463054 trtB ACPR |

|---------------------------------------------------------------|

16. | MOC021 48.78049 0 2.439024 trtA ACPR |

17. | MOC022 47.16981 7.075471 2.35849 trtB ACPR |

18. | MOC023 55.24862 8.287292 2.762431 trtB ACPR |

19. | MOC024 37.31343 5.597015 1.865672 trtB ACPR |

20. | MOC025 52.35602 0 2.617801 trtA ACPR |

|---------------------------------------------------------------|

21. | MOC026 33.11258 0 1.655629 trtA ACPR |

22. | MOC027 30.48781 0 1.52439 trtA ACPR |

23. | MOC028 39.0625 5.859375 1.953125 trtB ACPR |

24. | MOC029 40.98361 8.196722 2.049181 trtB ACPR |

25. | MOC030 35.58719 0 1.779359 trtA ACPR |

|---------------------------------------------------------------|

trtA = two drugs, A1 and A2; trtB = trtA + a third drug, B

PK DATA

+---------------------------------------------------------------------+

| subject AUCall_A1 Cmax_A1 Tmax_A1 AUCall_A2 Tmax_A2 Cmax_A2 |

|---------------------------------------------------------------------|

1. | MOC001 1481.1 123.655 1 1323.425 1 386.898 |

2. | MOC002 694.0709 106.517 2 4619.423 2 404.834 |

3. | MOC003 1019.58 103.838 1 5512.589 1 433.755 |

4. | MOC004 742.2638 75.3804 1 3275.157 0 312.978 |

5. | MOC006 548.6373 72.0582 1 1907.779 1 284.006 |

|---------------------------------------------------------------------|

6. | MOC009 745.3734 69.0091 2 1414.887 1 270.758 |

7. | MOC010 626.12 95.014 1 2166.401 1 469.149 |

8. | MOC011 906.2881 86.4171 1 2749.568 1 481.745 |

9. | MOC012 734.5026 70.785 2 1980.176 2 282.227 |

10. | MOC013 855.3212 83.2393 3 1942.275 1 340.038 |

|---------------------------------------------------------------------|

11. | MOC016 976.5257 94.5215 1 2886.02 1 625.695 |

12. | MOC017 531.5585 78.3498 1 1050.526 1 198.951 |

13. | MOC018 899.3121 105.459 1 1608.391 1 342.87 |

14. | MOC019 762.8975 111.888 1 1402.796 1 356.158 |

15. | MOC020 884.7603 111.346 2 2041.504 2 434.162 |

|---------------------------------------------------------------------|

16. | MOC021 614.1311 109.219 1 1791.219 1 493.847 |

17. | MOC022 745.5765 88.314 1 2049.41 1 449.877 |

18. | MOC023 34.721 6.4265 1 113.2341 1 18.1723 |

19. | MOC024 643.1728 64.3864 2 1281.929 2 196.15 |

20. | MOC025 1202.952 124.234 1 3983.165 1 647.307 |

|---------------------------------------------------------------------|

21. | MOC026 615.6737 75.757 1 1901.963 1 264.39 |

22. | MOC027 1199.341 139.315 1 3936.105 1 591.919 |

23. | MOC028 1317.269 127.166 1 3276.55 1 582.077 |

24. | MOC029 1158.077 105.017 1 3078.129 1 388.982 |

25. | MOC030 1078.056 94 1 2668.494 1 509.955 |

|---------------------------------------------------------------------|


TYPES OF DATA

NOMINAL, where the values fall into unordered categories

e.g., site = SiteA or SiteB

gender = male or female

outcome = ACPR or LTFU or ETF or LPF or LCF

ORDINAL, where order is important: the values in one category are in some way less or worse than the values in another category

e.g., severity of symptoms = none, mild, moderate, severe

DISCRETE, where both order and magnitude are important but the variable can take on only isolated values, usually integers, that differ by fixed amounts

e.g., number of children for one woman

number of organisms in a sample

CONTINUOUS, where the data represent measurable quantities that can, in principle, be measured to any degree of accuracy

e.g., parasite density on day 0

Area under the curve for Trt A1

Age

GRAPHS

BAR CHARTS to display nominal or ordinal data

[Figure: bar chart of outcome (ACPR, LTFU, ETF, LPF, LCF) by frequency; bar labels 106, 12, 2, 5, 3]

[Figure: bar chart of outcome (ACPR, LTFU, ETF, LPF, LCF) by percent; bar labels 82.81, 9.375, 1.563, 3.906, 2.344]

The heights of the vertical bars show either the frequency or relative

frequency (percent) of observations in each class.


HISTOGRAMS illustrate frequency distributions for discrete or continuous data

Take care when choosing the width of the bins, i.e., the intervals on the X-axis.

Remember that it is the area of each bin that illustrates the frequency.

[Figure: histogram of Cmax_A1; x-axis Cmax_A1 (0 to 200), y-axis Frequency (0 to 20)]

ILLUSTRATING THE RELATIONSHIP BETWEEN TWO VARIABLES

Two categorical variables:

Bar graphs of one variable by the other variable

[Figure: bar chart of count of patno by treatment (TrtA, TrtB), with separate bars for SiteA and SiteB; y-axis 0 to 70]

[Figure: bar chart of count of patno by treatment (TrtA, TrtB), with separate bars for each outcome (ACPR, LTFU, ETF, LPF, LCF); y-axis 0 to 70]

Two continuous variables – Scatter plots

The basic graphical technique for the two-variable situation is the scatter

diagram. In general the data refer to a number of individuals, each of which

provides observations on two variables. In the scatter diagram each variable is

allotted one of two co-ordinate axes and each observation defines a point, of

which the co-ordinates are the observed values of the two variables. The

scatter diagram gives a compact illustration of the relationship between the two

variables.

[Figure: scatter plot of StuSitWeightKG (y-axis, 0 to 80) against StuSitDOBAge (x-axis, 0 to 60)]

You may try to fit linear or quadratic curves to the data to summarize the suggested relationship:

[Figures: two scatter plots of StuSitWeightKG against StuSitDOBAge, each overlaid with fitted values]
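A straight-line fit of weight on age can be sketched in a few lines of Python. This is only an illustration, not the method used to draw the fitted values in these notes: it uses just the 25 subjects listed in the demographic table (the full data set has 128), it covers the linear case only, and the function name `fit_line` is made up for this sketch.

```python
# Ages (years) and weights (kg) of the 25 subjects listed in the
# demographic table earlier in these notes (a subset of the full data).
ages = [2, 2, 13, 6, 3, 31, 2, 12, 2, 4, 14, 4, 14,
        6, 12, 12, 11, 7, 6, 6, 5, 6, 8, 14, 9]
weights = [10.7, 10.7, 31.5, 20, 11.7, 55, 10.6, 33.6, 11.6, 15, 33.1,
           12.4, 47, 14.5, 20.3, 20.5, 21.2, 18.1, 13.4, 19.1, 15.1,
           16.4, 25.6, 36.6, 28.1]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares about the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(ages, weights)
# Fitted values: the points that would be drawn as the line on the scatter plot.
fitted = [intercept + slope * a for a in ages]
print(f"weight = {intercept:.2f} + {slope:.2f} * age")
```

The fitted values can be plotted over the scatter diagram with any graphing tool; a quadratic curve would need an extra age-squared term.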

EXERCISE:

Think about your specific research project and consider the nature of the

expected results in terms of the type of information that you are collecting

or generating. Will you be able to organize this information in terms of

variables and observations, and can you identify what kind of variables you

will have?

Use the data from the malaria example discussed earlier and group all

variables according to their type, i.e., nominal, ordinal, discrete, continuous.

Discuss how you would illustrate the information collected on gender and

how you would further illustrate the relationship between gender and

outcome.

You are interested in whether there is a relationship between the dose of

trt A1 received and the AUC for trt A1. How would you illustrate this

relationship?


The graph below illustrates the use of other medications for the two

sites. Comment and compare the two sites with respect to use of other

medications.

[Figure: bar chart of count of patno by site (SiteA, SiteB), with separate bars for each type of other medication (Antipyretic, Antimalarial, Topical, Rehydration); y-axis 0 to 80]

The graph below illustrates the relationship between Cmax for trt A1 and

dose. Comment on this relationship.

[Figure: scatter plot of Cmax_A1 (y-axis, 0 to 200) against A1dose (x-axis, 20 to 100), overlaid with fitted values]

SUMMARIZING DATA

CATEGORICAL DATA

Create frequency distributions that give the number or percentage of observations in each category of the nominal or ordinal variable.

For example

outcome | Freq. Percent Cum.

------------+-----------------------------------

ACPR | 106 82.81 82.81

LTFU | 12 9.38 92.19

ETF | 2 1.56 93.75

LPF | 5 3.91 97.66

LCF | 3 2.34 100.00

------------+---------------------------------

Total| 128 100.00
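As an illustration, the Percent and Cum. columns of the table above can be reproduced from the frequencies alone. The sketch below is in Python rather than the Stata used for these notes, and the function name `frequency_table` is made up for the example.

```python
# Outcome frequencies taken from the table above.
counts = {"ACPR": 106, "LTFU": 12, "ETF": 2, "LPF": 5, "LCF": 3}


def frequency_table(counts):
    """Return (rows, total); each row is (category, freq, percent, cumulative percent)."""
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for category, freq in counts.items():
        cumulative += freq
        rows.append((category, freq,
                     100 * freq / total,
                     100 * cumulative / total))
    return rows, total


rows, total = frequency_table(counts)
for category, freq, pct, cum in rows:
    print(f"{category:<6}{freq:>6}{pct:>10.2f}{cum:>10.2f}")
print(f"{'Total':<6}{total:>6}{100.0:>10.2f}")
```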

Use two-way frequency tables to summarize the relationship between two categorical variables, here trt (rows) by outcome (columns). Each cell gives the frequency, with the row percentage beneath it:

trt | ACPR LTFU ETF LPF LCF | Total

-----------+-------------------------------------------------------+----------

trtA | 50 5 1 4 2 | 62

| 80.65 8.06 1.61 6.45 3.23 | 100.00

-----------+-------------------------------------------------------+----------

trtB | 56 7 1 1 1 | 66

| 84.85 10.61 1.52 1.52 1.52 | 100.00

-----------+-------------------------------------------------------+----------

Total | 106 12 2 5 3 | 128

| 82.81 9.38 1.56 3.91 2.34 | 100.00
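The percentages in the two-way table are row percentages: each count divided by its row total. A minimal Python sketch, using the counts from the table above (the names `table` and `row_percentages` are only illustrative):

```python
# Counts of each outcome within each treatment group, from the table above.
outcomes = ["ACPR", "LTFU", "ETF", "LPF", "LCF"]
table = {
    "trtA": [50, 5, 1, 4, 2],
    "trtB": [56, 7, 1, 1, 1],
}


def row_percentages(row):
    """Express each count in a row as a percentage of the row total."""
    total = sum(row)
    return [100 * count / total for count in row]


row_pct = {trt: row_percentages(row) for trt, row in table.items()}
# Column totals reproduce the one-way frequency distribution of outcome.
column_totals = [sum(col) for col in zip(*table.values())]

for trt, row in table.items():
    print(trt, row, sum(row))
    print("    ", [round(p, 2) for p in row_pct[trt]])
print("Total", column_totals, sum(column_totals))
```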

Continuous data

Recall the histogram of Cmax_A1.

How can we summarize the information on Cmax_A1 values for our sample using a few numbers?

What is it that we want to know about the Cmax_A1 values?

Measures of central tendency:

Mean

Median

Mode

Measures of dispersion:

Range

Interquartile range

Variance and standard deviation

Coefficient of variation

[Figure: histogram of Cmax_A1; x-axis Cmax_A1 (0 to 200), y-axis Frequency (0 to 20)]

THE MEAN

= the average value

= the sum of all the values / number of observations

+-----+

| age |

1. | 31 |

2. | 6 |

3. | 21 |

4. | 27 |

5. | 15 |

6. | 3 |

7. | 17 |

8. | 35 |

9. | 3 |

10. | 56 |

11. | 23 |

12. | 3 |

+-----+

mean age = (31+6+21+27+15+3+17+35+3+56+23+3)/12 = 240/12 = 20

Say the age of 56 was an error and should really have been recorded as 36. Then the new mean is

mean age = (31+6+21+27+15+3+17+35+3+36+23+3)/12 = 220/12 = 18.33

The mean takes into consideration the actual magnitude of the values and is

very sensitive to unusual values.

$$\bar{x} = \sum_{i=1}^{n} x_i / n$$
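The effect of the single recording error on the mean can be checked directly. A short Python sketch using the 12 ages above (the helper name `mean` is just for this example):

```python
# The 12 ages from the table above.
ages = [31, 6, 21, 27, 15, 3, 17, 35, 3, 56, 23, 3]


def mean(values):
    """Sum of all the values divided by the number of observations."""
    return sum(values) / len(values)


# Replace the suspect value 56 with the corrected value 36.
corrected = [36 if a == 56 else a for a in ages]

print(mean(ages))                 # 20.0
print(round(mean(corrected), 2))  # 18.33
```

Changing one value out of twelve moves the mean by more than a year and a half, which is what is meant by the mean being sensitive to unusual values.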

[Figure: histogram of StuSitDOBAge; x-axis 0 to 60, y-axis Frequency (0 to 4)]

THE MEDIAN = the central observation,

50% of the values lie below this value and 50% lie above.

For n observations ranked from smallest to largest, where n is odd, the median is the [(n+1)/2]th value.

When n is even, the median is the average of the two middle observations, the (n/2)th and [(n/2)+1]th observations.

+-----+

| age |

|-----|

1. | 31 |

2. | 6 |

3. | 21 |

4. | 27 |

5. | 15 |

6. | 3 |

7. | 17 |

8. | 35 |

9. | 3 |

10. | 56 |

11. | 23 |

12. | 3 |

+-----+

+-----+

| age |

|-----|

1. | 3 |

2. | 3 |

3. | 3 |

4. | 6 |

5. | 15 |

6. | 17 |

7. | 21 |

8. | 23 |

9. | 27 |

10. | 31 |

11. | 35 |

12. | 56 |

+-----+

n = 12 is even, so the (n/2) = 6th observation = 17

and the [(n/2)+1] = 7th observation = 21,

giving median = (17+21)/2 = 19

Changing 56 to 36 does not affect the median because it is still the largest value and so the rank or order of the observations has remained the same. The median is much more robust than the mean.
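The odd/even rule for the median, and its robustness to the 56-versus-36 correction, can be sketched in Python as follows (the function name `median` is illustrative; Python's standard `statistics.median` follows the same rule):

```python
ages = [31, 6, 21, 27, 15, 3, 17, 35, 3, 56, 23, 3]


def median(values):
    """Middle value of the ranked data; average of the two middle values if n is even."""
    ranked = sorted(values)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the [(n+1)/2]th ranked value (index mid when counting from 0).
        return ranked[mid]
    # n even: average of the (n/2)th and [(n/2)+1]th ranked values.
    return (ranked[mid - 1] + ranked[mid]) / 2


corrected = [36 if a == 56 else a for a in ages]

print(median(ages))       # 19.0
print(median(corrected))  # 19.0
```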


In the previous example the mean and median gave similar values for the

central age. However, this is not always the case.

Cmax_A1 has a symmetrical distribution, thus the mean = 84.67 is very similar to the median = 81.33.

[Figure: density plot of pardens (parasite density on day 0); x-axis 0 to about 500000, y-axis Density]

Parasite density on day 0 has a

very skew distribution, thus the

mean=33533 is very different from

the median=5280. For very skew

distributions, the median is a more

appropriate measure of the central

position than the mean.

[Figure: density plot of Cmax_A1; x-axis 0 to 200, y-axis Density (0 to 0.015)]

THE MODE = the value that occurs most frequently.

+-----+

| age |

|-----|

1. | 3 |

2. | 3 |

3. | 3 | So for the 12 ages, the mode = 3

4. | 6 |

5. | 15 |

6. | 17 |

7. | 21 |

8. | 23 |

9. | 27 |

10. | 31 |

11. | 35 |

12. | 56 |

+-----+

More appropriate for discrete data, where you may be interested in the most

frequently observed response, e.g., when your variable measures the number of

children for one mother, the mode will give you the most common family size.
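Finding the mode amounts to counting how often each value occurs. A small Python sketch using the 12 ages (the function name `modes` is made up here; it returns a list because a data set can have more than one mode):

```python
from collections import Counter

ages = [3, 3, 3, 6, 15, 17, 21, 23, 27, 31, 35, 56]


def modes(values):
    """Return all values that occur with the highest frequency, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


print(modes(ages))  # [3]
```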


MEASURES OF SPREAD

RANGE = difference between largest and smallest value

INTERQUARTILE RANGE = the range of the central 50% of the values

+-----+

| age |

|-----|

1. | 3 |

2. | 3 |

3. | 3 |

4. | 6 |

5. | 15 |

6. | 17 |

7. | 21 |

8. | 23 |

9. | 27 |

10. | 31 |

11. | 35 |

12. | 56 |

+-----+

The range=56-3=53,

more commonly expressed as the two limits, (3-56).

To calculate the interquartile range, we need to identify the 25th

and 75th percentiles for the data.

In general, with the observations ranked from smallest to largest: if nk/100 is an integer, the kth percentile is the average of the observations with ranks nk/100 and (nk/100 + 1). If nk/100 is not an integer, the kth percentile is the (j+1)th observation, where j is the largest integer less than nk/100.

For our example,

n = 12, so 12 x 25/100 = 3 is an integer, and the 25th percentile is the average of the 3rd and 4th observations = (3+6)/2 = 4.5

12 x 75/100 = 9 is an integer, and the 75th percentile is the average of the 9th and 10th observations = (27+31)/2 = 29

Thus interquartile range=29-4.5=24.5, also expressed as (4.5-29).
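The percentile rule stated above can be written as a short Python function. This is a sketch of that particular rule only; statistical packages use a variety of percentile definitions and may give slightly different answers. The function name `percentile` is illustrative, and k is assumed to be a whole number strictly between 0 and 100.

```python
ages = [3, 3, 3, 6, 15, 17, 21, 23, 27, 31, 35, 56]


def percentile(values, k):
    """kth percentile by the rule in these notes (k a whole number, 0 < k < 100)."""
    ranked = sorted(values)
    n = len(ranked)
    if (n * k) % 100 == 0:
        # nk/100 is an integer r: average the r-th and (r+1)-th ranked values.
        r = n * k // 100
        return (ranked[r - 1] + ranked[r]) / 2
    # Otherwise take the (j+1)-th ranked value, j = largest integer below nk/100.
    j = (n * k) // 100
    return ranked[j]


q1 = percentile(ages, 25)
q3 = percentile(ages, 75)
print(q1, q3, q3 - q1)  # 4.5 29.0 24.5
```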

VARIANCE = the amount of variability around the mean

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

It can be thought of as the average of the squared deviations

from the mean.

age      age-mean    (age-mean)^2
31          11           121
6          -14           196
21           1             1
27           7            49
15          -5            25
3          -17           289
17          -3             9
35          15           225
3          -17           289
56          36          1296
23           3             9
3          -17           289
Total 240    0          2798

Variance = 2798/11 = 254.36


THE STANDARD DEVIATION:

The units of measurement of the variance are not the same as the units of

measurement of the variable for which you have calculated the variance. For

this reason we calculate the standard deviation which is the positive square

root of the variance. For the 12 ages, the standard deviation thus equals

$$s = \sqrt{254.36} = 15.95$$

This now has the same units as age.

The variable with the larger standard deviation is the more variable.

However, if variables have different units of measurement, it is not

appropriate to compare the standard deviations.

To get rid of the units of measurement, we calculate the

COEFFICIENT OF VARIATION as

$$CV = \frac{s}{\bar{x}} \times 100 = \frac{15.9488}{20} \times 100 = 79.74$$

(using the unrounded standard deviation), which expresses the standard deviation as a percentage of the

mean, or the variability as a percentage of the central position.

It is dimensionless and can be used to evaluate the relative

variability of two groups of observations.
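The variance, standard deviation and coefficient of variation for the 12 ages can be recomputed step by step. The Python sketch below follows the formulas above, dividing by n - 1 for the variance; the variable names are chosen for this example only.

```python
import math

ages = [31, 6, 21, 27, 15, 3, 17, 35, 3, 56, 23, 3]

n = len(ages)
mean_age = sum(ages) / n

# Squared deviations from the mean; their sum is the "Total" of the last
# column in the variance table.
squared_deviations = [(a - mean_age) ** 2 for a in ages]
variance = sum(squared_deviations) / (n - 1)

# The standard deviation is back in the units of the data (years).
sd = math.sqrt(variance)

# The coefficient of variation is the SD as a percentage of the mean.
cv = 100 * sd / mean_age

print(round(variance, 2))  # 254.36
print(round(sd, 2))        # 15.95
print(round(cv, 2))        # 79.74
```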

Illustrating (graphing) the measures of central position and spread:

[Figure: box plot of StuSitDOBAge; y-axis 0 to 60. The labels on the plot, from top to bottom, are:]

maximum

adjacent value = most extreme value not more than 1.5 times the height of the box beyond either quartile, so not more than 1.5 x (18 - 3) = 22.5, and 22.5 + 18 = 40.5

upper quartile

median

lower quartile

minimum


Side-by-side box plots are very useful to compare groups:

[Figure: side-by-side box plots of StuSitDOBAge by treatment group (trtA, trtB); y-axis 0 to 60]

• median age slightly higher for trtB group than for trtA group

• interquartile range wider for trtB group than for trtA group

• age distribution for trtA group skew because of outlying values

Sometimes bar graphs with error bars are used to display means and

standard deviations:

[Figure: mean plot (Spreadsheet1 10v*128c) showing Mean and Mean±SD of Var2 (y-axis, -2 to 28) for the Var3 categories SP and SP/ART, annotated trtA and trtB]

• Mean ages for the two treatment groups are the same; ages in the trtB group are slightly less variable than in the trtA group


CLASS EXERCISE:

(From Pagano & Gauvreau)

In Massachusetts, 8 individuals experienced

unexplained episodes of vitamin D intoxication

that required hospitalization; it was thought that

these unusual occurrences might be as a result of excessive supplementation of dairy milk.

Blood levels of calcium and albumin for each

subject at the time of hospitalization are listed.

For these calcium and albumin levels,

calculate the

Mean

Median

Mode

Range

Interquartile range

Standard deviation

Coefficient of variation

If you wish, use a statistical software package to illustrate these measures of central tendency

and spread.

Calcium Albumin

(mmol/l) (g/l)

2.92 43

3.84 42

2.37 42

2.99 40

2.67 42

3.17 38

3.74 34

3.44 42

The tables and graphs below summarise Cmax values for drug A1 for the two treatment groups.

From these tables and graphs:

Compare the means and medians for the two groups. Discuss which of the two measures of central position is the most appropriate.

Comment on the variability of the Cmax values within each treatment group by referring to the standard deviations, the ranges and interquartile ranges.

Calculate and compare the coefficients of variation of Cmax within the two treatment groups.

. table trt,c(n Cmax_A1 mean Cmax_A1 sd Cmax_A1)

----------------------------------------------------

trt | N(Cmax_A1) mean(Cmax_A1) sd(Cmax_A1)

----------+-----------------------------------------

trtA | 62 87.55352 29.73334

trtB | 63 81.82308 28.72239

----------------------------------------------------

. table trt,c(min Cmax_A1 p25 Cmax_A1 med Cmax_A1 p75 Cmax_A1 max Cmax_A1)

---------------------------------------------------------------------------

trt | min(Cmax_A1) p25(Cmax_A1) med(Cmax_A1) p75(Cmax_A1) max(Cmax_A1)

----------+----------------------------------------------------------------

trtA | 28.2339 64.2546 86.60575 108.267 165.744

trtB | 6.4265 64.3864 80.7916 95.014 152.769

---------------------------------------------------------------------------

[Figure: box plots of Cmax_A1 by treatment group (trtA, trtB); y-axis 0 to 200]

[Figure: density plot of Cmax_A1; x-axis 0 to 200, y-axis Density (0 to 0.015)]

The tables and graphs below summarise Cmax values for drug A2 for the two treatment groups.

From these tables and graphs:

Compare the means and medians for the two groups. Discuss which of the two measures of central position is the

most appropriate.

Comment on the variability of the Cmax values within each treatment group by referring to the standard

deviations, the ranges and interquartile ranges.

Calculate and compare the coefficients of variation of Cmax within the two treatment groups.

. table trt,c(n Cmax_A2 mean Cmax_A2 sd Cmax_A2)

----------------------------------------------------

trt | N(Cmax_A2) mean(Cmax_A2) sd(Cmax_A2)

----------+-----------------------------------------

trtA | 62 340.6208 159.762

trtB | 63 292.24 145.7977

----------------------------------------------------

. table trt,c(min Cmax_A2 p25 Cmax_A2 med Cmax_A2 p75 Cmax_A2 max Cmax_A2)

---------------------------------------------------------------------------

trt | min(Cmax_A2) p25(Cmax_A2) med(Cmax_A2) p75(Cmax_A2) max(Cmax_A2)

----------+----------------------------------------------------------------

trtA | 97.0572 236.945 307.553 423.281 1000.9

trtB | 18.1723 195.749 268.026 358.674 712.894

---------------------------------------------------------------------------

[Figure: density plot of Cmax_P; x-axis 0 to 1000, y-axis Density (0 to 0.004)]

[Figure: box plots by treatment group (SP, SP/ART); y-axis 0 to 1,000]

INTRODUCTORY STATISTICS

continued

PROBABILITY AND STATISTICAL INFERENCE

PROBABILITY:

All statistical summaries and hence decisions are subject to uncertainty.

The appropriate tool for measuring uncertainty is the theory of probability.

In frequentist statistics, probabilities are just relative frequencies that express the number of times that a given outcome was observed as a proportion of the total number of trials.

For example,

for the malaria data, we had 128 subjects, of whom 67 were female. Thus the probability of a female subject is 67/128 = 0.5234.

Of the 128 subjects, 106 had an “adequate clinical response”, hence the probability of successful

treatment was 106/128=0.8281.

Within each treatment group we observed that

50 of the 62 subjects on trt A had an adequate clinical response, so the probability = 50/62 = 0.8065

56 of the 66 subjects on trt B had an adequate clinical response, so the probability = 56/66 = 0.8485.
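These relative-frequency probabilities are simple proportions and can be verified in a few lines. A Python sketch using the counts quoted above (the function name `relative_frequency` is illustrative):

```python
def relative_frequency(times_observed, total_trials):
    """Number of times an outcome was observed as a proportion of all trials."""
    return times_observed / total_trials


p_female = relative_frequency(67, 128)        # female subjects
p_acpr = relative_frequency(106, 128)         # adequate clinical response overall
p_acpr_trt_a = relative_frequency(50, 62)     # adequate response on trt A
p_acpr_trt_b = relative_frequency(56, 66)     # adequate response on trt B

for label, p in [("female", p_female), ("ACPR", p_acpr),
                 ("ACPR | trt A", p_acpr_trt_a), ("ACPR | trt B", p_acpr_trt_b)]:
    print(f"{label:<14}{p:.4f}")
```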