fk6163 exec 2018 exploratory data analysis...2019/07/02  · ©[email protected] 2018 exploratory...

32
©[email protected] 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti Kebangsaan Malaysia FK6163 EXEC 2018

Upload: others

Post on 31-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: FK6163 EXEC 2018 Exploratory Data Analysis...2019/07/02  · ©drtamil@gmail.com 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti

©[email protected] 2018

Exploratory Data Analysis

Assc. Prof. Dr. Azmi Mohd Tamil

Dept of Community Health

Universiti Kebangsaan Malaysia

FK6163 EXEC 2018

Page 2: FK6163 EXEC 2018 Exploratory Data Analysis...2019/07/02  · ©drtamil@gmail.com 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti

©[email protected] 2018

Introduction

Method of Exploring Data differs According to Types of Variables

Page 3: FK6163 EXEC 2018 Exploratory Data Analysis...2019/07/02  · ©drtamil@gmail.com 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti

©[email protected] 2018

Page 4: FK6163 EXEC 2018 Exploratory Data Analysis...2019/07/02  · ©drtamil@gmail.com 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti

©[email protected] 2018

Explore

It is the first step in the analytic process

to explore the characteristics of the data

to screen for errors and correct them

to look for distribution patterns - normal

distribution or not

May require transformation before further

analysis using parametric methods

Or may need analysis using non-parametric

techniques

Page 5: FK6163 EXEC 2018 Exploratory Data Analysis...2019/07/02  · ©drtamil@gmail.com 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti

©[email protected] 2018

Data Screening

By running

frequencies, we may

detect inappropriate

responses

How many in the

audience have 15

children and

currently pregnant

with the 16th?

P A R I TY

6 73 0 . 7

4 42 0 . 2

3 61 6 . 5

2 21 0 . 1

2 19 . 6

83 . 7

31 . 4

73 . 2

52 . 3

31 . 4

1. 5

1. 5

2 1 81 0 0 . 0

1

2

3

4

5

6

7

8

9

1 0

1 1

1 5

T o t a l

V a l i d

F re q u e n c yP e rc e n t

Page 6: FK6163 EXEC 2018 Exploratory Data Analysis...2019/07/02  · ©drtamil@gmail.com 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti

©[email protected] 2018

Data Screening

See whether the

data make sense or

not.

E.g. Parity 10 but

age only 25.

Page 7: FK6163 EXEC 2018 Exploratory Data Analysis...2019/07/02  · ©drtamil@gmail.com 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti

©[email protected] 2018

Page 8: FK6163 EXEC 2018 Exploratory Data Analysis...2019/07/02  · ©drtamil@gmail.com 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti

©[email protected] 2018

Page 9: FK6163 EXEC 2018 Exploratory Data Analysis...2019/07/02  · ©drtamil@gmail.com 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti

©[email protected] 2018

Data Screening

By looking at measures of central tendency

and range, we can also detect abnormal values

for quantitative data

D e s c rip tiv e S ta tis t ic s

1 8 43 24 8 45 3 .0 53 3 .3 7

1 8 4

P re -p re g n a n c y w e ig h t

V a lid N ( l is tw is e )

NM in im u mM a xim u mM e a n

S td .

D e via tio n

Page 10: FK6163 EXEC 2018 Exploratory Data Analysis...2019/07/02  · ©drtamil@gmail.com 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti

©[email protected] 2018

Interpreting the Box Plot

Outlier

Outlier

Upper quartile

Smallest non-outlier

Median

Lower quartile

Largest non-outlier The whiskers extend

to 1.5 times the box

width from both ends

of the box and ends

at an observed value.

Three times the box

width marks the

boundary between

"mild" and "extreme"

outliers.

"mild" = closed dots

"extreme"= open dots

Page 11: FK6163 EXEC 2018 Exploratory Data Analysis...2019/07/02  · ©drtamil@gmail.com 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti

©[email protected] 2018

Data Screening

We can

also make

use of

graphical

tools such

as the box

plot to

detect

wrong

data entry 184N =

Pre-pregnancy weight

600

500

400

300

200

100

0

141198211181

73

Page 12: FK6163 EXEC 2018 Exploratory Data Analysis...2019/07/02  · ©drtamil@gmail.com 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti

©[email protected] 2018

Data Cleaning

Identify the extreme/wrong values

Check with original data source – i.e.

questionnaire

If incorrect, do the necessary correction.

Correction must be done before

transformation, recoding and analysis.

Page 13: FK6163 EXEC 2018 Exploratory Data Analysis...2019/07/02  · ©drtamil@gmail.com 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti

©[email protected] 2018

Parameters of Data Distribution

Mean – central value of data

Standard deviation – measure of how

the data scatter around the mean

Symmetry (skewness) – the degree of

the data pile up on one side of the mean

Kurtosis – how far data scatter from the

mean

Page 14: FK6163 EXEC 2018 Exploratory Data Analysis...2019/07/02  · ©drtamil@gmail.com 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti

©[email protected] 2018

Normal distribution

The Normal distribution is

represented by a family of curves

defined uniquely by two parameters,

which are the mean and the

standard deviation of the population.

The curves are always

symmetrically bell shaped, but the

extent to which the bell is

compressed or flattened out

depends on the standard deviation

of the population.

However, the mere fact that a curve

is bell shaped does not mean that it

represents a Normal distribution,

because other distributions may

have a similar sort of shape.

Page 15: FK6163 EXEC 2018 Exploratory Data Analysis...2019/07/02  · ©drtamil@gmail.com 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti

©[email protected] 2018

Normal distribution

If the observations follow a

Normal distribution, a range

covered by one standard

deviation above the mean

and one standard deviation

below it includes about

68.3% of the observations;

a range of two standard

deviations above and two

below (+ 2sd) about 95.4%

of the observations; and

of three standard deviations

above and three below (+

3sd) about 99.7% of the

observations

68.3%

95.4%

99.7%

Page 16: FK6163 EXEC 2018 Exploratory Data Analysis...2019/07/02  · ©drtamil@gmail.com 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti

©[email protected] 2018

Normality

Why bother with normality??

Because it dictates the type of analysis

that you can run on the data

Page 17: FK6163 EXEC 2018 Exploratory Data Analysis...2019/07/02  · ©drtamil@gmail.com 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti

©[email protected] 2018

Normality-Why?Parametric

Variable 1 Variable 2 Criteria Type of Test

Qualitative Qualitative Sample size > 20 dan no

expected value < 5Chi Square Test (X2)

Qualitative

Dichotomus

Qualitative

Dichotomus

Sample size > 30 Proportionate Test

Qualitative

Dichotomus

Qualitative

Dichotomus

Sample size > 40 but with at

least one expected value < 5X

2 Test with Yates

Correction

Qualitative

Dichotomus

Quantitative Normally distributed data Student's t Test

Qualitative

Polinomial

Quantitative Normally distributed data ANOVA

Quantitative Quantitative Repeated measurement of the

same individual & item (e.g.

Hb level before & after

treatment). Normally

distributed data

Paired t Test

Quantitative -

continous

Quantitative -

continous

Normally distributed data Pearson Correlation

& Linear

Regresssion

Page 18: FK6163 EXEC 2018 Exploratory Data Analysis...2019/07/02  · ©drtamil@gmail.com 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti

©[email protected] 2018

Normality-Why?Non-parametric

Variable 1 Variable 2 Criteria Type of Test

Qualitative

Dichotomus

Qualitative

Dichotomus

Sample size < 20 or (< 40 but

with at least one expected

value < 5)

Fisher Test

Qualitative

Dichotomus

Quantitative Data not normally distributed Wilcoxon Rank Sum

Test or U Mann-

Whitney Test

Qualitative

Polinomial

Quantitative Data not normally distributed Kruskal-Wallis One

Way ANOVA Test

Quantitative Quantitative Repeated measurement of the

same individual & item

Wilcoxon Rank Sign

Test

Quantitative -

continous/ordina

l

Quantitative -

continous

Data not normally distributed Spearman/Kendall

Rank Correlation

Page 19: FK6163 EXEC 2018 Exploratory Data Analysis...2019/07/02  · ©drtamil@gmail.com 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti

©[email protected] 2018

Normality-How?

Explored graphically

• Histogram

• Stem & Leaf

• Box plot

• Normal probability

plot

• Detrended normal

plot

Explored statistically

• Kolmogorov-Smirnov

statistic, with

Lilliefors significance

level and the

Shapiro-Wilks

statistic

• Skew ness (0)

• Kurtosis (0)

– + leptokurtic

– 0 mesokurtik

– - platykurtic

Page 20: FK6163 EXEC 2018 Exploratory Data Analysis...2019/07/02  · ©drtamil@gmail.com 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti

©[email protected] 2018

Kolmogorov- Smirnov

In the 1930’s, Andrei Nikolaevich

Kolmogorov (1903-1987) and N.V.

Smirnov (his student) came out with the

approach for comparison of distributions

that did not make use of parameters.

This is known as the Kolmogorov-

Smirnov test.

Page 21: FK6163 EXEC 2018 Exploratory Data Analysis...2019/07/02  · ©drtamil@gmail.com 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti

©[email protected] 2018

Skew ness

Skewed to the right

indicates the

presence of large

extreme values

Skewed to the left

indicates the

presence of small

extreme values

Page 22: FK6163 EXEC 2018 Exploratory Data Analysis...2019/07/02  · ©drtamil@gmail.com 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti

©[email protected] 2018

Kurtosis

For symmetrical

distribution only.

Describes the shape

of the curve

Mesokurtic -

average shaped

Leptokurtic - narrow

& slim

Platikurtic - flat &

wide

Page 23: FK6163 EXEC 2018 Exploratory Data Analysis...2019/07/02  · ©drtamil@gmail.com 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti

©[email protected] 2018

Skew ness & Kurtosis

Skew ness ranges from -3 to 3.

Acceptable range for normality is skew ness

lying between -1 to 1.

Normality should not be based on skew ness

alone; the kurtosis measures the “peak ness”

of the bell-curve (see Fig. 4).

Likewise, acceptable range for normality is

kurtosis lying between -1 to 1.

Page 24: FK6163 EXEC 2018 Exploratory Data Analysis...2019/07/02  · ©drtamil@gmail.com 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti

©[email protected] 2018

Page 25: FK6163 EXEC 2018 Exploratory Data Analysis...2019/07/02  · ©drtamil@gmail.com 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti

©[email protected] 2018

Normality - ExamplesGraphically

Height

167.5

165.0

162.5

160.0

157.5

155.0

152.5

150.0

147.5

145.0

142.5

140.0

60

50

40

30

20

10

0

Std. Dev = 5.26

Mean = 151.6

N = 218.00

Page 26: FK6163 EXEC 2018 Exploratory Data Analysis...2019/07/02  · ©drtamil@gmail.com 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti

©[email protected] 2018

Q&Q Plot

This plot compares the quintiles of a data distribution with the quintiles of a standardised theoretical distribution from a specified family of distributions (in this case, the normal distribution).

If the distributional shapes differ, then the points will plot along a curve instead of a line.

Take note that the interest here is the central portion of the line, severe deviations means non-normality. Deviations at the “ends” of the curve signifies the existence of outliers.

Page 27: FK6163 EXEC 2018 Exploratory Data Analysis...2019/07/02  · ©drtamil@gmail.com 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti

©[email protected] 2018

Normality - ExamplesGraphically

Detrended Normal Q-Q Plot of Height

Observed Value

170160150140130

De

v f

rom

No

rma

l.6

.5

.4

.3

.2

.1

0.0

-.1

-.2

Normal Q-Q Plot of Height

Observed Value

170160150140130

Exp

ecte

d N

orm

al

3

2

1

0

-1

-2

-3

Page 28: FK6163 EXEC 2018 Exploratory Data Analysis...2019/07/02  · ©drtamil@gmail.com 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti

©[email protected] 2018

Te sts of Norm ality

.060 218 .052Height

Statistic df Sig.

Kolmogorov-Smirnova

Lilliefors Signif icance Correctiona.

De scriptives

151.65 .356

150.94

152.35

151.59

151.50

27.649

5.258

139

168

29

8.00

.148 .165

.061 .328

Mean

Low er Bound

Upper Bound

95% Conf idence

Interval for Mean

5% Trimmed Mean

Median

Variance

Std. Deviation

Minimum

Maximum

Range

Interquartile Range

Skew ness

Kurtosis

Height

Statistic Std. Error

Normality - ExamplesStatistically

Shapiro-Wilks; only if

sample size less than 100.

Normal distribution

Mean=median=mode

Skewness & kurtosis

within +1

p > 0.05, so normal

distribution

Page 29: FK6163 EXEC 2018 Exploratory Data Analysis...2019/07/02  · ©drtamil@gmail.com 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti

©[email protected] 2018

Normal distribution

Mean=median=mode

Page 30: FK6163 EXEC 2018 Exploratory Data Analysis...2019/07/02  · ©drtamil@gmail.com 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti

©[email protected] 2018

K-S Test

Page 31: FK6163 EXEC 2018 Exploratory Data Analysis...2019/07/02  · ©drtamil@gmail.com 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti

©[email protected] 2018

K-S Test

very sensitive to the sample sizes of the

data.

For small samples (n<20, say), the

likelihood of getting p<0.05 is low

for large samples (n>100), a slight

deviation from normality will result in

being reported as abnormal distribution

Page 32: FK6163 EXEC 2018 Exploratory Data Analysis...2019/07/02  · ©drtamil@gmail.com 2018 Exploratory Data Analysis Assc. Prof. Dr. Azmi Mohd Tamil Dept of Community Health Universiti

©[email protected] 2018

Guide to deciding on normality