Exploratory Data Analysis (Descriptive Statistics) Martina Litschmannová martina.litschmannova@vsb.cz K210

Post on 17-Dec-2015

Page 1:

Exploratory Data Analysis (Descriptive Statistics)

Martina Litschmannová
martina.litschmannova@vsb.cz

K210

Page 2:

Statistics has two major branches:

Descriptive statistics

Inferential statistics

Page 3:

Statistics

Descriptive statistics: gives numerical and graphical procedures to summarize a collection of data in a clear and understandable way.

Inferential statistics: provides procedures to draw inferences about a population from a sample.

Page 4:

Population vs. Sample

A population includes every element from the set of observations that can be made.

A sample consists only of observations drawn from the population.

[Figure: diagram relating a population and a sample; labels: sampling, Exploratory Data Analysis, Inferential Statistics]

Page 5:

Variable

A variable has two defining characteristics:
• A variable is an attribute that describes a person, place, thing, or idea.
• The value of the variable can "vary" from one entity to another.

Page 6:

Types of Variables

Qualitative variable (categorical)
• Ordinal variable (its variants can be ordered)
• Nominal variable (its variants are equivalent; they have no natural order)

Quantitative variable (numerical)

Page 7:

Exploratory data analysis

Statistical tools that help examine data in order to describe their main features.

Basic strategy
• Examine variables one by one, then look at the relationships among the different variables.
• Start with graphs, then add numerical summaries of specific aspects of the data.

Page 8:

Exploratory data analysis - One variable

Graphical displays
• Qualitative/categorical data: bar chart, pie chart, etc.
• Quantitative data: histogram, boxplot, etc.

Summary statistics
• Qualitative/categorical data: frequency tables
• Quantitative data: mean, median, standard deviation, range, etc.

Page 9:

EDA - qualitative variable

Page 10:

Summary of categorical variables

Numerically: tables with total counts and percentages, mode

Graphically: bar graphs, pie charts

A bar graph is nearly always preferable to a pie chart: bar heights are easier to compare than slices of a pie.

Page 11:

Statistical characteristics

Mode (the variant that occurs most frequently)

Frequency table (or summary table)

Class x_i    Absolute frequency n_i     Relative frequency p_i
x_1          n_1                        p_1 = n_1/n
x_2          n_2                        p_2 = n_2/n
…            …                          …
x_k          n_k                        p_k = n_k/n
Total:       n_1 + n_2 + … + n_k = n    1

We summarize categorical data using a table. Note that the percentages are often called relative frequencies.

Page 12:

Statistical characteristics

Frequency table

Sex       Absolute frequency    Relative frequency [%]
Male      457                   58.2
Female    328                   41.8
Total:    785                   100.0

Mode = Male
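The frequency table above can be reproduced in a few lines of Python. This is a minimal sketch: the counts come from the table, while the variable names are illustrative choices.

```python
from collections import Counter

# Absolute frequencies from the table above (Sex: Male 457, Female 328).
counts = Counter({"Male": 457, "Female": 328})

n = sum(counts.values())  # total number of observations

# Relative frequency of each variant, in percent, rounded to one decimal.
relative = {variant: round(100 * k / n, 1) for variant, k in counts.items()}

# The mode is the variant with the highest absolute frequency.
mode = counts.most_common(1)[0][0]

print(n, relative, mode)
```

With raw data instead of counts, `Counter(list_of_values)` would build the absolute frequencies directly.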

Page 13:

Graphical Methods of Presenting Qualitative Variables

Bar chart: a standard graph in which the variants of the variable are shown on one axis and their frequencies on the other. Each frequency is then displayed as a bar (or as a box, line segment, prism, cone, etc.).

Page 14:

A bar chart is made up of columns plotted on a graph.

The columns are positioned over a label that represents a categorical variable.

The height of the column indicates the size of the group defined by the column label.

Attention! With three-dimensional shapes, we tend to perceive the volume rather than the height of the shape!

Page 15:

Graphical Methods of Presenting Qualitative Variables

Pie chart: represents the relative frequencies of the individual variants of a variable. Frequencies are shown as proportional sectors of a circle.

Page 16:

Page 17:

Page 18:

Blood type by Rh factor (counts)

Blood type    Rh+    Rh-    Total
0             38     7      45
A             34     6      40
B             9      2      11
AB            3      1      4
Total         84     16     100
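The marginal totals of the blood type table can be checked programmatically. The sketch below is only an illustration: the dictionary layout and the names are assumptions, and the counts are those in the table.

```python
# Joint counts from the table above: blood type -> (Rh+, Rh-).
table = {"0": (38, 7), "A": (34, 6), "B": (9, 2), "AB": (3, 1)}

# Row totals: number of people with each blood type.
row_totals = {blood: sum(cells) for blood, cells in table.items()}

# Column totals: [Rh+, Rh-] summed over all blood types.
col_totals = [sum(col) for col in zip(*table.values())]

n = sum(row_totals.values())  # grand total

# Share of Rh+ within each blood type (row-relative frequency, percent).
rh_pos_share = {b: round(100 * c[0] / row_totals[b], 1) for b, c in table.items()}

print(row_totals, col_totals, n, rh_pos_share)
```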

Page 19:

One-way table analysis in Excel

Page 20:

Statgraphics v. 5.0

• Manual: http://people.duke.edu/~rnau/sgwin5.pdf

Page 21:

One-way table analysis in Statgraphics

Page 22:

EDA - quantitative variable

Page 23:

Quantitative variables

Numerical summary
• Mean
• Median
• Quartiles
• Range
• Standard deviation
• …

Graphical summary
• Histogram
• Box plot
• …

Page 24:

Quantitative measures

When you compare two or more data sets, focus on four features:
• Center
• Spread
• Shape
• Unusual features

Page 25:

Measures of Central Tendency

Mean
To find the mean of a set of observations, add their values and divide by the number of observations.

Mean of a population of $N$ elements: $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$

Mean of a sample of $n$ observations: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

Page 26:

Mean example

The average age of 20 people in a room is 25. A 28-year-old leaves while a 30-year-old enters the room.

Does the average age change? If so, what is the new average age?
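One way to work through this example is to recover the sum of ages from the mean, adjust it, and divide again. The sketch below follows that reasoning; the variable names are illustrative.

```python
n = 20
old_mean = 25

total = n * old_mean          # sum of ages before the swap
new_total = total - 28 + 30   # one person leaves, another enters
new_mean = new_total / n      # the group size is unchanged

print(new_mean)  # 25.1
```

The mean rises because the sum of ages grows by 2 while the number of people stays at 20.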

Page 27:

Measures of Central Tendency

Median
The median is the midpoint of a distribution: the number such that half the observations are smaller and the other half are larger. It is also called the 50th percentile or 2nd quartile.

To compute a median:
• Order the observations.
• If the number of observations is odd, the median is the center observation.
• If the number of observations is even, the median is the average of the two center observations.

Median of a population:

Median of a sample:

Page 28:

Median example

The median age of 20 people in a room is 25. A 28-year-old leaves while a 30-year-old enters the room.

Does the median age change? If so, what is the new median age?

The median age of 21 people in a room is 25. A 28-year-old leaves while a 30-year-old enters the room.

Does the median age change? If so, what is the new median age?
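The two questions have different answers. With 21 people the median is the 11th ordered age, and both 28 and 30 lie above it, so the median stays at 25. With 20 people the median is the average of the 10th and 11th ordered ages, so the answer depends on whether the 28-year-old is one of those two. The sketch below illustrates this with hypothetical rooms; the age lists are invented for illustration and are not data from the slides.

```python
from statistics import median

def swap(ages, leaves=28, enters=30):
    """Return a new list in which one `leaves` is replaced by `enters`."""
    ages = list(ages)
    ages.remove(leaves)
    ages.append(enters)
    return ages

# Hypothetical room A: 20 people, the two middle ages are 22 and 28.
room_a = [20] * 9 + [22, 28] + [40] * 9
# Hypothetical room B: 20 people, both middle ages are 25.
room_b = [20] * 9 + [25, 25, 28] + [40] * 8
# Hypothetical room C: 21 people, the middle (11th) age is 25.
room_c = [20] * 10 + [25, 28] + [40] * 9

for name, room in [("A", room_a), ("B", room_b), ("C", room_c)]:
    print(name, len(room), median(room), median(swap(room)))
```

Rooms A and B both start with a median of 25, yet only room A's median changes, which shows that the 20-person question cannot be answered from the median alone.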

Page 29:

Mean vs. median

When the histogram is symmetric, the mean and median are similar.

The mean and median differ when the histogram is skewed:
• Skewed to the right: the mean is larger than the median.
• Skewed to the left: the mean is smaller than the median.

Page 30:

Mean vs. median

Extreme example
Income in a small town of 6 people:

$25,000 $27,000 $29,000 $35,000 $37,000 $38,000

The mean is about $31,833 and the median is $32,000.

Bill Gates moves to town.

$25,000 $27,000 $29,000 $35,000 $37,000 $38,000 $40,000,000

The mean is about $5,741,571 and the median is $35,000.

The mean is pulled by the outlier while the median is not. The median is a better measure of center for these data.
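The figures in this example can be verified with Python's standard `statistics` module. This is a short sketch using the incomes listed above; the variable names are illustrative.

```python
from statistics import mean, median

incomes = [25_000, 27_000, 29_000, 35_000, 37_000, 38_000]
with_outlier = incomes + [40_000_000]  # Bill Gates moves to town

# Before: mean and median are close to each other.
print(round(mean(incomes)), median(incomes))

# After: the mean jumps by millions, the median moves only slightly.
print(round(mean(with_outlier)), median(with_outlier))
```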

Page 31:

Effect of Changing Units

How are measures of central tendency affected when we change units (minutes to hours, feet to meters, etc.)?

If you add a constant to every value, the mean and median increase by the same constant.

If you multiply every value by a constant, the mean and median are also multiplied by that constant.

Page 32:

Effect of Changing Units - example

The average annual temperature in Prague is 10 °C. What is the average annual temperature in Prague in degrees Fahrenheit?
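Because the conversion from Celsius to Fahrenheit multiplies by 1.8 and adds 32, the mean can be converted directly. The sketch below applies that rule and then checks it on a few hypothetical readings; the readings are invented for illustration, chosen so that their mean is 10 °C.

```python
from statistics import mean

def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return 1.8 * celsius + 32

mean_c = 10               # average annual temperature from the example, in °C
mean_f = c_to_f(mean_c)   # multiply by 1.8, then add 32

# Hypothetical readings whose mean is 10 °C, to check the rule on raw data.
readings_c = [8, 10, 12]
mean_of_converted = mean([c_to_f(t) for t in readings_c])

print(mean_f, mean_of_converted)
```

Converting the mean and averaging the converted readings give the same result, about 50 °F.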

Page 33:

Is a central measure enough?

A warm, stable climate greatly affects some individuals' health. Atlanta and San Diego have about equal average temperatures (62° vs. 64°). If a person's health requires a stable climate, in which city would you recommend they live?

Page 34:

Measures of spread

Range
The difference between the largest and smallest values in a set of values.

Interquartile range (IQR)
The difference between the upper quartile and the lower quartile.
• The lower quartile is the "middle" value in the first half of the rank-ordered data set.
• The upper quartile is the "middle" value in the second half of the rank-ordered data set.

Page 35:

Measures of spread

Variance
• In a population, variance is the average squared deviation from the population mean: $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
• Sample variance is defined by a slightly different formula and uses slightly different notation: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

Standard deviation
• The standard deviation measures how far observations typically are from their mean. It is the square root of the variance.

Population: $\sigma = \sqrt{\sigma^2}$    Sample: $s = \sqrt{s^2}$

Page 36:

Measures of spread - example

A population consists of four observations: {1, 3, 5, 7}. What is the variance?

A simple random sample consists of four observations: {1, 3, 5, 7}. Based on these sample observations, what is the best estimate of the standard deviation of the population?
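Both questions can be checked with the standard `statistics` module, which offers separate functions for the population and sample versions. The sketch below assumes the usual convention that the sample estimate divides by n - 1.

```python
from statistics import pvariance, stdev

data = [1, 3, 5, 7]

# Population variance: the average squared deviation from the mean (divide by N).
pop_var = pvariance(data)

# Sample standard deviation: square root of the sum of squared
# deviations divided by n - 1.
sample_sd = stdev(data)

print(pop_var, round(sample_sd, 3))
```

The mean is 4, the squared deviations are 9, 1, 1, 9, so the population variance is 20/4 = 5 and the sample standard deviation is the square root of 20/3, roughly 2.58.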

Page 37:

Effect of Changing Units

How are measures of spread affected when we change units (minutes to hours, feet to meters, etc.)?

If you add a constant to every value, the distance between values does not change. As a result, all of the measures of variability (range, interquartile range, standard deviation, and variance) remain the same.

Suppose you multiply every value by a constant. This multiplies the range, interquartile range (IQR), and standard deviation by the absolute value of that constant. It has an even greater effect on the variance: it multiplies the variance by the square of the constant.

Page 38:

Effect of Changing Units - example

The variance of the annual temperature in Prague is 0.25 (°C)². What is the variance of the annual temperature in Prague in squared degrees Fahrenheit?
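For the variance, the additive 32 in the Celsius-to-Fahrenheit conversion has no effect and the factor 1.8 enters squared. The sketch below applies that rule and then checks it on hypothetical readings, which are invented for illustration only.

```python
from statistics import pvariance

var_c = 0.25               # variance from the example, in (°C)²
var_f = 1.8 ** 2 * var_c   # the +32 shift drops out; the factor 1.8 is squared

# Hypothetical readings, to check the rule on raw data.
readings_c = [9.5, 10.0, 10.5]
readings_f = [1.8 * t + 32 for t in readings_c]
ratio = pvariance(readings_f) / pvariance(readings_c)  # close to 1.8² = 3.24

print(round(var_f, 2), round(ratio, 2))
```

The result is about 0.81 squared degrees Fahrenheit.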

Page 39:

Measures of position

Percentiles
Assume that the elements in a data set are rank ordered from the smallest to the largest. The values that divide a rank-ordered set of elements into 100 equal parts are called percentiles.

Quartiles (lower quartile, median, upper quartile)
Assume that the elements in a data set are rank ordered from the smallest to the largest. The values that divide a rank-ordered set of elements into 4 equal parts are called quartiles.

Standard scores (z-scores)
A z-score indicates how many standard deviations an element is from the mean. A standard score can be calculated from the formula: $z = \frac{x - \mu}{\sigma}$, where $x$ is the value of the element, $\mu$ is the mean, and $\sigma$ is the standard deviation.

Page 40:

How to interpret z-score?

• $z < 0$: an element less than the mean.
• $z > 0$: an element greater than the mean.
• $z = 0$: an element equal to the mean.
• $z = 1$: an element that is 1 standard deviation greater than the mean; $z = 2$: 2 standard deviations greater than the mean; etc.
• $z = -1$: an element that is 1 standard deviation less than the mean; $z = -2$: 2 standard deviations less than the mean; etc.
• If the number of elements in the set is large and the distribution is roughly bell-shaped, about 68% of the elements have a z-score between -1 and 1; about 95% have a z-score between -2 and 2; and about 99.7% have a z-score between -3 and 3.
• $|z| > 3$: the element is an outlier.

Page 41:

z-score - Example

A national achievement test is administered annually to 3rd graders. The test has a mean score of 100 and a standard deviation of 15. If Jane's z-score is 1.20, what was her score on the test?
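Jane's score follows from solving the z-score definition, z = (x - mean) / standard deviation, for x. A minimal sketch with illustrative variable names:

```python
mu = 100      # mean test score
sigma = 15    # standard deviation of the test scores
z = 1.20      # Jane's z-score

# Solve z = (x - mu) / sigma for x.
x = mu + z * sigma

print(x)
```

Jane scored 1.2 standard deviations, that is 18 points, above the mean, which gives 118.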

Page 42:

Graphical Methods of Presenting Quantitative Variables

Histograms - made up of columns plotted on a graph
• There is no space between adjacent columns.
• The columns are positioned over a label that represents a quantitative variable.
• The column label can be a single value or a range of values.
• The height of the column indicates the size of the group defined by the column label.

Page 43:

Page 44:

Histogram

[Figure: histogram of hemoglobin; horizontal axis: hemoglobin, from 8.4 to 18.4; vertical axis: frequency, from 0 to 10]

Page 45:

Histograms

Where did the bins come from? They were chosen rather arbitrarily.

Does choosing other bins change the picture? Yes! And sometimes dramatically.

What do we do about this? Some pretty smart people have come up with some "optimal" bin widths, and we will rely on their suggestions.

Optimal number of bins (Sturges' rule): $k = 1 + \log_2 n \approx 1 + 3.3 \log_{10} n$, where $n$ is the number of observations.
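Sturges' rule is commonly stated as k = 1 + log2(n) for n observations. The sketch below computes it; rounding up to a whole number of bins is an assumption made here, and the function name is illustrative.

```python
import math

def sturges_bins(n):
    """Number of histogram bins suggested by Sturges' rule, rounded up."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return math.ceil(1 + math.log2(n))

for n in (10, 64, 100, 1000):
    print(n, sturges_bins(n))
```

The bin count grows slowly with the sample size: moving from 100 to 1000 observations adds only a few bins.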

Page 46:

Histogram

The purpose of a graph is to help us understand the data.

After you make a graph, always ask, “What do I see?”

Once you have displayed a distribution you can see the important features.

Page 47:

Histograms

We will describe the features of the distribution that the histogram is displaying with four characteristics:
• Shape
• Center
• Spread
• Unusual features

Page 48:

Histograms

Shape

Symmetry - when it is graphed, a symmetric distribution can be divided at the center so that each half is a mirror image of the other.

Page 49:

Histograms

Shape

Number of peaks
• Distributions with one clear peak are called unimodal.
• Distributions with two clear peaks are called bimodal.
• When a symmetric distribution has a single peak at the center, it is referred to as bell-shaped.

Page 50:

Histograms

Shape

Skewness - when they are displayed graphically, some distributions have many more observations on one side of the graph than the other.

Distributions with most of their observations on the left (toward lower values) are said to be skewed right.

Distributions with most of their observations on the right (toward higher values) are said to be skewed left.

Page 51:

Histograms

Shape

Skewness: a measure of asymmetry. The sign of the sample skewness indicates the direction:
• positive: skewed right
• negative: skewed left
• zero: symmetric

Page 52:

Histograms

Shape

Uniform - when the observations in a set of data are equally spread across the range of the distribution, the distribution is called a uniform distribution.

Page 53:

Histograms

Center
Graphically, the center of a distribution is located at the median of the distribution.

[Figure: three histograms labeled $\bar{x} < x_{0.5}$, $\bar{x} = x_{0.5}$, and $\bar{x} > x_{0.5}$, where $x_{0.5}$ denotes the median]

Page 54:

Histograms

Spread
The spread of a distribution refers to the variability of the data.
• If the observations cover a wide range, the spread is larger.
• If the observations are clustered around a single value, the spread is smaller.

Kurtosis: a measure of how peaked the distribution is. For the sample kurtosis:
• big kurtosis: sharper peak (less spread)
• small kurtosis: flatter shape (more spread)

Page 55:

Histograms

Unusual Features

Gaps. Gaps refer to areas of a distribution where there are no observations.

Outliers. Sometimes, distributions are characterized by extreme values that differ greatly from the other observations. These extreme values are called outliers.

How can we identify outliers?
• By z-score: a common convention treats an element with $|z| > 3$ as an outlier.
• Rule of thumb: an extreme value is often considered to be an outlier if it is at least 1.5 interquartile ranges below the lower quartile, or at least 1.5 interquartile ranges above the upper quartile.
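The 1.5 interquartile range rule of thumb can be sketched in Python as follows. The quartiles here are computed as the medians of the lower and upper halves of the ordered data, matching the description of quartiles earlier in these slides; other software may use slightly different quartile definitions. The function names are illustrative, and the data are the incomes from the earlier "Bill Gates" example.

```python
from statistics import median

def quartiles(values):
    """Lower and upper quartile as medians of the lower and upper half."""
    data = sorted(values)
    half = len(data) // 2   # assumes at least two observations
    lower = data[:half]
    upper = data[-half:]    # skips the middle element when n is odd
    return median(lower), median(upper)

def iqr_outliers(values, k=1.5):
    """Values at least k interquartile ranges outside the quartiles."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x <= low or x >= high]

incomes = [25_000, 27_000, 29_000, 35_000, 37_000, 38_000, 40_000_000]
print(quartiles(incomes), iqr_outliers(incomes))
```

For these incomes the quartiles are $27,000 and $38,000, so the fences lie at $10,500 and $54,500, and only the $40,000,000 income is flagged.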

Page 57:

Box and whiskers plot

• A boxplot splits the data set into quartiles.
• The body of the boxplot consists of a "box" (hence the name), which goes from the lower quartile (Q1) to the upper quartile (Q3).
• Within the box, a vertical line is drawn at Q2, the median of the data set.
• Two horizontal lines, called whiskers, extend from the box. The front whisker goes from Q1 to the smallest non-outlier in the data set (the smallest value above Q1 - 1.5 IQR), and the back whisker goes from Q3 to the largest non-outlier (the largest value below Q3 + 1.5 IQR).
• If the data set includes one or more outliers, they are plotted separately as points on the chart.

Page 58:

How to interpret a box plot?

[Figure: box plot annotated with the range, the IQR, and the shape of the distribution]

Page 59:

Quantitative variable analysis in Excel

Page 60:

Quantitative variable analysis in Statgraphics