Post on 21-Jul-2020

Exploratory data analysis

November 29, 2017

Dr. Khajonpong Akkarajitsakul

Department of Computer Engineering, Faculty of Engineering, King Mongkut’s University of Technology Thonburi

Module III – Overview

Learning Outcome

• Understand the basic concepts of exploratory data analysis

• Understand the role of statistics in data exploration

• Choose appropriate data analysis techniques to explore and analyze data

Agenda

• Basic concepts of exploratory data analysis (EDA)

• EDA techniques

Introduction to data

• Process of scientific inquiry:
  1. Identify a question or problem
  2. Collect relevant data on the topic
  3. Analyze the data
  4. Form a conclusion

• Statistics focuses on making stages (2) – (4) objective, rigorous, and efficient.

• Statistics is the study of how best to collect, analyze, and draw conclusions from data.

Exploratory Data Analysis

• Exploratory data analysis or “EDA” is a critical step in analyzing the data

• The main reasons are
  • detection of mistakes, outliers or abnormalities
  • checking of assumptions
  • preliminary selection of appropriate models
  • determining relationships among the explanatory variables
  • assessing the direction and rough size of relationships between explanatory and outcome variables

Data format

• Data from either experiments or operations are generally collected in databases (e.g. spreadsheet)

• One row per record, and one column for each identifier, outcome variable, and explanatory variable

• Each column contains the numerical value of a particular quantitative variable (aka measure) or the levels for a categorical variable (aka dimension)

Data: reading

import pandas as pd
df = pd.read_csv('mtcars.csv', header=0)
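As a rough illustration of the data format described above, the sketch below builds a tiny table by hand rather than reading a file. The column names and values are made up for illustration and are not from mtcars.csv.

```python
import pandas as pd

# Hypothetical records: one row per record, one column per variable.
df = pd.DataFrame({
    "subject_id": ["s1", "s2", "s3", "s4"],   # identifier
    "group": ["a", "b", "a", "b"],            # categorical variable (dimension)
    "score": [12.5, 9.0, 14.0, 11.5],         # quantitative variable (measure)
})

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # column types; the measure is stored as float64
print(df.head())   # first few records
```

Checking the shape and column types right after loading is a cheap way to confirm that each column was read as the kind of variable you expect.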

Types of EDA

• Graphical or Non-graphical
  • Non-graphical methods usually involve calculating summary statistics
  • Graphical methods summarize the data in a diagrammatic or pictorial way

• Univariate or Multivariate
  • Univariate methods look at one variable (column) at a time, while multivariate methods look at two or more variables at a time
  • It is usually a good idea to perform univariate EDA on each component of a multivariate EDA first

Four basic types of EDA

• Univariate non-graphical

• Multivariate non-graphical

• Univariate graphical

• Multivariate graphical

Univariate non-graphical EDA

• Univariate non-graphical EDA summarizes a single measured characteristic (e.g. age, gender, speed at a task, or response to a stimulus) across all subjects/records

• We should think of the measurements as representing a “sample distribution”, which in turn more or less represents the “population distribution”

• The goal is to better understand the “sample distribution” and draw some conclusions about the “population distribution”

Univariate non-graphical EDA: Categorical data

• The characteristics of interest for a categorical variable are simply the range of values and the frequency (or relative frequency) of occurrence for each value.

• Therefore the only useful univariate non-graphical technique for categorical variables is some form of tabulation of the frequencies, usually along with calculation of the fraction (or percent) of data that falls in each category.

Pandas: creating frequency table

import pandas as pd
df = pd.read_csv('iris.csv', header=0)
df.Species.value_counts()
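The tabulation can also report the fraction (or percent) in each category. A minimal sketch, assuming a small made-up series in place of the Species column of iris.csv:

```python
import pandas as pd

# Hypothetical categorical column standing in for iris.csv's Species.
species = pd.Series(
    ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]
)

counts = species.value_counts()                 # absolute frequency per level
shares = species.value_counts(normalize=True)   # fraction of records per level

# Combine both views into one frequency table.
table = pd.DataFrame({"count": counts, "percent": (shares * 100).round(1)})
print(table)
```

`value_counts(normalize=True)` divides each count by the total, so the fractions sum to 1.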

Univariate non-graphical EDA: Quantitative data

• Univariate EDA for a quantitative variable is a way to make preliminary assessments about the population distribution of the variable using the data of the observed sample.

• The characteristics of the population distribution of a quantitative variable are its center, spread, modality (number of peaks in the pdf), shape (including “heaviness of the tails”), and outliers.

Histogram

What can you see from this histogram?

• Central tendency

• Spread

• Skewness

• Etc.

Central tendency: Mean

Central tendency: Median

Central tendency: Which location measure is the best?

Mean vs Median

Min.    1st Qu.   Median   Mean     3rd Qu.   Max.
1000    8500      14000    15244    20000     35000

Median < Mean

What happened?
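One way to explore the question is to compare the two measures on a sample with a long right tail. The values below are made up for illustration; they are not the data behind the summary above.

```python
import numpy as np

# Made-up values with one large observation in the right tail.
values = np.array([1000, 8000, 12000, 14000, 16000, 20000, 35000])

print(np.mean(values))    # pulled toward the large value in the tail
print(np.median(values))  # middle value of the sorted data

# Making the largest value more extreme moves the mean but not the median.
stretched = values.copy()
stretched[-1] = 350000
print(np.mean(stretched), np.median(stretched))
```

In this sketch the median stays put while the mean follows the extreme value, which is the usual reason a right-skewed sample shows Median < Mean.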

Scale: Variance

Why squared deviations?

• Adding deviations will yield a sum of ?

• Absolute values do not have nice mathematical properties (non-linear)

• Squares eliminate the negatives

• Results:
  • Increasing contribution to the variance as you go farther from the mean
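The points above can be checked numerically. This is a minimal sketch on a made-up sample, assuming the sample form of the variance (dividing by n - 1).

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 5.0, 10.0])   # made-up sample
dev = x - x.mean()                          # deviations from the mean

print(dev.sum())          # raw deviations cancel out (zero up to rounding)
print(dev ** 2)           # squares are non-negative; far points dominate
print((dev ** 2).sum())   # so the squared deviations cannot cancel

# Sample variance: sum of squared deviations divided by n - 1.
variance = (dev ** 2).sum() / (len(x) - 1)
print(variance, np.var(x, ddof=1))   # both forms agree
```

The observation at 10 contributes 25 of the 36 units of squared deviation, which illustrates how the contribution grows with distance from the mean.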

Scale: Variance

• Variance is somewhat arbitrary
  • What does it mean to have a variance of 8.9? Or 1.5? Or 1245.34? Or 0.00001?
  • Nothing. But if you could “standardize” that value, you could talk about or compare any variance (i.e. deviation) in equivalent terms

• The standard deviation is simply the square root of the variance
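A short sketch of why the square root helps: the variance is in squared units, while the standard deviation is back in the units of the data. The sample is made up, and the n - 1 form is assumed.

```python
import numpy as np

heights_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])  # made-up sample

variance = np.var(heights_cm, ddof=1)   # units: cm squared
std = np.sqrt(variance)                 # units: cm, same as the data

print(variance, std)
print(np.isclose(std, np.std(heights_cm, ddof=1)))  # True

# Deviations expressed in standard deviations are comparable across variables.
z = (heights_cm - heights_cm.mean()) / std
print(z)
```

Dividing each deviation by the standard deviation gives unitless scores, which is one common way to compare spread "in equivalent terms".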

Scale: Standard deviation

Scale: Quartiles and IQR

Percentiles (aka Quantiles)

Common distribution shapes

Other distribution shapes

Skewness I: Positively skewed

• Longer tail toward high values

• Mean > Median > Mode

Skewness II: Negatively skewed

• Longer tail toward low values

• Mode > Median > Mean

Univariate Graphical EDA: Histogram

• A histogram is a graphical representation of the distribution of numerical data

• It provides a view of data density and the shape of the data distribution

• To construct a histogram:
  • bin the range of values
  • count how many values fall into each interval

• The bins are usually specified as consecutive, non-overlapping intervals of a variable.

• The bins (intervals) must be adjacent, and are usually of equal size.
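The two construction steps can be sketched without drawing anything, using `np.histogram` on made-up values. The choice of three bins is arbitrary.

```python
import numpy as np

data = np.array([1, 2, 2, 3, 5, 6, 6, 7, 9, 10])   # made-up values

# Step 1: bin the range into 3 adjacent, equal-width intervals.
# Step 2: count how many values fall into each interval.
counts, edges = np.histogram(data, bins=3)

# Each bin includes its left edge; only the last bin also includes its right edge.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count}")
```

Every value lands in exactly one bin, so the counts always add up to the number of observations.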

Univariate Graphical EDA: Effects of Histogram Bin

Univariate Graphical EDA: Effects of Number of Samples

Matplotlib: Histogram

import matplotlib.pyplot as plt
import numpy as np

# 1000 draws from a standard normal distribution
data = np.random.randn(1000)
plt.hist(data)
plt.title("Histogram")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()

Boxplot

• The box in a boxplot represents the middle 50% of the data

• The middle line indicates the median

• Whiskers can be designated as either
  • Max/Min
  • Outlier boundaries
    • Upper = Q3 + 1.5*IQR
    • Lower = Q1 - 1.5*IQR
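The outlier boundaries can be computed directly. This is a sketch on made-up values using NumPy's default (linear interpolation) quartiles; other quartile conventions give slightly different boundaries.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])   # made-up values; 30 sits far from the rest

q1, q3 = np.percentile(data, [25, 75])   # first and third quartiles
iqr = q3 - q1                            # interquartile range

upper = q3 + 1.5 * iqr
lower = q1 - 1.5 * iqr

# Values outside the boundaries are the points a boxplot draws individually.
outliers = data[(data < lower) | (data > upper)]
print(q1, q3, iqr)
print(lower, upper)
print(outliers)
```

Because the boundaries are built from quartiles, a single extreme value barely moves them, unlike limits based on the mean and standard deviation.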

Matplotlib: Box plots

import matplotlib.pyplot as plt
import numpy as np

# Simulated data: a wide spread, a cluster at the center, and high/low fliers
spread = np.random.rand(50) * 100
center = np.ones(25) * 50
flier_high = np.random.rand(10) * 100 + 100
flier_low = np.random.rand(10) * -100
data = np.concatenate((spread, center, flier_high, flier_low), 0)
plt.boxplot(data)
plt.show()

Multivariate Non-Graphical EDA

• Multivariate non-graphical EDA techniques show the relationship between two or more variables in the form of either cross-tabulation or statistics

Cross-Tabulation

Pandas: Cross-Tabulation

import pandas as pd

raw_data = {
    'SUBJECT_ID': ['GW', 'JA', 'TJ', 'JMA', 'JMO', 'JQA', 'AJ', 'MVB', 'WHH', 'JT', 'JKP'],
    'AGE_GROUP': ['Y', 'M', 'Y', 'Y', 'M', 'O', 'O', 'Y', 'O', 'Y', 'M'],
    'SEX': ['F', 'F', 'M', 'M', 'F', 'F', 'F', 'M', 'F', 'F', 'M'],
}

df = pd.DataFrame(raw_data, columns=['SUBJECT_ID', 'AGE_GROUP', 'SEX'])
print(df)
# Count records for each AGE_GROUP x SEX combination, with row and column totals
print(pd.crosstab(df.AGE_GROUP, df.SEX, margins=True))

Multivariate Graphical EDA: Scatter plot

• A scatter plot is a graph of the ordered pairs (x,y) of numbers consisting of the independent variable x and the dependent variable y.

Subject   Age (x)   Pressure (y)
A         43        128
B         48        120
C         56        135
D         61        143
E         67        141
F         70        152

[Figure: scatter plot of Pressure (y-axis, 120 to 155) against Age (x-axis, 40 to 80)]

Positive and negative relationship

• Simple relationships can also be positive or negative

• A positive relationship exists when both variables increase or decrease at the same time.

• In a negative relationship, as one variable increases, the other variable decreases, and vice versa.

Pearson product moment correlation

• Given n pairs of observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

• It is natural to speak of x and y having a positive relationship if large x’s are paired with large y’s and small x’s with small y’s

• On the contrary, if large x’s are paired with small y’s and small x’s with large y’s, then a negative relationship between the variables is implied

Pearson product moment correlation

• Consider the quantity

$$s_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

• Then, if the relationship is strongly positive, an $x_i$ above the mean will tend to be paired with a $y_i$ above the mean, so that $(x_i - \bar{x})(y_i - \bar{y}) > 0$, and this product will also be positive whenever both $x_i$ and $y_i$ are below their means

Pearson product moment correlation

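• To make this measure dimensionless, we divide as follows

$$r = \frac{s_{xy}}{\sqrt{s_{xx}\, s_{yy}}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

• A more convenient form for this equation is

$$r = \frac{n \sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{\left[n \sum x_i^2 - \left(\sum x_i\right)^2\right]\left[n \sum y_i^2 - \left(\sum y_i\right)^2\right]}}$$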
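As a check that the two forms of r agree, the sketch below evaluates both on the age and pressure values from the scatter plot example and compares them with NumPy's built-in `np.corrcoef`. The variable names are chosen here for illustration.

```python
import numpy as np

# Age (x) and pressure (y) for subjects A to F from the scatter plot example.
x = np.array([43, 48, 56, 61, 67, 70], dtype=float)
y = np.array([128, 120, 135, 143, 141, 152], dtype=float)
n = len(x)

# Definition: r = s_xy / sqrt(s_xx * s_yy), built from deviations from the means.
s_xy = np.sum((x - x.mean()) * (y - y.mean()))
s_xx = np.sum((x - x.mean()) ** 2)
s_yy = np.sum((y - y.mean()) ** 2)
r_def = s_xy / np.sqrt(s_xx * s_yy)

# Computational form: uses raw sums only, no means needed.
num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
den = np.sqrt((n * np.sum(x ** 2) - np.sum(x) ** 2) *
              (n * np.sum(y ** 2) - np.sum(y) ** 2))
r_raw = num / den

print(r_def, r_raw, np.corrcoef(x, y)[0, 1])   # all three agree (about 0.897)
```

The computational form is convenient for hand calculation because it needs only the running sums of x, y, xy, x squared, and y squared.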

Correlation coefficient and scatter plot

• The range of the correlation coefficient is from -1 to 1.
  • If there is a strong positive linear relationship between the variables, the value of r will be close to 1.
  • If there is a strong negative linear relationship, the value of r will be close to -1.
  • When there is no linear relationship between the variables or only a weak one, the value of r will be close to 0

Strong correlation?

• A frequently asked question is: “when can it be said that there is a strong correlation between variables, and when is the correlation weak?”

• A reasonable rule of thumb is to say that the correlation is
  • weak if 0 ≤ |r| ≤ 0.5
  • strong if 0.8 ≤ |r| ≤ 1
  • moderate otherwise
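The rule of thumb translates into a small helper. The function name `correlation_strength` is hypothetical, and the thresholds are simply the ones listed above, not a universal standard.

```python
def correlation_strength(r: float) -> str:
    """Label r using the rule of thumb above (hypothetical helper)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")
    size = abs(r)            # the sign gives direction, not strength
    if size <= 0.5:
        return "weak"
    if size >= 0.8:
        return "strong"
    return "moderate"


for r in (0.3, -0.65, 0.897):
    print(r, correlation_strength(r))
```

Taking the absolute value first means a strong negative relationship is labelled the same way as a strong positive one.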

Correlation and shape

Remarks on correlation

• When the null hypothesis has been rejected for a specific value, any of the following five relationships between variables can exist

  1. There is a direct cause-and-effect relationship between the variables. That is, x causes y.

  2. There is a reverse cause-and-effect relationship between the variables. That is, y causes x.

  3. The relationship between the variables may be caused by a third variable

  4. There may be a complexity of interrelationships among many variables

  5. The relationship may be coincidental

Pandas: Correlation

import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix  # formerly pandas.tools.plotting

url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv"
df = pd.read_csv(url)
# Pairwise Pearson correlation between the selected columns
df[['mpg', 'disp', 'drat', 'wt']].corr(method='pearson', min_periods=1)

Pandas: Scatter Matrix

scatter_matrix(df[['mpg', 'disp', 'drat', 'wt']], alpha=0.2, figsize=(6, 6), diagonal='kde')
plt.show()

End of Lecture

Thank you
