Descriptive Statistics and Inferential Statistics

Shibin Liu, SAS Beijing R&D

Agenda

• 0. Lesson overview
• 1. Basic Statistical Concepts
• 2. Descriptive Statistics
• 3. Picturing your data
• 4. Confidence Intervals for the mean
• 5. Hypothesis Testing
• 6. Summary

Agenda

Lesson overview

Lesson overview

Descriptive Statistics

Inferential Statistics

Lesson overview

Populations and Samples

Parameters and Statistics

Variable Classifications

Scales of Measurement

Lesson overview

Point estimators

Distribution of sample means

Central limit theorem

Standard error of the mean

Confidence intervals

Lesson overview

Define null /alternative hypotheses

Specify significance level

Collect data

Reject or fail to reject the null hypothesis

Lesson overview

What do you want to examine?

The relationship between variables

The difference between groups on one or more variables

The location, spread, and shape of the data’s distribution

Summary statistics or graphics?

How many groups?

Which kind of variables?

SUMMARY STATISTICS

DISTRIBUTION ANALYSIS

LINEAR MODELS

CORRELATIONS

ONE-WAY FREQUENCIES & TABLE ANALYSIS

LINEAR REGRESSION

LOGISTIC REGRESSION

Summary statistics

Both Two

Two or more

Descriptive statistics, histogram, normal probability plots

Analysis of variance

Continuous only

Frequency tables, chi-square test

Categorical response variable

Lesson 1 Lesson 2 Lesson 3 & 4 Lesson 5

Agenda

Basic Statistical Concepts

Descriptive Statistics and Inferential Statistics

Variable Classifications

Scales of Measurement

Statistical Methods

• In this topic, you will learn to do the following:
– Distinguish between descriptive and inferential statistics
– Define populations and samples
– Distinguish between parameters and statistics
– Classify variables
– Explain other statistical concepts, including scale of measurement

314062

% percent

μ average

σ variability

describe

summarize??? !!!

Measures of location

Measures of variability

Frequency

Draw preliminary conclusions

Characterize central tendency

Inspect the spread and shape

Screen for unusual data values

Goals:

Extreme values?

Unusual shape?

Data value cluster?

Unique feature?

Error free?

reliability

precision

generalize

Inferential Statistics

Descriptive Statistics

Population= complete set of observations

Representative sample

Characteristics similar to population's characteristics

sample

Simple random sample

Each sample of a given size

Equal chance of selection

Convenience sampling

May not be representative. Why? For example:

a. the first n…
b. Male/female

Question 1.

A sample from a population should be

a) Random
b) Representative
c) Biased
d) All of the above

Answer: b

Question 2.

A medical researcher wants to verify whether patients at a hospital during a particular time frame experienced a positive effect. From the entire set of 5000 patients admitted to the hospital during that time, the researcher selects a sample of 200 patients. Match these items between the two groups:

All 5000 patients

The 200 selected patients

The first 200 patients admitted to the hospital during this time frame

200 patients whose patient ID numbers were selected at random from the 5000

a) Simple random sample

b) Convenience sample

c) Population

d) Sample

Parameters

Statistics

σρ μ

Greek letters

English letters

Parameters are numerical values that summarize characteristics of a population. They are usually unknown.

                     Population Parameters   Sample Statistics
Mean                 μ                       x̄ (sample mean)
Variance             σ²                      s² (sample variance)
Standard Deviation   σ                       s (sample standard deviation)

Uses of Variables: Independent and Dependent

Independent variable

Dependent variable

Response variable

Outcome variable

Control variable

Explanatory variable

Predictor variable

Independent variable: its values are changed as required

Dependent variable: its values change in response

Context 1. Control: fertilizer -> plant

Context 2. Dependent variable value is given: credit default or not

Variables Classification by Types: Quantitative and Categorical

Quantitative Categorical

Quantitative

Numerical

1245 0.36 206 31 8

Arithmetic operations

Discrete

Continuous

0 1 2 3 4 5 6

Discrete

Continuous

200, 201, 201.13, 199.904 (ml/bottle)

Categorical

Attribute data

No arithmetic operations

Nominal

Ordinal

No ordering

Nominal

Gender

Beverage

Qualitative

Classification

Ordinal

Degree: Low, moderate, severe

Question 3.

Match these items between the two groups:

a) Categorical (ordinal)
b) Quantitative (discrete)
c) Categorical (nominal)
d) Quantitative (continuous)

1. The number of students in a class
2. Hair color (black, brown, gray, red)
3. Income (low, medium, high)
4. The numbers on athletes’ team shirts
5. The height of trees in a forest
6. Name (Adam, Becky, Christina, Dave…)
7. The length of time for a light bulb to burn out
8. The number of phone calls a company receives

Answers: 1-b, 2-c, 3-a, 4-c, 5-d, 6-c, 7-d, 8-b

By Scale of measurement: Nominal Ordinal Interval Ratio

Scale of measurement?

Know the scale of measurement of each variable so that you can select a statistical procedure appropriate for that variable.

Nominal

Ordinal

Degree: Low, moderate, severe

Continuous

Interval

Low Medium High

1. Body temperature: no true 0 point

2. pH = 8 (sea) ≠ 2 × pH = 4 (tomato juice)

True 0 point

1. 2 kg apples; 0 kg apples

2. $0; $100 = 2 × $50

Question 4.

Match these items between the two groups:

a) Categorical (nominal)
b) Quantitative (interval)
c) Categorical (ordinal)
d) Quantitative (ratio)

ID
BodyTemp
Gender
HeartRate

Statistical Methods

Univariate

Bivariate

Multivariable

Multivariate

The selection of statistical methods depends on how many variables are involved.

Statistical Methods

Univariate

Range
Dispersion
Central tendency
Frequency

Statistical Methods

Bivariate

Correlation analysis

Chi-square tests of independence

Statistical Methods

Multiple linear regression

N-way ANOVA

Multivariable

Statistical Methods

Factor analysis

Clustering

Multivariate

Multivariate ANOVA

Multivariate linear regression

Scenario: exploring your data

A magnet school in Carver County wants to check whether the average SAT score is 1200.

Gender SATScore IDNumber

Male 1170 61469897

Female 1090 33081197

Male 1240 68137597

Female 1000 37070397

Male 1210 64608797

Female 970 60714297

Male 1020 16907997

Female 1490 9589297

Male 1200 93891897

Female 1260 85859397

80 students are selected randomly, and their scores are recorded into TestScores.

TestScores

Question 5.

a) Yes b) No

Is the data for these 80 students representative of the entire population of magnet school students in Carver County?

Question 6.

a) Categorical, quantitative, categorical
b) Categorical, quantitative, quantitative
c) Quantitative, quantitative, categorical
d) Quantitative, categorical, categorical

Select the correct type for the following variables: Gender SATScore IDNumber

Question 7.

a) Univariate
b) Bivariate
c) Multivariate
d) All of the above

What types of statistical methods can be applied to this data?

Agenda

In this topic, we will learn to do the following:

– Explain the basics of descriptive statistics
– Describe what distributions can tell you about your data
– Use the Summary Statistics task to produce descriptive statistics

Describing your data

Number of times each value appears

Outlier

Measures of Location

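Data (ordered): 93 89 88 84 83 82 79 78 78 77 74 73 72 68 68 68 35

Mode: the most common data value (68 here)

Median: the middle value when the data are ordered
– Odd number of values: the middle value
– Even number of values: the average of the two middle values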
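The measures of location can be checked with a short Python sketch. It is not part of the original deck (which uses SAS tasks); it assumes the run of digits on the slide is read as seventeen two-digit scores, and the name `scores` is illustrative only.

```python
from statistics import mean, median, mode

# Values from the Measures of Location slide, read as two-digit scores
# (an assumption about the extracted text). 35 is the unusually low value.
scores = [93, 89, 88, 84, 83, 82, 79, 78, 78, 77, 74, 73, 72, 68, 68, 68, 35]

print("mean  :", round(mean(scores), 2))  # pulled down by the low value 35
print("median:", median(scores))          # middle (9th) of 17 ordered values
print("mode  :", mode(scores))            # most frequent value
```

With an odd number of values the median is simply the middle ordered value, which is why it is less sensitive to the single low score than the mean.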

Mode=1

Median=1.5

1 3 10

Mean=3

Question 8.

a) Mean
b) Median
c) Either mean or median

For a symmetric distribution, which of the following statistics is a good measure for the middle of the data?

Answer: c

Percentile

Percentile = the value of a variable below which a given percentage of observations fall.
A percentile locates a position in your data rather than a given proportion of data values.
Quartiles break your data into quarters.

Data (ordered): 93 89 88 84 83 82 79 78 78 77 74 73 72 68 67 63

[Figure: the data shown on an axis running from 63 to 93]

25th percentile = 72.5 (first quartile, Q1, lower quartile)

50th percentile = 78 (median, Q2, middle quartile)

75th percentile = 83.5 (third quartile, Q3, upper quartile)
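Quartile conventions differ between software packages, so the following Python sketch is only one way to reproduce the three percentile values on the slides. It assumes the "empirical distribution function with averaging" rule, which I believe matches the SAS default but which the slides do not state; the function name `percentile` is hypothetical.

```python
def percentile(values, pct):
    """Percentile via the empirical distribution function with averaging.

    pct is a whole-number percentage (0-100). If n*pct/100 is an integer j,
    return the average of the j-th and (j+1)-th ordered values; otherwise
    return the ordered value just above that position.
    """
    xs = sorted(values)
    n = len(xs)
    j, rem = divmod(n * pct, 100)  # integer arithmetic avoids float error
    if j >= n:                     # pct = 100: the maximum
        return xs[-1]
    if rem == 0 and j > 0:
        return (xs[j - 1] + xs[j]) / 2
    return xs[j]

# The sixteen values from the Percentile slides, read as two-digit scores.
data = [93, 89, 88, 84, 83, 82, 79, 78, 78, 77, 74, 73, 72, 68, 67, 63]
for p in (25, 50, 75):
    print(p, percentile(data, p))
```

Under a different convention (for example, linear interpolation) the quartiles of the same data would come out slightly different, which is worth remembering when comparing output across tools.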

Measure of Variability

Measure: Definition

Range: The difference between the maximum and minimum data values

Interquartile Range: The difference between the 25th and 75th percentiles

Variance: A measure of dispersion of the data around the mean

Standard Deviation: A measure of dispersion expressed in the same units of measurement as your data (the square root of the variance)

Sample variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

Coefficient of variation (C.V.): $\frac{s}{\bar{x}} \times 100$
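As a rough illustration of the measures of variability, the sketch below computes each one in Python for the sixteen values used on the Percentile slides. The quartiles 72.5 and 83.5 are taken from those slides rather than recomputed, and the variable names are illustrative.

```python
from statistics import mean, stdev, variance

# The sixteen values from the Percentile slides, read as two-digit scores.
data = [93, 89, 88, 84, 83, 82, 79, 78, 78, 77, 74, 73, 72, 68, 67, 63]

data_range = max(data) - min(data)  # maximum minus minimum
iqr = 83.5 - 72.5                   # Q3 - Q1, quartiles as reported on the slides
s2 = variance(data)                 # sample variance (divisor n - 1)
s = stdev(data)                     # square root of the sample variance
cv = s / mean(data) * 100           # coefficient of variation, in percent

print(data_range, iqr, round(s2, 2), round(s, 2), round(cv, 2))
```

The standard deviation is in the same units as the data, while the coefficient of variation is unit-free, which makes it convenient for comparing spread across variables measured on different scales.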

Measure of Variability

Question 9.

a) True b) False

The IQR (interquartile range) is a measure of the spread of the middle 50% of your data. It is not as likely to be affected by outliers and is therefore more robust than the overall range.

Scenario

Check the SAT scores of XXX school (80 randomly selected students) using descriptive statistics

Is the average 1200?

Agenda

Picturing your data: Objectives

– Look at the distribution of continuous variables
– Describe the normal distribution
– Use the Distribution Analysis task to generate descriptive statistics, including histograms, box plots, and normal probability plots

Picturing your data: Normal Distributions

Characteristics of the normal distribution:
– Symmetric
– Defined by μ and σ
– Bell shaped
– Mean = median = mode

The bean machine, a device invented by Francis Galton, can be called the first generator of normal random variables. This machine consists of a vertical board with interleaved rows of pins. Small balls are dropped from the top and then bounce randomly left or right as they hit the pins. The balls are collected into bins at the bottom and settle down into a pattern resembling the Gaussian curve.

Planche de Galton

Standard Normal Distribution when: μ= 0; σ= 1

$$f(x; \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$
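The density formula can be evaluated directly. The following Python sketch (the function name `normal_pdf` is illustrative, not from the deck) implements it with the standard normal, μ = 0 and σ = 1, as the default.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and std. dev. sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# The curve is symmetric about mu and peaks there.
print(normal_pdf(0.0))                    # height of the standard normal at 0
print(normal_pdf(-1.0), normal_pdf(1.0))  # equal by symmetry
```

Changing `mu` shifts the curve and changing `sigma` stretches it, which is what "defined by μ and σ" means in practice.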

Picturing your data: Assessing normality

For a normal distribution, SAS reports skewness = 0 and kurtosis = 0, because SAS reports excess kurtosis (kurtosis minus 3). Some other software reports skewness = 0 and kurtosis = 3.

Statistical summary

Picturing your data: Skewness

Skewness > 0 (right-skewed): mean > median

Skewness < 0 (left-skewed): mean < median

Picturing your data: Kurtosis

Kurtosis measures the tendency of your data to be distributed toward the center or toward the tails of the distribution.

For a symmetric distribution (in SAS):

Kurtosis = 0: normal
Kurtosis < 0: platykurtic
Kurtosis > 0: leptokurtic
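For readers who want to compute these shape statistics outside SAS, here is a hedged Python sketch using the common bias-adjusted sample formulas. I am assuming these match what SAS reports by default; the slides do not give the formulas, and the function names are hypothetical. Kurtosis here is excess kurtosis, so a normal distribution scores about 0.

```python
from statistics import mean, stdev

def skewness(xs):
    """Bias-adjusted sample skewness (needs at least 3 values)."""
    n, m, s = len(xs), mean(xs), stdev(xs)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in xs)

def excess_kurtosis(xs):
    """Bias-adjusted sample excess kurtosis (needs at least 4 values)."""
    n, m, s = len(xs), mean(xs), stdev(xs)
    z4 = sum(((x - m) / s) ** 4 for x in xs)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]
print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed))
```

The evenly spaced sample is symmetric (skewness 0) and flatter than a normal curve (negative excess kurtosis), while the sample with one large value has positive skewness.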

Picturing your data

Question 10.

a) Heavy-tailed
b) Light-tailed
c) Right-skewed
d) Normal
e) Left-skewed

Now that you understand the many different ways your data could be distributed, match each example with its appropriate shape. The blue curve on each histogram represents the shape of the normal distribution with the mean and standard deviation estimated from the sample.

Picturing your data

Mean
Kurtosis
Median
Standard deviation
Variance
Mode
Skewness
IQR (interquartile range)

Question 11.

a) Location

b) Spread

c) Shape

Match each term to what it measures on the right by dragging and dropping, or by selecting the correct letter.

Picturing your data: Normal Probability Plot

A Normal Probability Plot is a visual method for determining whether your data comes from a distribution that is approximately normal.

actual data values

expected percentile from standard normal

Normal reference line

Picturing your data: Normal Probability Plot

1. Normal distribution (the observed data follow the reference line)
2. Skewed-to-the-right distribution
3. Skewed-to-the-left distribution
4. Light-tailed distribution
5. Heavy-tailed distribution

Picturing your data: Box Plot

Outliers: > 1.5 IQR from the box

Largest point <= 1.5 IQR from the box

The 75th percentile

The 50th percentile

The 25th percentile

Smallest point <= 1.5 IQR from the box

The Mean is denoted by a
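The box plot rules above can be sketched in Python as follows. This is an illustration only: it reuses the seventeen scores from the Measures of Location slide, and it relies on the default quartile convention of Python's `statistics.quantiles`, which may differ slightly from the quartiles SAS would report.

```python
from statistics import quantiles

# Scores from the Measures of Location slide, read as two-digit values.
scores = [93, 89, 88, 84, 83, 82, 79, 78, 78, 77, 74, 73, 72, 68, 68, 68, 35]

q1, q2, q3 = quantiles(scores, n=4)  # 25th, 50th, 75th percentiles
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr           # points below this are outliers
high_fence = q3 + 1.5 * iqr          # points above this are outliers

outliers = [x for x in scores if x < low_fence or x > high_fence]
inside = [x for x in scores if low_fence <= x <= high_fence]
whisker_low, whisker_high = min(inside), max(inside)

print("box:", q1, q2, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```

The whiskers stop at the most extreme points that are still within 1.5 IQR of the box, and anything beyond them is drawn as an individual outlier.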

Picturing your data

Question 12.

Which of the following are ways to check the normality of your data?

a) Check if the mean and median are equal
b) Check if the skewness and kurtosis are close to 0
c) Create histograms and other graphical tools to visually assess the data
d) All of the above

Answer: d

Picturing your data

Scenario: Assessing the normality

Check the SAT scores of XXX school (80 randomly selected students): is the average 1200?

Range, Frequency, Shape

Normal distribution Mean, Median

Skewness, Kurtosis

Outlier

Demo: Task > Describe > Distribution Analysis

Agenda

Confidence Intervals for the mean: Objectives

– Define the distribution of sample means and the central limit theorem

– Calculate and interpret the standard error of the mean and confidence intervals for the mean

– Use the Summary Statistics task to generate the standard error of the mean and the confidence intervals for the mean

Confidence Intervals for the mean

Point estimator

Parameters

Statistics

σρ μ

x̄ estimates μ

s estimates σ

Point estimator

Variability

=1.2 =1.3

How much the estimator varies from sample to sample?

Variability

Standard Error

Standard Deviation

Variability of a sample statistic

Variability of data

Variability

Standard Error of the mean

Variability of sample means

$s_{\bar{x}} = \frac{s}{\sqrt{n}}$

where s is the sample standard deviation and n is the sample size.

What happens if n is larger?
– The larger the sample, the smaller the standard error
– The smaller the standard error, the more precise the sample mean
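A minimal Python sketch of the standard error of the mean follows. It uses only the ten SAT scores displayed in the scenario table, not the full sample of 80, so the numbers are illustrative; the helper name `standard_error` is hypothetical.

```python
import math
from statistics import mean, stdev

def standard_error(s, n):
    """Standard error of the mean: sample standard deviation over sqrt(n)."""
    return s / math.sqrt(n)

# Only the ten rows shown in the TestScores table, not all 80 students.
sat = [1170, 1090, 1240, 1000, 1210, 970, 1020, 1490, 1200, 1260]

print("mean:", mean(sat))
print("standard error:", round(standard_error(stdev(sat), len(sat)), 2))

# Same spread, four times the sample size: the standard error halves.
print(standard_error(10, 100), standard_error(10, 400))
```

Because n appears under a square root, quadrupling the sample size is needed to cut the standard error in half.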

Variability

Distribution of sample means

The distribution of sample means is less variable than data

Confidence Interval

95% Confidence

A 95% confidence interval represents a range of values within which you are 95% certain the true population mean exists.
• One interpretation is that if 100 different samples were drawn from the same population and 100 intervals were calculated, approximately 95 of them would contain the population mean.

Confidence Interval

95% Confidence

A confidence interval
• Is a range of values that you believe to contain the population parameter of interest
• Is defined by an upper and lower bound around a sample statistic

To construct a confidence interval, a significance level must be chosen.

Question: Why not 99%?

Answer: The larger interval gives less useful information.

Confidence Interval for the mean

$s_{\bar{x}} = \frac{s}{\sqrt{n}}$

$\bar{x} \pm t \cdot s_{\bar{x}}$

where
– $\bar{x}$ is the sample mean
– t is the value corresponding to the confidence level and n-1 degrees of freedom, where n is the sample size
– $s_{\bar{x}}$ is the standard error of the mean

Student’s t-distribution (or simply the t-distribution) is a family of continuous probability distributions that arises when estimating the mean of a normally distributed population in situations where the sample size is small and population standard deviation is unknown.
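As a worked example of the interval formula, the Python sketch below plugs in the SAT figures that appear elsewhere in these slides (sample mean 1190.625, standard error 16.4416, n = 80). The critical value 1.9905 is my approximation of the two-sided 95% t value for 79 degrees of freedom, taken from a t table; it is an assumption, not a number given in the deck.

```python
x_bar = 1190.625  # sample mean of the 80 SAT scores (from the slides)
se = 16.4416      # standard error of the mean (from the slides)
n = 80            # sample size
t_crit = 1.9905   # approx. t for 95% confidence, n - 1 = 79 df (assumed)

margin = t_crit * se
lower, upper = x_bar - margin, x_bar + margin
print(round(lower, 2), round(upper, 2))
```

To two decimals this agrees with the interval (1157.90, 1223.35) quoted in the deck's questions. A larger n shrinks `se`, and a lower confidence level shrinks `t_crit`; either one narrows the interval.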

Question: how to decrease the width of the interval?

Answer:
– Way 1: increase the sample size n
– Way 2: decrease the confidence level

Confidence Intervals Sometimes Miss

Normality and the Central Limit Theorem

To satisfy the assumption of normality, you can do one of the following:
• Verify that the population distribution is approximately normal
• Apply the Central Limit Theorem

– The central limit theorem states that the distribution of sample means is approximately normal, regardless of the population’s shape, if the sample size is large enough.

– “Large enough” is usually about 30 observations. It is more if the data are heavily skewed, and fewer if the data are symmetric.

Normality and the Central Limit Theorem

[Figure panels: data from an exponential distribution; 1000 samples of size 5; 1000 samples of size 10; 1000 samples of size 30]
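The experiment behind these panels can be approximated with a small simulation. The Python sketch below is my own reconstruction, not the code used for the slides: it draws 1000 samples of each size from an exponential distribution with rate 1 (an assumed parameter) and shows how the spread of the sample means shrinks as the sample size grows.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is repeatable

def sample_means(size, reps=1000):
    """Means of `reps` samples of `size` draws from an exponential(1)."""
    return [mean(random.expovariate(1.0) for _ in range(size))
            for _ in range(reps)]

# Spread of the sample means for each sample size on the slide.
spread = {size: stdev(sample_means(size)) for size in (5, 10, 30)}
for size, sd in spread.items():
    print(f"n = {size:2d}: std. dev. of sample means = {sd:.3f}")
```

Plotting a histogram of each list returned by `sample_means` would show the shape moving from right-skewed toward a bell curve as the sample size increases.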

Question 13.

A 95% confidence interval for SAT scores is (1157.90, 1223.35). What can you conclude from this information?

a) The true average SAT score is significantly different from 1200.
b) The true average SAT score is not significantly different from 1200.
c) The true average SAT score is less than 1200.
d) None of the above. You cannot determine statistical significance from a confidence interval.

Answer: b

Confidence Intervals for the mean: Demo

Task > Summary Statistics > Statistics > Additional > Confidence Intervals of the mean, 95%

Agenda

Hypothesis Testing

Hypothesis Test

Parameters

Statistics

μ= μ0

Hypothesis Testing

– Design and conduct a hypothesis test
– Use the p-value to determine statistical significance
– Use the Distribution Analysis task to perform a statistical hypothesis test
– Perform a one-sample, two-sided t-test to determine if the population mean is significantly different from a known value

Hypothesis Testing: Decision making process

Use a decision rule to make a judgment. If the evidence contradicting the null hypothesis is

– Sufficiently strong to meet the burden of proof (significance level), then reject the null hypothesis

– Not strong enough to meet the burden of proof, then fail to reject the null hypothesis. Be aware that failing to prove guilt does not mean that the defendant is proven innocent. It could mean that the prosecuting attorney did not build a strong enough case to meet the burden of proof.

Define null and alternative hypotheses

Collect data

Hypothesis Testing: Coin example

Collect data

H0: null hypothesis

The coin is fair

Ha or H1 :alternative hypothesis

The coin is not fair

Collect data

Reject or fail to reject the null hypothesis

Amount of evidence needed to reject the null hypothesis

Significance level

𝛼=0.05

Collect data

Two heads, three tails

Question: is it possible to get 100 tails?

Collect data

Two heads, three tails

100 tails

H0: coin is fair

Not reject H0

Reject H0

Hypothesis Testing: Type of error and Power

                        ACTUAL
Decision                H0 is True      H0 is False
Fail to Reject Null     Correct         Type II Error
Reject Null             Type I Error    Correct

Probability of Type I Error = α (H0: coin is fair; conclude: coin is not fair)

Probability of Type II Error = β (an unfair coin is regarded as fair)

Probability of Correct Rejection = 1 − β = Power

Hypothesis Testing: Type of error and Power

Question 14.

A bank manager is concerned that the percentage of loans that contain errors has increased above the acceptable amount of 1%. A significance test is conducted to test his concern (H0: p <= .01, Ha: p > .01).

Match these items between left and right:

Type I Error

Type II Error

a) Probability the manager concludes that the rate is indeed above 1%, when in reality it is not.

b) Probability the manager concludes that there is no evidence that the rate is above 1%, when in reality it is above 1%.

Answer: Type I Error: a; Type II Error: b

Hypothesis Testing: Coin Experiment – Effect Size Influence

The effect size refers to the magnitude of the difference between the sampled population and the null hypothesis. In this example, the null hypothesis of a fair coin suggests 50% heads and 50% tails. If the coin flipped were actually weighted to give 55% heads, the effect size would be 5%.

Flip a coin 100 times and decide whether it is fair:

Result                 P-value
55 heads, 45 tails     .3682
40 heads, 60 tails     .0569
37 heads, 63 tails     .0120
15 heads, 85 tails     .0001

A P-value measures the probability, assuming the null hypothesis is true, of observing a value as extreme or more extreme than the one observed.
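The slides do not say how the coin P-values were computed. As a hedged illustration, the Python sketch below doubles the smaller tail of the exact binomial distribution under a fair coin, which appears to reproduce the slide values to rounding; the function name `two_sided_p` is hypothetical.

```python
from math import comb

def two_sided_p(heads, flips):
    """Two-sided exact binomial P-value for H0: the coin is fair.

    Doubles the smaller of the two tail probabilities (capped at 1).
    """
    total = 2 ** flips
    lower = sum(comb(flips, k) for k in range(0, heads + 1)) / total
    upper = sum(comb(flips, k) for k in range(heads, flips + 1)) / total
    return min(1.0, 2 * min(lower, upper))

for heads in (55, 40, 37, 15):
    print(f"{heads} heads of 100: P-value = {two_sided_p(heads, 100):.4f}")
```

The further the observed count is from 50 heads, the smaller the P-value, which is the effect-size influence described on the slide.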

Hypothesis Testing: Coin Experiment – Sample Size Influence

When the effect size is held constant, the P-value decreases as the sample size increases.

Flip a coin and get 40% heads, then decide whether it is fair:

Result                   P-value
4 heads, 6 tails         .7539
16 heads, 24 tails       .2682
40 heads, 60 tails       .0569
160 heads, 240 tails     <.0001

Hypothesis Testing: Statistical Hypothesis Test

Statistical Hypothesis Test

Collect data

H0: equality, H1: difference

Set α, 5% for example

Collect data

Reject: P-value < α
Fail to reject: P-value ≥ α

The value of α, the probability of Type I error, is specified by the experimenter before collecting data. The strength of the evidence is measured by the P-value. The P-value is calculated from the collected data.

Hypothesis Testing: Performing a Hypothesis Test

To test the null hypothesis H0: μ=μ0 , SAS software calculates the Student’s t statistic value:

The null hypothesis is rejected when the calculated value is more extreme (either positive or negative) than would be expected by chance if H0 were true.

For the test score example:

$t = \frac{\bar{x} - \mu_0}{s_{\bar{x}}}$

$t = \frac{1190.625 - 1200}{16.4416} = -0.5702$

Two sided test of a hypothesis: each tail takes 2.5% for a 5% α
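The test-score calculation can be retraced in a few lines of Python. The sample mean and standard error come from the slide; the critical value 1.9905 for 79 degrees of freedom at α = 0.05 (two-sided) is my approximation from a t table and is an assumption, since the slides read the P-value from SAS output instead.

```python
x_bar = 1190.625  # sample mean (from the slide)
mu0 = 1200        # hypothesized population mean under H0
se = 16.4416      # standard error of the mean (from the slide)

t = (x_bar - mu0) / se
t_crit = 1.9905   # approx. two-sided 5% critical value, 79 df (assumed)

reject = abs(t) > t_crit  # each tail holds 2.5% of the distribution
print(round(t, 4), "reject H0" if reject else "fail to reject H0")
```

Because |t| is well inside the critical value, the sample gives no evidence that the mean SAT score differs from 1200, consistent with the confidence interval containing 1200.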

Why a t-test?

a. Normal distribution
b. σ unknown
c. Small sample (< 30)

T-test 1 by Distribution Analysis:

T-test 2 by Task>ANOVA>T test:

Agenda

Homework: Exercise 1.

1.1 Calculating Basic Statistics in the Summary Statistics Task

The data in NormTemp comes from an article in the Journal of Statistics Education by Dr. Allen L. Shoemaker. The data questions the notion that the true mean body temperature is 98.6. There are 65 males and 65 females. There is also some question about whether mean body temperatures for women are the same as for men. The variables in the data set are as follows:

ID: Identification number
BodyTemp: Body temperature (degrees Fahrenheit)
Gender: Coded (Male, Female)
HeartRate: Heart rate

a. Open a new process and the NormTemp dataset. Use the Summary Statistics task to answer the next two questions.
b. What is the overall mean and standard deviation of BodyTemp in the sample?
c. What is the interquartile range of BodyTemp?

Homework: Exercise 2.

2.1 Producing Descriptive Statistics

Use the NormTemp dataset to answer the following:
a. What are the minimum, the maximum, the mean, and the standard deviation for BodyTemp? Does the variable appear to be normally distributed?
b. Create box plots for BodyTemp. Does the average BodyTemp seem to be 98.6 degrees?

2.2 Multiple Choice Poll

In the NormTemp dataset, the distribution of BodyTemp seemed:
c. Close to normal
d. Left skewed
e. Right skewed
f. To have high positive kurtosis
g. To have high negative kurtosis

BodyTemp: Minimum, Maximum, Mean, Standard deviation, Skewness, Kurtosis
Distribution: Normal Yes/No

Homework: Exercise 3.

3.1 Producing Confidence Intervals

a. Generate the 95% confidence interval for the mean of BodyTemp in the NormTemp dataset.
b. Is the assumption of normality met to produce a confidence interval for this data?
c. What are the bounds of the confidence interval?

3.2 Multiple Answer Poll

The distribution of sample means is approximately normal if which of the following are true?
d. The population is normal.
e. The sample size is “large enough.”
f. The sample standard deviation is small.

Homework: Exercise 4.

4.1 Performing a One-Sample t-Test

a. Perform a one-sample t-test to determine whether the mean of body temperatures (the variable BodyTemp in the NormTemp dataset) is truly 98.6.
b. What is the value of the t statistic and the corresponding p-value?
c. Produce a confidence interval plot of BodyTemp.
d. Do you reject or fail to reject the null hypothesis at the .05 level that the average temperature is 98.6 degrees?

4.2 Multiple Choice Poll

A 95% confidence interval for SAT scores is (1157.90, 1223.35). From this, what can you conclude, at alpha=0.05?
e. The true average SAT score is significantly different from 1200.
f. The true average SAT score is not significantly different from 1200.
g. The true average SAT score is less than 1200.
h. None of the above. You cannot determine statistical significance from confidence intervals.

Thank you!
