
Copyright 2014, Simplilearn, All rights reserved.


Lesson 4—Analyze

Lean Six Sigma Green Belt


Objectives

After completing this lesson, you will be able to:

● Explain the patterns of variation

● Describe the classes of distributions

● Discuss Multi-Vari studies and the causes

● Explain correlation and its types

● Discuss the various hypothesis tests

● Discuss the application of F-test, t-test, ANOVA, and Chi-squared


Analyze

Topic 1—Patterns of Variation


Classes of Distributions

The data obtained from the Measure phase exhibits a variety of distributions, depending on the data type and its source.

The methods used to describe the parameters for classes of distributions are:

Probability

● It is based on an assumed model of distribution.

● It is used to find the chance that a certain outcome or event will occur.

Statistics

● It uses the measured data to determine a model that describes the data.

Inferential Statistics

● It describes the population parameters based on the sample data, using a particular model.


Types of Distributions

The two types of distribution are as follows:

Discrete Distribution

● Binomial distribution

● Poisson distribution

Continuous Distribution

● Normal distribution

● Chi-square distribution

● t-distribution

● F-distribution



Discrete Probability Distribution

A discrete probability distribution is characterized by the probability mass function.

● It is important to be familiar with discrete distributions while dealing with discrete data.

● The two most useful discrete probability distributions are:

o Binomial distribution; and

o Poisson distribution.

● These distributions help in predicting the sample behavior that has been observed in a population.


Binomial Distribution

Binomial distribution is a probability distribution for discrete data.

Characteristics of Binomial Distribution

● Predicts sample behavior

● Describes the discrete data as a result of a particular process

● Used to deal with defective items

● Best suited when the sample size is less than thirty and less than ten percent of the population

$P(r) = {}^{n}C_{r} \, p^{r} \, (1 - p)^{n - r}$

where, P(r) = probability of exactly r successes out of a sample size of n

p = probability of success; r = number of successes desired; n = sample size


Binomial Distribution (contd.)

Some of the key calculations of binomial distribution are shown.

Term: Formula

Mean: $\mu = np$

Standard Deviation: $\sigma = \sqrt{np(1 - p)}$

where, n = sample size

p = probability of success

Sample factorial calculation:

5! = 5 ∗ 4 ∗ 3 ∗ 2 ∗ 1 = 120

4! = 4 ∗ 3 ∗ 2 ∗ 1 = 24


Calculating Binomial Distribution—Example

Q: Using the binomial distribution formula, find the probability of getting 5 heads in 8 coin tosses.

A:

● Tossing a coin has only two outcomes, Head or Tail.

● Outcomes are statistically independent.

Therefore,

p = probability of success = 0.5 (this remains fixed over time)

n = sample size = 8

r = number of successes desired = 5

$P(5) = {}^{8}C_{5} \times 0.5^{5} \times (1 - 0.5)^{8 - 5} = 56 \times 0.5^{8} \approx 0.2188 = 21.88\%$
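The coin-toss calculation can be checked with a few lines of Python. This is a minimal sketch; the function name `binomial_pmf` is an illustrative choice and is not part of the lesson.

```python
from math import comb


def binomial_pmf(r: int, n: int, p: float) -> float:
    """Probability of exactly r successes in n independent trials."""
    # nCr * p^r * (1 - p)^(n - r), the binomial formula from the lesson
    return comb(n, r) * p**r * (1 - p) ** (n - r)


# 5 heads in 8 tosses of a fair coin
p_five_heads = binomial_pmf(r=5, n=8, p=0.5)
print(f"P(5 heads in 8 tosses) = {p_five_heads:.4f}")
```

The same function can be reused for defective-item questions by passing the defect rate as `p`.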


Poisson Distribution

Poisson distribution is an application of the population knowledge to predict the sample behavior.

Characteristics of Poisson Distribution

● Describes the discrete data

● Deals with counts, which can take any non-negative integer value

● Used to analyze situations wherein the number of trials is large

● Used where the probability of success in each trial is very small

● Used for predicting the number of defects


The formula for the Poisson distribution is as follows:

Poisson Distribution—Formula

$P(x) = \dfrac{\lambda^{x} \, e^{-\lambda}}{x!}$

where, P(x) = probability of exactly x occurrences in a Poisson distribution

λ = mean number of occurrences during interval

x = number of occurrences desired

e = base of the natural logarithm (approximately 2.71828)

Mean of a Poisson Distribution (µ) = λ

Standard Deviation of a Poisson Distribution (σ) = $\sqrt{\lambda}$


Calculating Poisson Distribution—Example

Q: The past records of an accident-prone road junction show that the mean number of accidents per week at this junction is 5. Assume that the number of accidents follows a Poisson distribution and calculate the probability of zero, exactly one, and more than two accidents happening in a week.

A:

Assumption: the number of accidents follows a Poisson distribution.

Given: λ = 5 per week

Probability of zero accidents per week: $P(0) = \dfrac{5^{0} \, e^{-5}}{0!} \approx 0.0067$

Probability of exactly one accident per week: $P(1) = \dfrac{5^{1} \, e^{-5}}{1!} \approx 0.0337$

Probability of exactly two accidents per week: $P(2) = \dfrac{5^{2} \, e^{-5}}{2!} \approx 0.0842$

Probability of more than two accidents per week = 1 – [P(0) + P(1) + P(2)] = 1 – [0.0067 + 0.0337 + 0.0842]

≈ 0.875 = 87.5%
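The junction example can be reproduced numerically. The sketch below assumes only the Poisson formula given earlier; the helper name `poisson_pmf` is hypothetical.

```python
from math import exp, factorial


def poisson_pmf(x: int, lam: float) -> float:
    """Probability of exactly x occurrences when the mean count is lam."""
    return lam**x * exp(-lam) / factorial(x)


lam = 5  # mean number of accidents per week, from the example
p0, p1, p2 = (poisson_pmf(x, lam) for x in range(3))

# "More than two" is the complement of "zero, one, or two"
p_more_than_two = 1 - (p0 + p1 + p2)

print(f"P(0) = {p0:.4f}, P(1) = {p1:.4f}, P(2) = {p2:.4f}")
print(f"P(more than 2) = {p_more_than_two:.4f}")
```

Working with unrounded probabilities avoids the rounding drift that appears when P(0), P(1), and P(2) are rounded before they are summed.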


Continuous probability distribution is characterized by the probability density function.

● A variable is said to be continuous if the range of possible values falls along a continuum.

Example: Loudness of cheering at a ball game, weight of cookies in a package, length of a pen,

or the time required to assemble a car.

● These distributions help in predicting the sample behaviour observed in a population.

Continuous Probability Distribution



Normal Distribution

The Normal or Gaussian distribution is a continuous probability distribution, denoted as N(µ, σ).

● It has a higher frequency of values around the mean and fewer occurrences away from it.

● It is used as a first approximation to describe real-valued random variables that tend to cluster around a single mean value.

● It is a bell-shaped curve and is symmetrical.

● The total area under the normal curve is 1; that is, the probabilities of all values of x in the distribution add up to 1.

[Figure: Normal Distribution with Mean = 100 and Standard Deviation = 10]


Normal Distribution (contd.)

In a normal distribution, a standard Z variable is used to standardize comparisons of dispersion. The properties of the Z value are as follows:

● It is unique for each probability within the normal distribution.

● It helps in finding probabilities of data points anywhere within the distribution.

● It is dimensionless with no units like mm, liters, coulombs, etc.

$Z = \dfrac{Y - \mu}{\sigma}$

where, Z = number of standard deviations between Y and µ

Y = value of the data point of concern

µ = mean of the population

σ = standard deviation of the population


Calculating Normal Distribution—Example

Q: Suppose the time taken to resolve customer problems follows a normal distribution with a mean of 250 hours and a standard deviation of 23 hours. What is the probability that a problem resolution will take more than 300 hours?

A:

Given:

● Y = 300

● µ = 250

● σ = 23

Using the formula: $Z = \dfrac{300 - 250}{23} = 2.17$

● From a Normal Distribution Table, the area under the curve to the left of Z = 2.17 is about 0.985.

● Thus, the probability that a problem is resolved in less than 300 hours is 98.5%.

● The chance of a problem resolution taking more than 300 hours is 1.5%.
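The table lookup can be cross-checked in code. This sketch computes the standard normal cumulative probability from the error function in Python's standard library; the function name `normal_cdf` is an illustrative assumption.

```python
from math import erf, sqrt


def normal_cdf(y: float, mu: float, sigma: float) -> float:
    """P(Y <= y) for a normal distribution with mean mu and SD sigma."""
    z = (y - mu) / sigma  # standardize to a Z score
    return 0.5 * (1 + erf(z / sqrt(2)))


# Resolution time: mean 250 hours, SD 23 hours, threshold 300 hours
p_less_than_300 = normal_cdf(300, mu=250, sigma=23)
p_more_than_300 = 1 - p_less_than_300

print(f"P(less than 300 h) = {p_less_than_300:.4f}")
print(f"P(more than 300 h) = {p_more_than_300:.4f}")
```

Because the code uses the unrounded Z score (about 2.174 rather than 2.17), the result can differ from the table value in the fourth decimal place.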


Z-Table Usage

The total area under the curve is 1. For an actual value, first compute its Z score, and then use the Z-table to look up the corresponding probability.


Z-Table

This Z-table gives the cumulative probability that Z is less than or equal to a given positive value. To obtain the probability that Z lies between zero and that value, subtract 0.5.

This is the most commonly used normal distribution Z-table with the positive Z-scores.

Z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09

0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359

0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753

0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141

0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517

0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879

0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224

0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549

0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852

0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133

0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389

1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621

1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830

1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015


Using Z-Table—Example

Q: Find the value of P(Z less than 0).

A: There is no need for the table to find the probability that the variable Z takes a value of less than (or equal to) zero.

● First, the area under the curve is 1, and second, the curve is symmetrical about Z = 0.

● Hence, there is a 0.5 (or 50%) chance that Z is above 0 and a 0.5 (or 50%) chance that Z is below 0.


Using Z-Table—Example (contd.)

Q: Find the value of P(Z greater than 1.12).

A: The opposite or complement of an event A occurring is the event A not occurring.

P(not A) = 1 – P(A)

P(Z greater than 1.12) = 1 – P(Z less than 1.12)

Using the table:

P(Z < 1.12) = 0.8686 (equivalently, 0.5 + P(0 < Z < 1.12) = 0.5 + 0.3686)

Hence P(Z > 1.12) = 1 – 0.8686 = 0.1314


Using Z-Table—Example (contd.)

Q: Find the value of P(Z lies between 0 and 1.12).

A:

● Z falls within an interval.

Using the table:

P(Z lies between 0 and 1.12) = P(Z < 1.12) – P(Z < 0) = 0.8686 – 0.5 = 0.3686
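The three Z-table examples above can be verified without a printed table. The sketch below is one way to do it, using the error function from the standard library; `phi` is a hypothetical helper name for the standard normal cumulative probability.

```python
from math import erf, sqrt


def phi(z: float) -> float:
    """Cumulative probability P(Z <= z) for the standard normal curve."""
    return 0.5 * (1 + erf(z / sqrt(2)))


p_below_zero = phi(0.0)             # symmetry: half the area lies below 0
p_above_1_12 = 1 - phi(1.12)        # complement rule
p_between = phi(1.12) - phi(0.0)    # interval = difference of two areas

print(f"P(Z < 0)        = {p_below_zero:.4f}")
print(f"P(Z > 1.12)     = {p_above_1_12:.4f}")
print(f"P(0 < Z < 1.12) = {p_between:.4f}")
```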


Chi-Square Distribution

The Chi-square distribution (chi-squared or χ² distribution) with k degrees of freedom is the distribution of the sum of the squares of k independent standard normal random variables.

Characteristics of χ² Distribution

● One of the most widely used probability distributions in inferential statistics

● The distribution is used in hypothesis tests

Note: When the statistic is computed from a sample, the degrees of freedom (df) = n – 1, where n is the sample size.


The formula for the Chi-square distribution is as follows:

Chi-Square Distribution—Formula

$\chi^{2}_{calculated} = \sum \dfrac{(f_o - f_e)^{2}}{f_e}$

where, $\chi^{2}_{calculated}$ = chi-square index

fo = observed frequency

fe = expected frequency

Note: Chi-square distribution will be covered in detail in the later part of this lesson.


t-Distribution

A t-distribution is most appropriate when:

● the sample size is less than 30;

● the population standard deviation is not known; and

● the population is approximately normal.

Note: The t-distribution approaches the normal distribution as the sample size increases.


F-Distribution

The F-distribution is the distribution of the ratio of two Chi-square variables, each divided by its degrees of freedom. A specific F-distribution is denoted by the degrees of freedom for the numerator Chi-square and the degrees of freedom for the denominator Chi-square.

$F_{calculated} = \dfrac{S_1^{2}}{S_2^{2}}$

where, S1 and S2 = standard deviations of the two samples

● If Fcalculated is 1, there is no difference in the variances.

● The larger variance goes in the numerator; that is, if S1 > S2, then S1² is the numerator (df1 = n1 – 1 and df2 = n2 – 1).

Note: Refer to the F-table to find the critical F value at α and the degrees of freedom of the samples of the two processes (df1 and df2).


Analyze

Topic 2—Exploratory Data Analysis


Multi-Vari Studies

Multi-Vari studies analyze variation, investigate process stability, identify investigation areas, and break down the variation.

They classify variation sources into three major types:

Positional

Variations within a single unit where variation is due to location.

Examples: Pallet stacking in a truck, temperature gradient in an oven, variation observed from cavity-to-cavity within a mold, region of a country, line on invoice

Cyclical

Variations among sequential repetitions over a short time.

Examples: Every n’th pallet broken, batch-to-batch variation, lot-to-lot variation, invoices received day-to-day, and account activity week-to-week

Temporal

Variations which occur over longer periods of time.

Examples: Process drift, performance before and after breaks, seasonal and shift based differences, month-to-month closings, and quarterly returns


Create Multi-Vari Chart

The five steps to create a Multi-Vari chart are:

1. Select Process and Characteristics. Example: Select the process where the plate is being manufactured and measure its thickness within a specified range.

2. Decide Sample Size. Example: Sample size is five pieces from each equipment and the frequency of data collection is every two hours.

3. Create a Tabulation Sheet. Example: The tabulation sheet with data records contains the columns with time, equipment number, and thickness as headers.

4. Plot the Chart. Example: Chart is plotted with time on X axis and the plate thickness on Y axis.

5. Link the Observed Values. Example: The observed values are linked by appropriate lines.


Create Multi-Vari Chart (contd.)

The path to create a Multi-Vari chart in Minitab is:

Minitab > Stat > Quality Tools > Multi-Vari Chart


Simple Linear Correlation

Correlation is the association between variables. The correlation coefficient shows the strength of the relationship between Y and X.

The strength of the linear relationship is denoted by the correlation coefficient ‘r’, also known as Pearson’s Coefficient of Correlation.

● r = +1: Movement in both variables is same

● r = 0: No correlation between the two variables

● r = -1: Movement in both variables is inverse

Note: The higher the absolute value of ‘r’, the stronger the correlation between Y and X. An ‘r’ value of > +0.85 or < -0.85 indicates a strong correlation.


Correlation Levels

Correlation measures the linear association between the output (Y) and the input variable (X).

The patterns of correlation displayed in scatter plots are:

● easy to see when the ‘r’ value is 0.9 and above; and

● difficult to see when the ‘r’ value is 0.5 or below.
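To make the ‘r’ value concrete, the sketch below computes Pearson's coefficient by hand for the fertilizer and sales figures used in the SLR example of this lesson. The function name `pearson_r` is an illustrative assumption.

```python
from math import sqrt

# Figures from the SLR example in this lesson (2009 to 2014)
fertilizer = [2, 3, 5, 4, 11, 5]   # fertilizer expenses in $
sales = [20, 25, 34, 3, 40, 31]    # annual sales in $


def pearson_r(x, y):
    """Pearson's coefficient of correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r = pearson_r(fertilizer, sales)
print(f"r = {r:.4f}")
```

An absolute value below 0.85 falls short of the lesson's threshold for a strong correlation.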


Regression

The degree to which one variable changes with another is calculated using regression.

If a high percentage of variability in Y (r² > 70%) is explained by changes in X, the model can be used to write a transfer equation as follows:

Y = f(X)

This equation is used to:

● predict future values of Y given X, and X given Y; and

● regress Y on one or more X’s simultaneously.

Note: Simple Linear Regression is for one X and Multiple Linear Regression is for more than one X.


Key Concepts of Regression

There are two key concepts of regression:

Transfer function to control Y

Y = f(X) may not be the correct transfer function to control Y because there may be a low level of correlation between the two variables.

Vital X

It is important to discover whether a statistically significant relationship exists between Y and a particular X by looking at p-values. Based on regression, one can infer the vital X and eliminate the rest.

Note: It is important to understand if there is statistical relevance between Y and X using the metrics from Regression Analysis. Simple Linear Regression should be used as a statistical validation tool.


Simple Linear Regression (SLR)

A simple linear regression equation is a fitted linear equation between Y and X. It is represented as follows:

Y = A + BX ± C

where,

● Y = Dependent variable / output / response

● X = Independent variable / input / predictor

● A = Intercept of fitted line on Y axis

● B = Regression coefficient / Slope of the fitted line

● C = Error in the model


Least Squares Method in SLR

If the relationship between Y and X is not perfectly linear (that is, r is not ±1), several lines could fit the scatter plot. The following points apply:

● Minitab fits the line which has the least Sum of Squares of Error (SSE).

● In a perfectly linear relationship, the points would lie on the line. Typically, the data lies off the line.

● The distance from the point to the line is the error distance used in the SSE calculations.


A farmer wishes to predict the relationship between the amount spent on fertilizers and the annual sales of his crops. He collects the following data for the last few years and wants to determine his expected revenue if he spends $8 annually on fertilizer.

Year Fertilizer Expenses in $ (X) Annual Sales in $ (Y)

2009 2 20

2010 3 25

2011 5 34

2012 4 3

2013 11 40

2014 5 31

SLR—Example

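The farmer's question can also be answered outside a spreadsheet. The following is a minimal least-squares sketch in plain Python, assuming fertilizer expense is the input (X) and annual sales is the output (Y); the function name `fit_line` is hypothetical.

```python
# Data from the farmer example: X = fertilizer expenses, Y = annual sales
fertilizer = [2, 3, 5, 4, 11, 5]
sales = [20, 25, 34, 3, 40, 31]


def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x; also returns r squared."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy**2 / (sxx * syy)
    return slope, intercept, r_squared


slope, intercept, r_squared = fit_line(fertilizer, sales)

# Expected sales if $8 is spent on fertilizer
predicted_sales = intercept + slope * 8

print(f"Y = {intercept:.2f} + {slope:.2f} X, r squared = {r_squared:.4f}")
print(f"Predicted sales at X = 8: {predicted_sales:.2f}")
```

Because r squared is low for this data, the prediction should be treated with caution.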


SLR using Microsoft Excel

The steps to perform Simple Linear Regression in Microsoft Excel are as follows:

1. Copy the data from the cells B1 to C6 on an Excel Worksheet.

2. Click Insert, and choose the Plain Scatter Chart (Scatter with only Markers).

3. Right-click on the data points and choose “Add Trendline.”

4. Choose “Linear” and select the boxes titled, “Display R-Squared value” and “Display equation.”


Regression Analysis Using Microsoft Excel

To use the data for Regression analysis, the interpretation of the scatter chart is as follows:

● The r² value (Coefficient of Determination) conveys whether the model is good and can be used. The r² value is 0.3797.

● 38% of variability in Y is explained by X.

● The remaining 62% variation is due to residual factors.

● The low value of r² indicates a poor relationship between Y and X.

Note: Refer to the Cause and Effect Matrix and study the relationship between Y and a different X variable.


Multiple Linear Regression

If a new variable X2 is added to the regression model, the combined impact of X1 and X2 on Y gets tested. This is known as Multiple Linear Regression. In Multiple Linear Regression:

● the value of r² changes due to the introduction of the new variable;

● the value of r² corrected for the number of X’s in the model is known as ‘r² Adjusted’; and

● the model can be used if the ‘r² Adjusted’ value is greater than 70%.


Key Concepts of Multiple Linear Regression

Multiple Linear Regression covers the following concepts:

● The residuals between the actual value and the predicted value give an indication of how good the model is.

● If the errors are small and predictions use X’s that are within the range of the collected data, the predictions should be fine.

SST = SSR + SSE

SSR = SST - SSE

r² = SSR ÷ SST

where, SST = total sum of squares, SSR = sum of squares explained by the regression, and SSE = sum of squares of error

● To check for error, take two observations of Y at the same X.

● Prioritization of X’s can be done through the SLR equation; run separate regressions on Y with each X.

● If an X does not explain variation in Y, it should not be explored further.
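The sum-of-squares identity can be illustrated with the single-X farmer data from the SLR example; the same decomposition applies when more X's are added. This is a sketch under that assumption, with variable names chosen for illustration.

```python
# Farmer data from the SLR example: X = fertilizer expenses, Y = annual sales
x = [2, 3, 5, 4, 11, 5]
y = [20, 25, 34, 3, 40, 31]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least-squares slope and intercept
slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum(
    (a - mean_x) ** 2 for a in x
)
intercept = mean_y - slope * mean_x
fitted = [intercept + slope * a for a in x]

sst = sum((b - mean_y) ** 2 for b in y)              # total variation in Y
sse = sum((b - f) ** 2 for b, f in zip(y, fitted))   # residual (error) variation
ssr = sum((f - mean_y) ** 2 for f in fitted)         # variation explained by X

r_squared = ssr / sst
print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
print(f"r squared = {r_squared:.4f}")
```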


Difference between Correlation and Causation

A regression equation may denote a relationship between variables. It does not indicate:

● whether a change in one variable causes a change in the other; or

● whether both variables depend on another independent variable.

Example: There is a positive correlation between the number of sneezes and the deaths in the city. It cannot be assumed that sneezing is the cause of death even though the correlation is very strong.


Analyze

Topic 3—Hypothesis Testing


Statistical and Practical Significance of Hypothesis Test

The difference between a variable and its hypothesized value may be statistically significant but may not be practically or economically meaningful.

For example: Based on a hypothesis test, Nutri Worldwide Inc. implemented a trading strategy. The returns:

● are economically significant when logical reasons are examined before implementation;

● may not be significant when a statistically proven strategy is implemented directly; and

● may be economically insignificant due to taxes, transaction costs, and risks.


Null Hypothesis vs. Alternate Hypothesis

The conceptual differences between a null and an alternate hypothesis are as follows:

Null Hypothesis

● Represented as H0

● Cannot be proved, only rejected

● Example: Movie is good

Alternate Hypothesis

● Represented as Ha

● Challenges the null hypothesis

● Example: Movie is not good

Note: If the null hypothesis is rejected, the conclusion is drawn in favor of the alternate hypothesis.


Type I and Type II Error

The conceptual differences between Type I and Type II errors are as follows:

Type I Error

● Rejecting a null hypothesis when it is true

● Also known as Producer’s Risk

● ‘α’ is the chance of committing a Type I error

● The value of ‘α’ is typically set at 0.05 or 5%

● Example: When a movie is good, it is reviewed as ‘not good.’

Type II Error

● Accepting (failing to reject) a null hypothesis when it is false

● Also known as Consumer’s Risk

● ‘β’ is the chance of committing a Type II error

● The value of ‘β’ is typically set at 0.2 or 20%

● Any experiment should have as low a β value as possible

● Example: When a movie is not good, it is reviewed as ‘good.’


Important Points to Remember about Type I and Type II Errors

While dealing with Type I or Type II errors, the following points should be remembered:

● Reducing the probability of making one type of error increases the probability of making the other type of error.

● If a true null hypothesis is erroneously rejected, it is a Type I error; if a false null hypothesis is accepted, it is a Type II error.

● ‘α’ is set at 0.05, which means the risk of committing a Type I error is 1 out of 20 experiments.

● It is important to decide which type of error should be less likely and set ‘α’ and ‘β’ accordingly.


Power of Test

The power of a test:

● is the probability of correctly rejecting the null hypothesis when it is false;

● is represented as 1-β, where β is the probability of a Type II error;

● is the probability of not committing a Type II error;

● helps in improving the advantage of hypothesis testing; and

● should be as high as possible; when given a choice of tests, the one with the highest power should be preferred.

Note: In hypothesis testing, ‘α’ is the significance level and ‘1-α’ is the confidence level.


The sample size is determined by the responses to the following questions:

● How much variation is present in the population? (σ)

● Within what interval does the true population mean need to be estimated? (±Δ)

● How much representation error is allowed in the sample? (α)

The sample size for continuous data can be determined by the formula:

Determinants of Sample Size—Continuous Data

$n = \left( \dfrac{Z_{1-\alpha/2} \, \sigma}{\Delta} \right)^{2}$

where, for α = 0.05, 1 – (α/2) = 0.975


To calculate the standard sample size for continuous data, the value of α is taken as 5%. According to the Z-table, Z0.975 = 1.96. The standardized sample size formula is:

Standard Sample Size Formula—Continuous Data

$n = \left( \dfrac{1.96 \, \sigma}{\Delta} \right)^{2}$ for Continuous Data

Q: The population standard deviation for the time to resolve customer problems is 30 hours. What should be the size of a sample that can estimate the average problem resolution time within ± 5 hours tolerance with 99% confidence?

A: Δ = 5, σ = 30, α = 0.01, and Z0.995 = 2.575.

Sample size = [(2.575 ∗ 30)/5]² = 238.70, which is rounded up to 239.
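The sample size calculation for continuous data can be sketched in Python. The Z value is taken from `statistics.NormalDist` in the standard library rather than a printed table, so it is slightly more precise than 2.575; the function name `sample_size_continuous` is an illustrative assumption.

```python
from math import ceil
from statistics import NormalDist


def sample_size_continuous(sigma: float, delta: float, alpha: float) -> int:
    """Smallest n that estimates a mean within +/- delta at confidence 1 - alpha."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided Z value
    return ceil((z * sigma / delta) ** 2)    # always round up to a whole unit


# Resolution-time example: sigma = 30 hours, tolerance 5 hours, 99% confidence
n_required = sample_size_continuous(sigma=30, delta=5, alpha=0.01)
print(f"Required sample size: {n_required}")
```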


To calculate the standard sample size for discrete data, if the average population proportion non-defective is ‘p’, then the population standard deviation can be calculated as:

Standard Sample Size Formula—Discrete Data

$\sigma = \sqrt{p(1 - p)}$

$n = \left( \dfrac{1.96}{\Delta} \right)^{2} p(1 - p)$ for Discrete Data

where, Δ = tolerance allowed on either side of the population proportion average in %

Q: The non-defective population proportion for pen manufacturing is 80%. What should be the sample size to draw a sample that can estimate the proportion of compliant pens within ± 5% with an alpha of 5%?

A: Δ = 0.05, σ² = 0.8 ∗ (1 – 0.8), α = 0.05, and Z0.975 = 1.96.

Sample size = (1.96/0.05)² ∗ 0.8 ∗ 0.2 = 245.86, which is rounded up to 246.
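The discrete-data version follows the same pattern. This sketch again takes the Z value from the standard library; the function name `sample_size_proportion` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_proportion(p: float, delta: float, alpha: float) -> int:
    """Smallest n that estimates a proportion within +/- delta at confidence 1 - alpha."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # p * (1 - p) is the variance of a single pass/fail observation
    return ceil((z / delta) ** 2 * p * (1 - p))


# Pen example: 80% non-defective, tolerance of 5 percentage points, alpha = 5%
n_pens = sample_size_proportion(p=0.8, delta=0.05, alpha=0.05)
print(f"Required sample size: {n_pens}")
```

Since p(1 - p) is largest at p = 0.5, using p = 0.5 gives a conservative sample size when the true proportion is unknown.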


Hypothesis Testing Roadmap

The roadmap helps in concluding the type of test one should perform based on the kind of data and values available:

[Figure: Hypothesis testing roadmap. Discrete data leads to the χ²-test. Continuous data branches into comparison of two and comparison of many, each split by mean and variance; the tests shown are the F-test, the t-test (σ unknown), and the Z-test (σ known).]


H0: Average height of North American males is 165 cm (μ0)

Ha: Average height of North American males ≠ 165 cm

H0: μ = μ0 against Ha: μ ≠ μ0

Sample size (n) = 117 (Z-test) and sample size (n) = 25 (t-test); sample average ($\bar{X}$) = 164.5 cm

Hypothesis Test for Means (Theoretical)—Example

Z-test (σ known)

The population SD is known; σ = 5.2

Compute $z = (\bar{X} - \mu_0) / \sqrt{\sigma^{2}/n} = (164.5 - 165) / \sqrt{5.2^{2}/117} = -1.04$

Reject H0 at level of significance α if |z| > zα/2

Since z0.025 = 1.96 and |z| = 1.04, the null hypothesis is not rejected at 5% level of significance. Thus, based on the sample collected, there is not enough evidence to conclude that the average height of North American males differs from 165 cm.

t-test (σ unknown)

The population SD is unknown; however, it is estimated from the sample SD; s = 5.0

Compute $t = (\bar{X} - \mu_0) / \sqrt{s^{2}/n} = (164.5 - 165) / \sqrt{5^{2}/25} = -0.5$

Reject H0 at level of significance α if |t| > tn-1, α/2

Since t24, 0.025 = 2.064 and |t| = 0.5, the null hypothesis is not rejected at 5% level of significance. Thus, based on the sample collected, there is not enough evidence to conclude that the average height of North American males differs from 165 cm.
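Both test statistics in the height example can be recomputed with a short script. This is a sketch: the Z critical value comes from the standard library, while the t critical value of 2.064 is simply the table value quoted in the lesson, since the standard library has no t-distribution.

```python
from math import sqrt
from statistics import NormalDist

mu0 = 165.0    # hypothesized mean height in cm
x_bar = 164.5  # sample average in cm

# Z-test: population SD known (sigma = 5.2, n = 117)
z = (x_bar - mu0) / (5.2 / sqrt(117))
z_critical = NormalDist().inv_cdf(0.975)  # two-tailed, alpha = 0.05
reject_with_z = abs(z) > z_critical

# t-test: population SD estimated from the sample (s = 5.0, n = 25)
t = (x_bar - mu0) / (5.0 / sqrt(25))
t_critical = 2.064  # t(24, 0.025), as quoted in the lesson
reject_with_t = abs(t) > t_critical

print(f"z = {z:.2f}, reject H0: {reject_with_z}")
print(f"t = {t:.2f}, reject H0: {reject_with_t}")
```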


The Chi-square test is used in hypothesis tests for variance and, as in the example below, in tests of independence between categorical variables:

Hypothesis Test for Variance—Example

H0: Proportion of wins in Australia or abroad is independent of the country played against

Ha: Proportion of wins in Australia or abroad is dependent on the country played against

χ² Critical = 6.251 and χ² Calculated = 1.36

Result: Since the calculated value is less than the critical value, the null hypothesis is not rejected. The proportion of wins of the Australian hockey team is independent of the country played against.


The hypothesis test on population proportion can be performed as follows:

Hypothesis Test for Proportions—Example

H0: Proportion of smokers among males in a place named R is 0.10 (p0)

Ha: Proportion of smokers among males in R is different than 0.10

H0: p = p0 against Ha: p ≠ p0

Among n = 150 adult males interviewed, 23 were found to be smokers.

Sample proportion $\hat{p}$ = 23/150 = 0.1533

Compute test statistic: $z = (\hat{p} - p_0) / \sqrt{p_0 (1 - p_0)/n} = (0.1533 - 0.10) / \sqrt{0.10 \times 0.90 / 150} = 2.18$

Reject H0 at level of significance α if |z| > zα/2

Since z0.025 = 1.96 and z = 2.18 > 1.96, the null hypothesis is rejected at 5% level of significance in favor of the alternative.

Result: It can be concluded that the proportion of smokers in R differs from 0.10; the sample proportion indicates that it is greater than 0.10.
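The proportion test can be reproduced as follows. This is a sketch that assumes the usual one-sample Z statistic for a proportion, with the standard error computed under the null hypothesis; variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p0 = 0.10      # hypothesized proportion of smokers
n = 150        # adult males interviewed
smokers = 23   # smokers found in the sample

p_hat = smokers / n

# Standard error under H0 uses the hypothesized proportion p0
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
z_critical = NormalDist().inv_cdf(0.975)  # two-tailed, alpha = 0.05
reject_h0 = abs(z) > z_critical

print(f"p_hat = {p_hat:.4f}, z = {z:.2f}, reject H0: {reject_h0}")
```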


Comparison of Means of Two Processes

Means of two processes are compared to:

● understand whether there is a significant difference in the outcomes of the two processes;

● understand whether a new process is better than an old process;

● understand whether the two samples belong to the same population or to different populations; and

● benchmark the existing process with another process.


Paired Comparison Hypothesis Test for Means (Theoretical)

The two-mean t-test with unequal variances is carried out as follows:

● H0: μ1 = μ2 against Ha: μ1 ≠ μ2

● Two samples of sizes n1 = 125 and n2 = 110 are taken from the two populations

● $\bar{X}_1$ = 167.3, $\bar{X}_2$ = 165.8, s1 = 4.2, s2 = 5.0 are the sample means and SDs respectively

● Compute test statistic: $t = (\bar{X}_1 - \bar{X}_2) / \sqrt{s_1^{2}/n_1 + s_2^{2}/n_2} = 1.5 / 0.607 = 2.47$

● Reject H0 at level of significance α if |Computed t| > tDF,α/2

● Since t223, 0.025 = 1.96 and |t| = 2.47 > 1.96, the null hypothesis is rejected at 5% level of significance
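The two-mean test statistic can be computed directly from the summary figures. This sketch assumes the unequal-variance form of the statistic; the critical value of 1.96 is the one quoted in the lesson.

```python
from math import sqrt

# Summary statistics of the two samples, as given in the lesson
n1, mean1, sd1 = 125, 167.3, 4.2
n2, mean2, sd2 = 110, 165.8, 5.0

# Standard error of the difference when variances are not assumed equal
standard_error = sqrt(sd1**2 / n1 + sd2**2 / n2)
t_stat = (mean1 - mean2) / standard_error

t_critical = 1.96  # critical value quoted in the lesson
reject_h0 = abs(t_stat) > t_critical

print(f"t = {t_stat:.2f}, reject H0: {reject_h0}")
```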


Paired Comparison Hypothesis Test for Variance—F-Test Example

Q: Susan is examining the earnings of two companies. According to her, the earnings of Company A are more volatile than those of Company B. She has obtained earnings data for the past 31 years for Company A, and for the past 41 years for Company B. She finds that the sample standard deviation of Company A’s earnings is $4.40 and of Company B’s earnings is $3.90. Determine whether the earnings of Company A have a greater standard deviation than those of Company B at 5% level of significance.

A:

H0: σA² = σB² (the variance of Company A’s earnings is equal to the variance of Company B’s earnings)

Ha: σA² > σB² (the variance of Company A’s earnings is greater)

σA² = variance of Company A’s earnings.

σB² = variance of Company B’s earnings.

Note: σA > σB. In calculating the F-test statistic, always put the greater variance in the numerator.


The degrees of freedom for company A and company B are:

● dfA (degrees of freedom of A) = 31 – 1 = 30

● dfB (degrees of freedom of B) = 41 – 1 = 40

The critical value from F-table equals 1.74. The null hypothesis is rejected if the F-test statistic is

greater than 1.74.

Calculation of F-test statistic:

Results: The F-test statistic (1.273) is not greater than the critical value (1.74). Therefore, at 5%

significance level, the null hypothesis cannot be rejected.

Hypothesis Test for Equality of Variance—F-Test Example

$F = \dfrac{S_A^{2}}{S_B^{2}} = \dfrac{4.40^{2}}{3.90^{2}} = 1.273$
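The F-test statistic for the two companies is a one-line calculation. In this sketch the critical value of 1.74 is the F-table value quoted in the lesson for 30 and 40 degrees of freedom.

```python
# Sample standard deviations of earnings, as given in the example
sd_a, n_a = 4.40, 31  # Company A
sd_b, n_b = 3.90, 41  # Company B

df_a, df_b = n_a - 1, n_b - 1

# The larger variance goes in the numerator
f_stat = sd_a**2 / sd_b**2

f_critical = 1.74  # F-table value at 5% for (30, 40) df, as quoted in the lesson
reject_h0 = f_stat > f_critical

print(f"df = ({df_a}, {df_b}), F = {f_stat:.3f}, reject H0: {reject_h0}")
```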


Group A (Chef 1) Group B (Chef 2)

4.2 4

4.5 4.5

7.2 5

6.1 5.2

8.9 5.3

5.2 6.1

Hypothesis Tests—F-Test for Independent Groups

A restaurant wanting to explore the recent overuse of avocados suspects there is a difference between its two chefs in the quantity of avocados used to prepare the salads. The table shows the quantity of avocados used, in ounces.


F-Test

The steps for conducting F-Test in MS-Excel are:

1. Click Data Analysis under Data tab.

2. Select F-Test Two-Sample for Variances.

3. In Variable 1 and 2 range, select the right data set.

4. Click Ok.


F-Test Assumptions

Before interpreting the F-test, the hypotheses to be considered are as follows:

● Null Hypothesis: There is no significant statistical difference between the variances of the two groups, thus concluding any variation could be because of chance. This is Common Cause of Variation.

● Alternate Hypothesis: There is a significant statistical difference between the variances of the two groups, thus concluding variations could be because of assignable causes also. This is Special Cause of Variation.


F-Test Interpretations

The interpretations for the conducted F-test are as follows:

● From the Excel result sheet, the p-value is 0.03.

● Since the p-value is < 0.05, the null hypothesis is rejected.

● The null hypothesis is rejected with about 97% confidence.

● The claim that variation could only be due to Common Cause of Variation is rejected.

● There could be Assignable Causes of Variation or Special Causes of Variation.

F-Test Two-Sample for Variances

                      Variable 1    Variable 2

Mean                  6.016666667   5.016666667

Variance              3.197666667   0.517666667

Observations          6             6

df                    5             5

F                     6.177076626

P(F<=f) one-tail      0.033652302

F Critical one-tail   5.050329058
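The means, variances, and F value in the Excel output can be reproduced with Python's standard library. This sketch does not compute the p-value, which needs the F-distribution; it compares the statistic with the one-tail critical value reported in the Excel output instead.

```python
from statistics import mean, variance

# Avocado usage in ounces, from the table in this lesson
group_a = [4.2, 4.5, 7.2, 6.1, 8.9, 5.2]  # Chef 1
group_b = [4.0, 4.5, 5.0, 5.2, 5.3, 6.1]  # Chef 2

# statistics.variance is the sample variance (divides by n - 1)
var_a, var_b = variance(group_a), variance(group_b)

# Larger variance in the numerator
f_stat = max(var_a, var_b) / min(var_a, var_b)

f_critical = 5.050329058  # one-tail critical value from the Excel output
reject_h0 = f_stat > f_critical

print(f"Means: {mean(group_a):.4f}, {mean(group_b):.4f}")
print(f"Variances: {var_a:.4f}, {var_b:.4f}")
print(f"F = {f_stat:.4f}, reject H0: {reject_h0}")
```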


Group A (Chef 1) Group B (Chef 2)

4.2 4

4.5 4.5

7.2 5

6.1 5.2

8.9 5.3

5.2 6.1

Hypothesis Tests—t-Test for Independent Groups

The table shows the measure of avocados in ounces. If a significant difference in their means is found,

it can be concluded that there is a possibility of Special Cause of Variation.



2-Sample t-Test

The steps for conducting 2-sample t-test in MS-Excel are given below:

1. Open MS Excel, click Data and click Data Analysis.

2. Select 2-Sample Independent t-test assuming unequal variances.

3. In Variable 1 range, select the data set for Group A.

4. In Variable 2 range, select the data set for Group B.

5. Keep the “Hypothesized Mean Difference” as 0.

6. Click Ok.


Assumptions of 2-Sample Independent t-Test

The hypotheses for a 2-Sample Independent t-test are as follows:

● Null Hypothesis: There is no significant statistical difference between the means of the two groups, thus concluding any variation could be because of chance. This is Common Cause of Variation.

● Alternate Hypothesis: There is a significant statistical difference between the means of the two groups, thus concluding variations could be because of assignable causes also. This is Special Cause of Variation.

H0: Mean of Group A = Mean of Group B

Ha: Mean of Group A ≠ Mean of Group B

Note: The alternate hypothesis tests two conditions, Mean of A < Mean of B and Mean of A > Mean of B. Thus a two-tailed probability needs to be used.


2-Tailed vs. 1-Tailed Probability

The differences between the usage of the 2-tailed probability and the 1-tailed probability are as follows:

2-Tailed Probability

● If the alternate hypothesis tests more than one direction, either less or more, use a 2-tailed probability value from the test.

Example: If Mean of A is not equal to Mean of B, then it is 2-tailed probability.

1-Tailed Probability

● If the alternate hypothesis tests one direction, use a 1-tailed probability value from the test.

Example: If Mean of A is greater than Mean of B, then it is 1-tailed probability.


2-Sample Independent t-Test—Results and Interpretations

According to the table:

● The p-value for the 2-tailed test is 0.24 (twice the one-tail value of 0.122 shown in the table).
● This value is greater than 0.05.
● The null hypothesis is not rejected.
● There is no statistically significant difference between the means of the two groups.

t-Test: Two-Sample Assuming Unequal Variances

                               Variable 1     Variable 2
Mean                           6.016666667    5.016666667
Variance                       3.197666667    0.517666667
Observations                   6              6
Hypothesized Mean Difference   0
df                             7
t Stat                         1.270798616
P(T<=t) one-tail               0.122200546
t Critical one-tail            1.894578605
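The t statistic in the table can be reproduced from the summary statistics alone. The following is a minimal sketch, assuming the unequal-variance (Welch) form of the test named in the table title; the variable names are illustrative and the degrees of freedom use the Welch-Satterthwaite approximation, which Excel appears to round to a whole number.

```python
import math

# Summary statistics copied from the Excel output in the table.
mean_a, var_a, n_a = 6.016666667, 3.197666667, 6
mean_b, var_b, n_b = 5.016666667, 0.517666667, 6
hypothesized_diff = 0.0

# Squared standard error contributed by each group.
se2_a = var_a / n_a
se2_b = var_b / n_b

# t statistic for unequal variances.
t_stat = (mean_a - mean_b - hypothesized_diff) / math.sqrt(se2_a + se2_b)

# Welch-Satterthwaite approximation of the degrees of freedom.
df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))

# The 2-tailed probability is twice the one-tail value reported in the table.
p_one_tail = 0.122200546
p_two_tail = 2 * p_one_tail

print(f"t = {t_stat:.4f}, df = {df:.2f} (rounded: {round(df)})")
print(f"2-tailed p = {p_two_tail:.4f}, reject H0 at 0.05: {p_two_tail < 0.05}")
```

Because the t statistic is below even the one-tail critical value in the table, it is also below the larger two-tail critical value, which agrees with the decision not to reject the null hypothesis.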


Paired t-Test

The paired t-test is:

● one of the more powerful tests in the t-test family;
● conducted on measurements taken before and after a change to the process; and
● often used in the Improve stage.

For example, a group of students scores X in CSSGB before taking the training program. After the training program, the scores are taken again.

● One needs to find out if there is a statistically significant difference between the two sets of scores.
● If there is a significant difference, the inference could be that the training was effective.
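The before-and-after comparison can be sketched as follows. The scores are hypothetical (they do not come from the lesson), and the sketch only computes the paired t statistic; the result would still have to be compared with a critical value from a t table for n - 1 degrees of freedom.

```python
import math
import statistics

# Hypothetical scores for the same five students, before and after training.
before = [60, 65, 70, 55, 62]
after = [68, 70, 75, 60, 66]

# The paired test works on the per-student differences, not on the two groups.
diffs = [a - b for a, b in zip(after, before)]
mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)  # sample standard deviation (divides by n - 1)
n = len(diffs)

# t statistic for H0: the mean difference is zero.
t_stat = mean_diff / (sd_diff / math.sqrt(n))
df = n - 1

print(f"mean difference = {mean_diff}, t = {t_stat:.2f}, df = {df}")
```

Working on the differences removes the student-to-student variation, which is why the paired test can detect a change that a 2-sample test on the same data might miss.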


Sample Variance

Sample Variance (s²) is the sum of the squared differences from the sample mean divided by n - 1, where n is the sample size.

● It is used to calculate and understand the degree of variation of a sample.
● In statistics, it is usually converted into the standard deviation and reported together with the mean.

The steps for calculating sample variance are as follows:

1. Calculate the mean of the sample.
2. Subtract the mean from each value.
3. Square each result.
4. Add the squared differences and divide by n - 1.


Sample Variance—Example

The example to calculate sample variance is as follows:

Consider a sample of weights. Suppose the mean value is 140 pounds and that, after subtracting the mean from each value, squaring the results, and dividing the sum of the squared differences by n - 1, the resulting sample variance is 1936.

● To get the standard deviation, take the square root of the sample variance: √1936 = 44.
● The standard deviation, along with the mean, tells you how much the majority of the people weigh.
o With a mean of 140 and a standard deviation of 44, the majority of people (about 68% if the weights are normally distributed) weigh between 96 pounds (140 - 44) and 184 pounds (140 + 44).
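The calculation steps can be traced in a few lines of code. This is a small sketch with hypothetical weights (the lesson does not list the individual values behind its variance of 1936), chosen only so that the mean is 140.

```python
import math
import statistics

# Hypothetical weights in pounds; not the data behind the example in the text.
weights = [110, 130, 140, 150, 170]

n = len(weights)
mean = sum(weights) / n

# Subtract the mean from each value and square the result.
squared_diffs = [(w - mean) ** 2 for w in weights]

# Divide the sum of squared differences by n - 1 for a sample.
sample_variance = sum(squared_diffs) / (n - 1)
std_dev = math.sqrt(sample_variance)

# Cross-check against the standard library implementation.
assert math.isclose(sample_variance, statistics.variance(weights))

print(f"mean = {mean}, variance = {sample_variance}, std dev = {std_dev:.2f}")
print(f"mean +/- 1 std dev: {mean - std_dev:.1f} to {mean + std_dev:.1f}")
```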


ANOVA:

● stands for Analysis of Variance;

● is used to compare the means of more than two samples;

● helps determine whether at least one sample mean differs from the others;

● allows samples shortlisted on the basis of this result to be tested further; and

● generalizes the t-test to include more than two samples.

ANOVA—Comparison of More Than Two Means

71


ANOVA Example

The table shows the takeaway food delivery time of three different outlets.

Outlet 1   Outlet 2   Outlet 3
48         50         49
49         48         48
48         36         39
53         50         49
58         50         34
50         62         33
46         45         57
50         47         48
49         51         47
47         44         39

To benchmark the delivery time of the outlets:

● the null hypothesis will assume that the three means are equal; and
● rejection of the null hypothesis would mean that at least two outlets are different in their average delivery time.
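The F statistic behind a one-way ANOVA can be computed directly from the delivery times in the table. The following is a sketch using only the standard library; it stops at the F statistic, and the critical value quoted in the note afterwards is taken from a standard F table rather than from the lesson.

```python
# Delivery times from the table, one list per outlet.
outlets = {
    "Outlet 1": [48, 49, 48, 53, 58, 50, 46, 50, 49, 47],
    "Outlet 2": [50, 48, 36, 50, 50, 62, 45, 47, 51, 44],
    "Outlet 3": [49, 48, 39, 49, 34, 33, 57, 48, 47, 39],
}

all_values = [v for group in outlets.values() for v in group]
grand_mean = sum(all_values) / len(all_values)

ss_between = 0.0  # variation of the outlet means around the grand mean
ss_within = 0.0   # variation of the values around their own outlet mean
for group in outlets.values():
    group_mean = sum(group) / len(group)
    ss_between += len(group) * (group_mean - grand_mean) ** 2
    ss_within += sum((v - group_mean) ** 2 for v in group)

df_between = len(outlets) - 1
df_within = len(all_values) - len(outlets)

# F is the ratio of the two mean squares.
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(f"SS between = {ss_between:.2f}, SS within = {ss_within:.2f}")
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```

For 2 and 27 degrees of freedom, a standard F table gives a critical value of about 3.35 at α = 0.05, so an F statistic below that value means the null hypothesis of equal means is not rejected.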


Using Minitab for ANOVA

To perform ANOVA in Minitab:

1. Stack the data into two columns: one for the delivery time and one for the outlet.
2. In the main menu, choose Stat > ANOVA > One-Way.
3. Select delivery time as the response and outlet as the factor.
4. Click OK.


Using Minitab for ANOVA (contd.)

The following output is produced when the data is entered into Minitab:

[Figure: Minitab one-way ANOVA output for the delivery time data]


ANOVA using Excel

To perform ANOVA in Excel, enter the data on a spreadsheet, select the “Anova: Single Factor” tool from the Data Analysis ToolPak, and select the input range for analysis and an output range.


Interpreting Minitab Results

The result of the Minitab ANOVA is interpreted as follows:

● Since the p-value is more than 0.05, the null hypothesis is not rejected.
● There is no significant difference between the mean delivery times of the three outlets.
● The confidence intervals of the three outlets overlap, which supports this conclusion.

In one-way ANOVA, only one factor is benchmarked, unlike two-way ANOVA, which involves two factors.


Chi-Square Distribution

The Chi-square distribution (χ²-distribution) or Chi-squared:

● is a widely used probability distribution in inferential statistics;
● needs one sample for the test to be conducted; and
● with k degrees of freedom, is the distribution of a sum of the squares of k independent standard normal random variables.

$$\chi^2_{\text{Calculated}} = \sum \frac{(f_o - f_e)^2}{f_e}$$

Where,

• $\chi^2_{\text{Calculated}}$ = chi-square index
• $f_o$ = an observed frequency
• $f_e$ = an expected frequency


Chi-Square Test—Example

To analyze the Australian hockey team’s wins, the data is arranged by two classifications:

● The table is called a 2 × 4 contingency table.
● Expected frequency for each of the observed frequencies = (row total)(column total)/overall total.

Example: The observed frequency of 3 wins against South Africa in Australia corresponds to an expected frequency of (21 / 31) * 5 = 3.39.

[Table: 2 × 4 contingency table showing Sample Statistics and Estimated Population Parameters]
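The expected-frequency rule can be checked with the figures quoted in the example. This is a minimal sketch: the text does not say which of 21 and 5 is the row total and which is the column total, so the names below are neutral, and only this single cell is computed because the full contingency table is not reproduced here.

```python
# Figures quoted in the worked example for the South Africa / Australia cell.
marginal_total_a = 21   # one of the two marginal totals for the cell
marginal_total_b = 5    # the other marginal total
overall_total = 31
observed = 3

# Expected frequency = (row total)(column total) / overall total.
expected = marginal_total_a * marginal_total_b / overall_total

# This cell's contribution to the chi-square index: (fo - fe)^2 / fe.
contribution = (observed - expected) ** 2 / expected

print(f"expected frequency = {expected:.2f}")
print(f"contribution to chi-square = {contribution:.4f}")
```

The chi-square index is the sum of such contributions over all eight cells of the 2 × 4 table.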


Chi-Square Test—Example (contd.)

The table is populated by:

● calculating and adding the estimated population parameters (the expected frequencies);
● comparing them with the observed frequencies; and
● calculating the final chi-square index.


There is a different chi-square distribution for each number of degrees of freedom. For the chi-square distribution, degrees of freedom are calculated from the number of rows and columns in the contingency table.

Chi-Square Test—Example: Interpretation of Results

Degrees of freedom = (2 - 1) * (4 - 1) = 3

Assuming α = 10%, $\chi^2_{\text{Critical}}$ = 6.251

$\chi^2_{\text{Calculated}}$ = 1.36

The calculated value is found to be less than the critical value, so the null hypothesis is not rejected.

$\chi^2_{\text{Critical}}$ divides the region into acceptance and rejection zones, while $\chi^2_{\text{Calculated}}$ determines whether the null hypothesis is rejected, depending on the zone in which it falls.
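The decision rule can be written out in a few lines. This sketch simply reuses the values quoted in the example (critical value 6.251 at α = 10%, calculated value 1.36); the variable names are illustrative.

```python
# Shape of the contingency table in the example.
rows, cols = 2, 4

# Degrees of freedom = (rows - 1) * (columns - 1).
df = (rows - 1) * (cols - 1)

chi2_calculated = 1.36   # value quoted in the example
chi2_critical = 6.251    # table value for alpha = 0.10 and df = 3

# The null hypothesis is rejected only if the calculated value
# falls in the rejection zone beyond the critical value.
reject_null = chi2_calculated > chi2_critical

print(f"df = {df}, reject H0: {reject_null}")
```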


Analyze

Topic 4—Hypothesis Testing with Non-Normal Data


Mann-Whitney Test

Mann-Whitney or Wilcoxon Rank Sum test is a non-parametric test used to compare two unpaired groups. In this test:

● The value of α is set as 0.05.
● The rejection and acceptance condition remains the same for different cases:

If p < α: Reject the null hypothesis
If p > α: Cannot reject the null hypothesis

The aim of this test is to rank all the data from both conditions together and then compare the rank totals of the two groups.


Mann-Whitney Test

The steps to perform Mann-Whitney test are as follows:

1. Rank all the values from low to high
● The smallest number gets a rank of 1.
● The largest number gets a rank of n, where n is the total number of values in the two groups.

2. Find the average of the ranks for all the identical values
● Continue till all the whole-number ranks are used.

3. Test the values
● Sum the ranks for the observations from sample 1 and then sum the ranks in sample 2 (larger group).


Mann-Whitney Test—Example

An example of performing Mann-Whitney test is shown here.

Group   Data
G1      14, 2, 5, 16, 9
G2      4, 2, 18, 14, 8

Sorted Data   Group   Rank A   Final Rank
2             G1      1        1.5
2             G2      2        1.5
4             G2      3        3
5             G1      4        4
8             G2      5        5
9             G1      6        6
14            G1      7        7.5
14            G2      8        7.5
16            G1      9        9
18            G2      10       10

The tied values share the average of their ranks: the two 2s get (1 + 2) / 2 = 1.5 and the two 14s get (7 + 8) / 2 = 7.5.

G1 Rank (R1): 1.5, 4, 6, 7.5, 9; Total = 28; n1 = 5
G2 Rank (R2): 1.5, 3, 5, 7.5, 10; Total = 27; n2 = 5


Mann-Whitney Test—Example (contd.)

The formula for the Mann-Whitney U test for n1 and n2 values is:

$$U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1$$

$$U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - R_2$$

In this example,

U1 = 5 × 5 + (5 × 6)/2 - 28 = 12 and U2 = 5 × 5 + (5 × 6)/2 - 27 = 13


Mann-Whitney Test—Example (contd.)

To calculate the U value:

● U = Min (U1, U2) = Min (12, 13) = 12
● Look up the critical value in the Mann-Whitney U test table for n1 = 5 and n2 = 5; here it is 2.
● To be statistically significant, the obtained U value must be equal to or less than this critical value.
● Since the calculated U value is 12 (not less than 2), there is no statistically significant difference between the two groups.
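The whole Mann-Whitney example, from ranking to the U value, can be reproduced in code. The following is a sketch using only the standard library; the helper `average_ranks` is a hypothetical name introduced here, and the critical value of 2 is the one quoted in the example rather than something the code looks up.

```python
def average_ranks(values):
    """Rank values from low to high; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Find the end of the run of identical values starting at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2  # average of the whole-number ranks
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


g1 = [14, 2, 5, 16, 9]
g2 = [4, 2, 18, 14, 8]
n1, n2 = len(g1), len(g2)

# Rank the combined data, then split the ranks back by group.
ranks = average_ranks(g1 + g2)
r1 = sum(ranks[:n1])
r2 = sum(ranks[n1:])

u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
u = min(u1, u2)

critical_value = 2  # quoted in the example for n1 = n2 = 5
print(f"R1 = {r1}, R2 = {r2}, U1 = {u1}, U2 = {u2}, U = {u}")
print(f"significant: {u <= critical_value}")
```

A useful check on the arithmetic is that U1 + U2 always equals n1 × n2, which is 25 here.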


Kruskal-Wallis Test

The Kruskal-Wallis test is also a non-parametric test, used for testing whether the samples originate from the same distribution.

Characteristics of Kruskal-Wallis test are as follows:

● It is a one-way analysis of variance by ranks.
● Medians of two or more samples are compared to find whether the samples come from the same source.
● Unlike the analogous one-way analysis of variance, it does not assume the normal distribution of the residuals.

● Null hypothesis: the medians of all the groups are equal.
● Alternative hypothesis: the population median of at least one group is different from that of at least one other group.


Mood’s Median Test

The Mood’s median test is a non-parametric test that is used to test the equality of medians from two or more different populations. This test works when:

● the output (Y) variable is continuous, discrete-ordinal or discrete-count, and
● the input (X) variable is discrete with two or more attributes.

The steps involved in Mood’s Median test are as follows:

1. Find the median of the combined data set.
2. Find the number of values in each sample that are greater than the median.
3. Form a contingency table.
4. Find the expected value for each cell.
5. Find the chi-square value.
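The steps of Mood's Median test can be followed in code. This sketch reuses the G1 and G2 values from the Mann-Whitney example purely for illustration (the lesson does not apply Mood's test to them), and counts values at or below the median in the second column of the contingency table, which is one common convention.

```python
import statistics

groups = {"G1": [14, 2, 5, 16, 9], "G2": [4, 2, 18, 14, 8]}

# Step 1: median of the combined data set.
combined = [v for values in groups.values() for v in values]
overall_median = statistics.median(combined)

# Steps 2 and 3: contingency table with, for each group,
# [count above the median, count at or below the median].
table = {
    name: [sum(v > overall_median for v in values),
           sum(v <= overall_median for v in values)]
    for name, values in groups.items()
}

# Step 4: expected value for each cell = row total * column total / grand total.
grand_total = len(combined)
col_totals = [sum(row[j] for row in table.values()) for j in range(2)]

# Step 5: chi-square value summed over all cells.
chi_square = 0.0
for row in table.values():
    row_total = sum(row)
    for j, observed in enumerate(row):
        expected = row_total * col_totals[j] / grand_total
        chi_square += (observed - expected) ** 2 / expected

df = (len(groups) - 1) * (2 - 1)
print(f"median = {overall_median}, table = {table}")
print(f"chi-square = {chi_square:.2f}, df = {df}")
```

With 1 degree of freedom, a standard chi-square table gives a critical value of 3.841 at α = 0.05, so a chi-square value below that would not lead to rejecting the hypothesis of equal medians.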


Friedman Test

Friedman test is a form of non-parametric test that does not make any assumptions about the shape of the distribution from which the sample originates.

● It allows smaller sample data sets to be analysed, and
● Unlike ANOVA, it does not require the dataset to be randomly sampled from normally distributed populations with equal variances.

The null hypothesis of the test is that the population medians of all the treatments are statistically identical.


1 Sample Sign Test

The 1 Sample Sign test is the simplest of all the non-parametric tests and can be used instead of a one sample t-test.

● Here, H0 states that the median of the population from which the sample is drawn equals the hypothesized (assumed) median.

Steps involved in 1 Sample Sign test are as follows:

1. Count the number of positive values: values that are larger than the hypothesized median.
2. Count the number of negative values: values that are smaller than the hypothesized median.
3. Test the values: check if there are significantly more positives (or negatives) than expected.
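The three steps of the sign test can be sketched as below. The scores and the hypothesized median are hypothetical, and the final step uses an exact two-sided binomial probability with p = 0.5, which is one standard way to judge whether there are significantly more positives or negatives than expected.

```python
from math import comb

# Hypothetical data and hypothesized median, for illustration only.
hypothesized_median = 3.7
scores = [3.9, 4.1, 3.5, 4.0, 3.8, 4.2, 3.6, 4.3, 3.7, 4.4]

# Steps 1 and 2: count values above and below the hypothesized median.
positives = sum(s > hypothesized_median for s in scores)
negatives = sum(s < hypothesized_median for s in scores)
n = positives + negatives  # values equal to the median are dropped

# Step 3: under H0 each sign is equally likely, so the smaller count
# follows a binomial distribution with n trials and p = 0.5.
k = min(positives, negatives)
tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
p_value = min(1.0, 2 * tail)  # two-sided

print(f"positives = {positives}, negatives = {negatives}, n = {n}")
print(f"two-sided p-value = {p_value:.4f}, reject H0 at 0.05: {p_value < 0.05}")
```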


1 Sample Wilcoxon Test

The 1 Sample Wilcoxon test, also known as the Wilcoxon Signed Rank test, is a non-parametric test.

This test is:

● the non-parametric equivalent of the parametric One Sample t-Test, and
● more powerful than the non-parametric 1 Sample Sign Test.


Characteristics of 1 Sample Wilcoxon Test

Some characteristics of this test are as follows:

● It assumes the existing sample is randomly taken from a population, with a symmetric frequency distribution around the median, and
● The symmetry can be observed with a histogram, or by checking if the median and mean are approximately equal.

The conclusion in this test is that if the sample median is consistent with the hypothesized value, the null hypothesis is not rejected. If not, the null hypothesis is rejected in favour of the alternate hypothesis.


An example of the 1 Sample Wilcoxon test is shown.

1 Sample Wilcoxon Test—Example

The median customer satisfaction score of an organization has always been 3.7 and the management wants to see if this has changed. They conducted a survey and got the results grouped by the customer type.

Conclusion:

● H0: median = 3.7. If the test does not show a significant difference, do not reject H0.
● Ha: median ≠ 3.7. If the test shows a significant difference, reject H0 in favour of Ha.
● α = 0.05


Quiz


QUIZ 1: Which of the following describes the population parameters based on the sample data using a particular model?

● Probability
● Correlation
● Statistics
● Inferential statistics

Answer: Inferential statistics

Explanation: Inferential statistics describe the population parameters based on the sample data using a particular model.


QUIZ 2: Which of the following is an application of the population knowledge to predict the sample behavior?

● Normal distribution
● Chi-square distribution
● Probability distribution
● Poisson distribution

Answer: Poisson distribution

Explanation: Poisson distribution is an application of the population knowledge to predict the sample behavior.


QUIZ 3: Which of the following is used to calculate the degree of movement of variable Y as X changes?

● Probability
● F-distribution
● Regression
● Correlation

Answer: Regression

Explanation: The degree of movement of variable Y as X changes is calculated using regression.


QUIZ 4: A null hypothesis states that a process has not improved as a result of some modifications. The type II error is to conclude that:

● we have failed to reject the null hypothesis (H0) when it is false.
● we have rejected the null hypothesis.
● we have made a correct decision with alpha probability.
● we have failed to reject the null hypothesis (H0) when it is true.

Answer: we have failed to reject the null hypothesis (H0) when it is false.

Explanation: A type II error means that we have failed to reject the null hypothesis (H0) when it is false.


QUIZ 5: The test used for testing significance in an analysis of variance table is the:

● t-test.
● F-test.
● Chi-square test.
● Z-test.

Answer: F-test.

Explanation: The appropriate ANOVA test is the F-test. ANOVA is a test of the equality of means.


QUIZ 6: Which of the following is a one-way analysis of variance by ranks?

● 1 Sample Sign test
● Friedman test
● Kruskal-Wallis test
● 1 Sample Wilcoxon test

Answer: Kruskal-Wallis test

Explanation: The Kruskal-Wallis test is a one-way analysis of variance by ranks.


QUIZ 7: What distribution is used while making inferences about a population variance based on a single sample from that population?

● Normal distribution
● t-distribution
● F-distribution
● Chi-square distribution

Answer: Chi-square distribution

Explanation: The chi-square distribution is used to compare a sample variance with a known population variance.


QUIZ 8: If the p-value is less than the significance level, the null hypothesis has to be:

● accepted.
● maintained as it is.
● re-evaluated.
● rejected.

Answer: rejected.

Explanation: If the p-value is less than the significance level, the null hypothesis has to be rejected, as the data does not support the null hypothesis and the difference is statistically significant.


QUIZ 9: Which of the following is a nonparametric test that is used to test the equality of medians from two or more different populations?

● Kruskal-Wallis test
● Friedman test
● 1 Sample Sign test
● Mood’s median test

Answer: Mood’s median test

Explanation: The Mood’s median test is a nonparametric test that is used to test the equality of medians from two or more different populations.


QUIZ 10: Which of the following is a ratio of two chi-square distributions?

● t-distribution
● Binomial distribution
● F-distribution

Answer: F-distribution

Explanation: The F-distribution is the distribution of the ratio of two independent chi-square variables, each divided by its degrees of freedom.


QUIZ 11: Which of the following is the probability of correctly rejecting the null hypothesis when it is false?

● Power of a test
● Simple linear regression
● 1 Sample Sign test
● Simple linear correlation

Answer: Power of a test

Explanation: The power of a test is the probability of correctly rejecting the null hypothesis when it is false.


QUIZ 12: Which of the following assumes that the existing sample is randomly taken from a population, with a symmetric frequency distribution around the median?

● 1 Sample Wilcoxon test
● Friedman test
● Kruskal-Wallis test

Answer: 1 Sample Wilcoxon test

Explanation: The 1 Sample Wilcoxon test assumes that the existing sample is randomly taken from a population, with a symmetric frequency distribution around the median.


Summary

Here is a quick recap of what we have learned in this lesson:

● Discrete probability distribution is characterized by the probability mass function and continuous probability distribution is characterized by the probability density function.
● Multi-Vari studies are used to analyze variation in a process.
● Correlation means association between variables. Simple Linear Regression and Multiple Regression are the two main techniques used to model the relationship.
● Hypothesis testing is conducted on different sets of data. Analysis of variance is used to compare the means of more than two sample sets.
● A t-test is used for 1-sample tests, and 2-sample t-tests are used for comparing two means.


Summary (contd.)

Here is a quick recap of what we have learned in this lesson:

● The Mann-Whitney or Wilcoxon Rank Sum test is used to compare two unpaired groups.
● The Kruskal-Wallis test is used for testing whether samples originate from the same distribution.
● The Mood’s median test is used to test the equality of medians from two or more different populations.
● The Friedman test does not make any assumptions about the shape of the distribution from which the sample originates.
● The 1 Sample Sign test is the simplest of all the non-parametric tests that can be used in the place of a 1 sample t-test.

THANK YOU
