inpo statistics workshop: notes and exercises using excel and sas developed by jennifer lewis...

101
INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

Upload: eugene-richardson

Post on 28-Dec-2015

218 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS

Developed by Jennifer Lewis Priestley, Ph.D.

Kennesaw State University

Page 2: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

Statistics Workshop Topics

MODULE ONE: Concept Review• Review of Statistical Concepts• Data Analysis using EXCEL

MODULE TWO: Inferential testing using EXCEL and SAS• Confidence Intervals• Ttests• ANOVA• Chi-Square

MODULE THREE: Predictive Modeling using EXCEL and SAS• Regression Analysis• Logistic Analysis• Discriminant Analysis

Page 3: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 1: Review of Statistical Concepts

Prior to analysis, a determination must be made of the type of variables in question. Variable type will, in many cases, dictate the analysis options. Variable types for review:• Qualitative

– categorical (e.g., gender, race)– ordinal (e.g., rankings, Likert data*)

• Quantitative– interval (e.g., temperature)– ratio (e.g., weight, height)

* Mathematicians and Statisticians consider Likert data to be qualitative and therefore restrict its use to qualitative techniques such as Chi-square analysis. However, in practice, most people treat Likert data as quantitative and utilize quantitative techniques.

Page 4: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

Descriptions of data typically include:• Measurements of Central Tendency

– Mean, Median, Mode (do you know when to use each?)

• Measurements of Dispersion– Standard Deviation and Variance

Outlier Detection:• For near bell-shaped data

– Use 3-Sigma (Empirical) Rule : any value that is more than 3 standard deviations above or below the mean

• For Skewed Data– Use Tukey’s Rule: any value that is more than one

step below Q1 or more than 1 step above Q3; A step = 1.5*IQR

MODULE 1: Review of Statistical Concepts

Page 5: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

The Central Limit Theorem forms the basis for why inferential statistics (versus descriptive statistics) is possible.

Prior to reviewing the Theorem, pull up this site:

http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html

MODULE 1: Review of Statistical Concepts

Page 6: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

• The distribution of sample means will, as the number of samples increases approach a normal distribution;

• The mean of all sample means approximates the population mean;• The std of all sample means is the std of the population/the SQRT of

the sample size;• If the population is NOT normally distributed, sample sizes must be

greater than 30 to assume normality;• If the population IS normally distributed, samples can be of any size

to assume normality (although greater than 30 is always preferred).

MODULE 1: Review of Statistical Concepts

Important concepts to remember about the Central Limit Theorem:

Page 7: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

A few points to remember about the Normal Distribution:• Bell shaped and symmetric about the mean μ.• Mean = μ, Median = μ, Mode = μ. • The area under the normal curve below μ is .5. • Probability that a Normal Random Variable Outcome:

– Lies within +/- 1 std dev of the mean is .6826– Lies within +/- 2 std dev of the mean is .9544– Lies within +/- 3 std dev of the mean is .9974

• For all other probabilities, convert the relevant observation to a Z-score: Z=(x- μ)/ σ

• Any observation that has a Z-score greater than 2 is typically considered to be a statistical outlier…since its probability of occurrence is less than 5%.

MODULE 1: Review of Statistical Concepts

Page 8: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 1: Review of Statistical Concepts

Ho is true Ho is false

Reject Ho TYPE I Error Valid Decision

Do not reject Ho

Valid Decision TYPE II Error

Hypothesis Testing:

Page 9: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

1. Statement of Hypotheses— H0: Null Hypothesis – opposite of Alternative Hypothesis — H1: Alternative Hypothesis – what we are trying to prove— Evaluate Type I & Type II Errors

2. Set Significance level, — Standard of Proof, or Level of Risk— Represents Probability of Type I Error

3. Calculate the test statistic from the sample.

4. Calculate the p-value (strength of the evidence)— 1. If the p-value < , Accept H1. — 2. If the p-value > , Accept H0.

MODULE 1: Review of Statistical Concepts

Page 10: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

In this section, we will use EXCEL to execute some of the most common types of univariate/bivariate and multivariate data analysis:

• Descriptive Statistics• Histograms• Scatterplots and Charts• Pivot Tables• Using Formulas (fx)• Look Up Tables• Lagniappe

MODULE 1: Data Analysis Using EXCEL

Page 11: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 1: COMP1 Dataset

• Speed = Amount of time it takes to deliver a product (in days) once the order has been received.

• PriceLv = Perceived level of price charged by suppliers

• PriceFlx = Perceived willingness of COMP1 managers to negotiate price

• Man_Imag = Overall image of manufacturer

• Service = Overall level of service necessary to maintain a satisfactory relationship with customer

• Sal_Imag = Overall perceived image of salesforce

• Quality = Perceived level of product quality

The dataset used throughout this workshop is the COMP1 dataset, with 14 variables and 100 observations.

• Size = Large (1) or Small (0)• Usage = Percentage of total product

purchased from COMP1• Satisf = How satisfied purchaser is with

Comp1• SpecBuy = Extent to which a purchaser

evaluates each purchase separately (1=each purchase evaluated separately, 0 = lot buying)

• Procure = Centralization of purchase decisions (1= Centralized, 0=decentralized)

• Ind_Type = Classification of Industry affiliation (1=Industry A, 0=other)

• Buy_Sit = Type of buying situation (NEW = New, MOD = Modified purchase, REP = Repeat Purchase

Page 12: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 1: Data Analysis Using EXCEL

Using the COMP1 dataset, determine the following descriptive statistics for each quantitative variable:

• The most appropriate measurement of central tendency;• Two measurements of dispersion;• For one variable, identify if any outliers exist.

Determine these statistics using the f(x) options:=AVERAGE, =MEDIAN, =MODE, =STDEV, =VAR

Page 13: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

Using the COMP1 dataset, determine the following descriptive statistics for each quantitative variable:

• The most appropriate measurement of central tendency;• Two measurements of dispersion;• For one variable, identify if any outliers exist.

This time, use the “Descriptive Statistics” option:

TOOLDATA ANALYSISDESCRIPTIVE STATISTICS

MODULE 1: Data Analysis Using EXCEL

Page 14: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 1: Data Analysis Using EXCEL

Using the COMP1 dataset, develop a histogram for the quant variable of your choice (use three categories).

EXAMPLE: Take the Usage variable and subtract the min value from the max value (65-25 = 40). Divide that number (the range) by the number of desired categories (40/3 = 13.33). Now, “massage” that figure as necessary to create reasonable, logical categories of approximately equal size:

• 25-39• 40-54• 55-65

Page 15: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 1: Data Analysis Using EXCEL

A few points about histograms in EXCEL:

• Create a column over to the right that contains only the TOP of each category that you have assigned

• Label this category “BIN RANGE”

• TOOLSDATA ANALYSISHISTOGRAM

• Ensure that the “Labels” box is ticked

• Ensure that the “Cumulative Percentage” and “Chart Output” boxes are ticked

Page 16: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 1: Data Analysis Using EXCEL

Using the COMP1 dataset, develop a scatterplot of two quantitative variables of your choice.

EXAMPLE: Chart the two variables Usage and Price Level.

NOTE: it is helpful to move the two variables into columns next to each other

Page 17: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 1: Data Analysis Using EXCEL

Using the COMP1 dataset, develop a pivot table of the entire dataset.

DATAPIVOT TABLES

NOTE: ensure that the labels are included in the table

Page 18: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 1: Data Analysis Using EXCEL

A few more widely used examples of functions in EXCEL:

=IF=AND=ABS=SUMPRODUCT=RAND=RANDBETWEEN

Page 19: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 1: Data Analysis Using EXCEL

Look up tables in EXCEL can be a very helpful way of converting quantitative data into categorized qualitative data…which sometimes can be easier to work with.

Use the VLOOKUP function to categorize the Speed variable into “EXCELLENT” “AVERAGE” and “POOR” events.

Clarification of EXCEL notation :VLOOKUP(enter the speed value here,enter the range of the two column table here,enter the column number of the desired category label here)

Page 20: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

Lagniappe: This word derives from New World Spanish la ñapa, “the gift,”. The word came into the Creole dialect of New Orleans and there acquired a French spelling. It is still used in the Gulf States, especially southern Louisiana, to denote a little bonus that a friendly shopkeeper might add to a purchase.

One Lagniappe presented here is the process of selecting “the best” option from among several alternatives.

Page 21: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

Lagniappe for everyone!

Page 22: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

Statistics Workshop Topics

MODULE ONE: Concept Review• Review of Statistical Concepts• Data Analysis using EXCEL

MODULE TWO: Inferential testing using EXCEL and SAS• Confidence Intervals• Ttests• ANOVA• Chi-Square

MODULE THREE: Predictive Modeling using EXCEL and SAS• Regression Analysis• Logistic Analysis• Discriminant Analysis

Page 23: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing – Confidence Intervals

We always prefer to use descriptive statistics. However, often we are forced to take a sample and use inferential statistics because of issues related to cost, time, money or access.

When taking a sample, we can estimate a population parameter such as a mean or a proportion using the sample statistic (which is not very accurate) or we can calculate a confidence interval.

Page 24: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

Confidence Intervals are calculated using two formulas:

CI for the population mean:

x+ Z*(s)/SQRT(n)

where, x = the sample mean

Z = Z-score (90%CI = 1.645, 95%CI = 1.96, 99%CI = 2.575)

s = sample standard deviation

n = sample size

Note: the part of the expression after the + is called the Margin of Error Z*(s)/SQRT(n).

MODULE 2: Inferential Testing – Confidence Intervals

Page 25: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

The CI for the population proportion is represented by:

p+ Z*SQRT((pq)/n)

where, p = the sample proportion

Z = Z-score (90%CI = 1.645, 95%CI = 1.96, 99%CI = 2.575)

q=1-p

n = sample size

Note: the part of the expression after the + is called the Margin of Error Z*(s)/SQRT(n).

MODULE 2: Inferential Testing – Confidence Intervals

Page 26: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing – Confidence Intervals

Fun Manual Calculation! using the Gallup Website:

http://poll.gallup.com/

Replicate the Gallup prediction using the second CI formula.

Page 27: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing – Confidence Intervals

More fun calculations! Using the Comp1 dataset, calculate the 95% CI for a quantitative variable using formula 1.

=CONFIDENCE Alpha is the accepted prob. of making a T1 error. It is also 1-the confidence level.

Page 28: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing – Confidence Intervals

The result generated from =CONFIDENCE(.05, 1.32, 100) is 0.258857. What does this number mean?

This is the Margin of Error. In other words, if you were to report the 95% confidence interval for this company’s speed, you would report:

3.52 + .25

Or

3.25 to 3.77

Page 29: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

Lets execute CIs in SAS. First, calculate the 95% CI for the quantitative variable Speed. Use the following code:

Proc Import datafile = "c:\COMP1.xls" OUT = COMP1 DBMS = "EXCEL97" Replace;run;

Proc Print data=Comp1;Run;

Proc Means data=Comp1 CLM alpha=.05;Var Speed;Run;

MODULE 2: Inferential Testing – Confidence Intervals

Should you require a higher or lower alpha (.01 is more conservative and .10 is more risk tolerant), change the .05 as appropriate.

Page 30: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing – Confidence Intervals

The output is pretty simple:

Notice that this is the same interval from the EXCEL output. Isnt it nice when numbers match?

Page 31: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

Now, calculate the 95% CI for the qualitative variable Procure. Use the following code:

Proc Freq data=Comp1;

Tables Procure/Binomial alpha=.05;

Run;

MODULE 2: Inferential Testing – Confidence Intervals

Note that EXCEL does not readily support this calculation.

Page 32: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing – Confidence Intervals

The output is pretty simple:

Page 33: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing

Dependent Variable

Independent (predictor) Variable

Hypothesis Test Comments

Categorical (Qualitative)

Categorical (Qualitative)

Chi-Square Tests if variables are statistically independent (i.e. are they related or not?)

Quantitative

Categorical (Qualitative)

T-TESTANOVA

Determines if categorical variable (factor) affects dependent variable; Ttests for 1 or 2 groups and ANOVA for 3 or more.

Quantitative

Quantitative Regression Analysis

Test establishes a regression model; used to explain, predict or control dependent variable

Page 34: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

The Chi-Square Test is used to determine if two QUALITATIVE variables are related.

The Chi-Square statistic is computed using the following formula:

X2 = Σ(fo-fe)2/fe

Where:fo is the frequency of the observed valuefe is the frequency of the expected value

This calculated test statistic is converted into a p-value to evaluate the presence of a relationship (or not).

MODULE 2: Inferential Testing

Page 35: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

Using the COMP1 dataset, determine if a company’s buying situation (BuySit) is related to the centralization of their purchasing decisions (Procure).

The hypothesis statements for this test are:

Ho: BuySit and Procure are NOT related

H1: BuySit and Procure ARE related

Develop the appropriate testing matrix and identify the Type1 and Type2 errors.

MODULE 2: Inferential Testing

Page 36: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

Now, using EXCEL, develop the 2x3 matrix of these two variables using a pivot table (place “count of procure” in the center).

You should see this:

This table includes the “frequencies of the observed” values from our Chi-Square formula.

MODULE 2: Inferential Testing

Count of Procure BuySitProcure MOD NEW REP Grand Total

0 10 8 32 501 22 26 2 50

Grand Total 32 34 34 100

Page 37: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

Now, we need to determine the expected values. If there is NO relationship, then we would expect to see exactly 32% of the decentralized procurement companies with a modified buying situation, 34% with a new buying situation, etc.

The matrix of expected values looks like this:

MODULE 2: Inferential Testing

MOD NEW REP TOTAL0 16 17 17 501 16 17 17 50

32 34 34 100

=(32/100)*50

Page 38: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

Now use the =CHITEST formula in EXCEL. The actual range requested is the INTERIOR of the observed matrix and the expected range is the INTERIOR of the expected matrix (note that the marginal values are the same for the two matrices).

MODULE 2: Inferential Testing

The resulting value is: 1.61E-09 or .00000000161.

Page 39: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

This is a p-value. Given the rule : if p<a reject Ho and p>a accept Ho…

Can we conclude that there is a relationship between the centralization of the purchasing decision and the buying situation?

MODULE 2: Inferential Testing

Page 40: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

Lets execute the same test in SAS, using the following code:

MODULE 2: Inferential Testing

Proc Freq data=Comp1;

Tables Procure*BuySit/CHISQ;

Run;

Page 41: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

This code creates the following output:

MODULE 2: Inferential Testing

This legend provides the key to interpreting these numbers

Page 42: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

Here is the second part of the output:

MODULE 2: Inferential Testing

This is the same p-value that we obtained in EXCEL

Page 43: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing

Any additional questions on Chi-Square?

Page 44: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing

Dependent Variable

Independent (predictor) Variable

Hypothesis Test Comments

Categorical (Qualitative)

Categorical (Qualitative)

Chi-Square Tests if variables are statistically independent (i.e. are they related or not?)

Quantitative

Categorical (Qualitative)

T-TESTANOVA

Determines if categorical variable (factor) affects dependent variable; Ttests for 1 or 2 groups and ANOVA for 3 or more.

Quantitative

Quantitative Regression Analysis

Test establishes a regression model; used to explain, predict or control dependent variable

Page 45: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

Ttests represent the most common form of statistical testing. It involves either one sample or two independent samples.

1.One Sample Ttest - compares the mean of the sample to a given number.

• e.g. Is average monthly revenue per customer who switches >$50 ?

Formal Hypothesis Statement examples:

H0: $50

H1: > $50

H0: = $50

H1: $50

MODULE 2: Inferential Testing

Page 46: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

2. Two Sample Ttest - compares the mean of the first sample minus the mean of the second sample to a given number.

• e.g. Is there a difference in the production output of the two facilities?

Formal Hypothesis Statement examples:

H0: a b

H1: a > b

H0: a = b

H1: a b

MODULE 2: Inferential Testing

Page 47: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

3. Paired Sample Ttest - compares the mean of the differences in the observations to a given number.

e.g. Is there a difference in the production output of a facility after the implementation of new procedures?

Formal Hypothesis Statement example:

H0: post - pre <=0

H1: post - pre > 0

MODULE 2: Inferential Testing

Page 48: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing using EXCEL

Using the COMP1 dataset and EXCEL, determine if there is a difference in the overall satisfaction of large versus small companies (the large companies will represent a sample and the small companies will represent a second sample).

Determine the null and alternative hypothesis statements for this question…then develop the 2x2 hypothesis matrix…including the Type1 and Type2 errors.

Now, sort the data by size. The satisfaction values associated with the 0s (the small firms) will be our first array of numbers and the satisfaction values associated with the 1s (the large firms) will be our second array of numbers.

Page 49: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing using EXCEL

Satisfaction of small firms

Satisfaction of large firms

Two tailed test

homoscedastic

Your computed value should be 1.80363E-06…what do you conclude?

Page 50: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing using SAS

Using the COMP1 dataset and SAS, determine if there is a difference in the overall satisfaction of large versus small companies.

Here is the necessary code:

Proc Ttest data=Comp1;Var Satisf;Class Size;Run;

the quantitative variable of interest

the qualitative variable which identifies the two samples

Page 51: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing using SAS

Page 52: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing using SAS

Using the COMP1 dataset and SAS, determine if the overall satisfaction is more than 5.0.

Develop the hypothesis statements and develop the 2x2 matrix.

Here is the necessary code:

Proc Ttest data=Comp1 H0=5.0;

Var Satisf;

Run;

Page 53: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing using SAS

Page 54: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing

Any additional questions on Ttests?

Page 55: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing

Dependent Variable

Independent (predictor) Variable

Hypothesis Test Comments

Categorical (Qualitative)

Categorical (Qualitative)

Chi-Square Tests if variables are statistically independent (i.e. are they related or not?)

Quantitative

Categorical (Qualitative)

T-TESTANOVA

Determines if categorical variable (factor) affects dependent variable; Ttests for 1 or 2 groups and ANOVA for 3 or more.

Quantitative

Quantitative Regression Analysis

Test establishes a regression model; used to explain, predict or control dependent variable

Page 56: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing - ANOVA

As with Ttests, ANOVA is a common test to determine if differences exist among samples. Where Ttests evaluate either one or two samples (groups), ANOVA accommodates 3 or more.

The hypothesis statements in ANOVA look like this:

Note: the hypotheses are interpreted as “at least one mean is different”…not that all means are different.

H0: a = b = c

H1: a = b = c

Page 57: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing - ANOVA

SST = SSW + SSB

ij(Xij-X)2 = ij(Xij-Xj)2 + nj(Xj-X)2

SST = Total Sum of Squares

SSW = Sum of Squares Within Groups

SSB = Sum of Squares Between Groups

X = mean of data for all the sample groups combinedXj = mean of the jth sample groupXij = the ith element from the jth groupn = number of samples in each group

__

Page 58: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

FUN MANUAL CALCULATION OF ANOVA!

Page 59: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

Question: If we have to test for a difference among a group of three or more samples, why cant we simply execute a series of ttests?

Hint: the issue has to do with the accepted risk of making a TYPE 1 error.

Page 60: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing - ANOVAUsing the Comp1 dataset in EXCEL, determine if there is a difference in satisfaction among the different types of buying situations.

Hint: we will first have to convert the values for the variable “buying situations” from alpha characters to numeric characters…the data is still categorical…EXCEL cannot read alpha characters for ANOVA.

=IF(M2="NEW",1,IF(M2="MOD",2,IF(M2="REP",3)))

Then: TOOLSDATA ANALYSISANOVA: SingleFactor

Note: the input range will be the quant variable and the categorical (numeric) variable ONLY. And…it helps if they are next to each other.

Page 61: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

Anova: Single Factor

SUMMARYGroups Count Sum Average Variance

Satisf 100 477.1 4.771 0.731979BuySit 100 200 2 0.686869

ANOVASource of Variation SS df MS F P-value F crit

Between Groups 383.9221 1 383.9221 541.1745 1.53E-58 3.888857Within Groups 140.4659 198 0.709424

Total 524.3879 199

MODULE 2: Inferential Testing - ANOVAYour output should look like this:

This is your p-value. What do you conclude?

SSB SSW

Page 62: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

Question: Where are the differences?

Page 63: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing - ANOVAUsing the Comp1 dataset in SAS, determine if there is a difference in satisfaction among the different types of buying situations.

Here is the necessary code:

Proc ANOVA data=Comp1;Class Buysit;

Model Satisf=Buysit;

Means Buysit/TUKEY;Run;

Qualitative Independent/Classification Variable

Quantitative Dependent=Independent

Statement to determine WHICH groups are different

Page 64: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing - ANOVA

Note that this output looks almost the same as the EXCEL output

SSB

SSW

Page 65: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing - ANOVA

Page 66: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing

Any additional questions on ANOVA?

Page 67: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

Statistics Workshop Topics

MODULE ONE: Concept Review• Review of Statistical Concepts• Data Analysis using EXCEL

MODULE TWO: Inferential testing using EXCEL and SAS• Confidence Intervals• Ttests• ANOVA• Chi-Square

MODULE THREE: Predictive Modeling using EXCEL and SAS• Regression Analysis• Logistic Analysis• Discriminant Analysis

Page 68: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

Dependent Variable

Independent (predictor) Variable

Hypothesis Test

Comments

Categorical (Qualitative)

Categorical (Qualitative)

Chi-Square Tests if variables are statistically independent (i.e. are they related or not?)

Quantitative Categorical (Qualitative)

T-TESTANOVA

Determines if categorical variable (factor) affects dependent variable; Ttests for 1 or 2 groups and ANOVA for 3 or more.

Quantitative

Quantitative or Dummy

Regression Analysis

Test establishes a regression model; used to explain, predict or control dependent variable

Categorical >2 Categories

Quantitative or Dummy

Discriminant Analysis

Test establishes a discriminant model; used to explain, predict or control dependent variable

Binary Quantitative or Dummy

Logistic Analysis

Test establishes a logistic model; used to explain, predict or control dependent variable

MODULE 3: Predictive Modeling

Page 69: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 3: Predictive Modeling

All models are wrong…but some are useful.

Page 70: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 3: Predictive Modeling – Correlation Analysis

Prior to any model development, an initial correlation analysis is typically generated to understand which variables are related.

As with any inferential method, Correlation Analysis is conducted within the context of Hypothesis Statements.

The general form of these statements in Correlation Analysis is:

H0: Variable A and Variable B are NOT relatedH1: Variable A and Variable B ARE related

so, if p<a, then the conclusion is that Variable A and Variable B ARE related.

Page 71: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

Determine which variables in the Comp1 dataset are correlated using EXCEL.

TOOLSDATA ANALYSISCORRELATION

MODULE 3: Predictive Modeling – Correlation Analysis

Page 72: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

The output is below. What do the numbers mean?

MODULE 3: Predictive Modeling – Correlation Analysis

Page 73: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

These numbers represent the percentage of change in one variable that moves with the change in another variable.

Because these values are percentages, they can vary from a low of negative 1 to a high of positive 1, including 0.

For example, Price Level and Speed have a correlation of -.34923. This means that 35% of the change in Price Level moves with 35% of the change in Speed…and vice versa. In addition, since this number is negative, the change moves in the OPPOSITE direction (i.e. as one goes up, the other goes down).

Note: Correlation does NOT equate to causation. At this point in the analysis, we cannot state that Price Level decreases Speed or vice versa. We can only claim that we know that they have an inverse relationship.

MODULE 3: Predictive Modeling – Correlation Analysis

Page 74: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

0

1

2

3

4

5

6

0 2 4 6 8

Speed

Pri

ce

Le

vel

The scatterplot below is a visual representation of the correlation coefficient -.34923. Is this relationship significant?

MODULE 3: Predictive Modeling – Correlation Analysis

Page 75: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

SAS provides more information regarding the statistical significance of any correlations. The statistical significance will be highly sensitive to the sample size…if the sample is too large…EVERYTHING will appear to be significant. Here is the necessary code:

Proc Corr data=Comp1;Run;

MODULE 3: Predictive Modeling – Correlation Analysis

Page 76: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 3: Predictive Modeling – Correlation Analysis

Note: the number on the top is the correlation coefficient and the number on the bottom is the p-value

Page 77: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 3: Predictive Modeling – Correlation Analysis

Rather than execute a full matrix, if you are only interested in assessing the correlations of the variables with one or two variables of interest (e.g., dependent variables), include the “with” option:

Proc Corr data=Comp1;With Usage;Run;

Page 78: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 3: Predictive Modeling – Correlation Analysis

So…what will you conclude?

Page 79: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 3: Predictive Modeling – Regression

In Regression, we generate a line of the form:

Y = β0 + β1X1 + β2X2 +… βnXn

that represents “best-fit” of the data…meaning that the difference between the actual and the predicted values is minimized. In this equation,

Y = the predicted value (MUST be a ratio scale variable)

B0 = the Y intercept

B1 = the coefficient or weight of X1

X1 = an independent or predictor variable (MUST be a ratio scale variable or a dummy variable)

Page 80: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

To evaluate the “Best fit line”, we calculate the following statistics using ANOVA (again usually done by the computer program):

Total sums of squares (SST), Σ (Yi - Y)2 is a measure of the variability of the dependent variable Y

Error sums of squares (SSE), Σ (Yi - Yi-Pred)2 is a measure of the variability of the dependent variable Y that is left over after using the regression model and the predictor variable X to explain Y. If this value becomes 0, then our Regression model is perfect (usually not the case!)

MODEL(Regression) sums of squares (SSR), SST - SSE, is a measure of the amount of the total sums of squares accounted for by the regression model. The closer SSR is to SST, the better the model explains the variation of Y.

Coefficient of determination, or R2, is calculated as SSR/SST. This is the proportion of the total sums of squares (i.e. total variation in Y) that is explained by the regression model. R2 is close to 1 for good fits and close to 0 for no relationship.

MODULE 3: Predictive Modeling – Regression

Page 81: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 3: Predictive Modeling – Regression

Lets execute a Regression analysis in EXCEL. Use Usage as the dependent (y) variable.

TOOLSDATA ANALYSISREGRESSION

Page 82: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 3: Predictive Modeling – Regression

The dependent variable is the Y RangeAll other quantitative or dummy variables are in the X Range

The Residuals can be helpful

Page 83: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 3: Predictive Modeling – Regression

From this output, what do you know?

Page 84: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 3: Predictive Modeling – Regression

Using this output, answer the following questions:

1. What is the linear equation?2. Which predictor variables are significant?3. How accurate is the model?4. How could you improve the efficiency (parsimony) of

the model?5. Can you use this model to make a prediction?6. Are there any problems with the model? Such as

multicollinearity among the predictors? Or any influential observations?

Page 85: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 3: Predictive Modeling – Regression

Although a Regression model can be developed using EXCEL, the process is fairly manual and there are limited diagnostics.

Page 86: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 3: Predictive Modeling – Regression

Limitations of using EXCEL for Regression modeling:

1. Only supports an “all in” selection method; retention or deletion of variables must be done manually.

2. No diagnostics for collinearity among the predictors.3. No intervals for the predictions.4. Cumbersome identification of outliers.

Page 87: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 3: Predictive Modeling – Regression

Proc Reg data=Comp1;

Model Usage = Speed PriceLv PriceFlx Man_Imag Service

Sal_Imag Quality Size Satisf SpecBuy Procure IndType/

VIF P R CLI Selection=Forward Partial;

Output out=reg p=pred;

Plot Residual.*Pred.;

Run;

Quit;

Here is the necessary code:

Page 88: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 3: Predictive Modeling – Regression

Page 89: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

Looks as if Obs 7 might be a problem for us…

MODULE 3: Predictive Modeling – Regression

Page 90: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

Any additional questions on Regression?

Page 91: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

Dependent Variable

Independent (predictor) Variable

Hypothesis Test

Comments

Categorical (Qualitative)

Categorical (Qualitative)

Chi-Square Tests if variables are statistically independent (i.e. are they related or not?)

Quantitative Categorical (Qualitative)

T-TESTANOVA

Determines if categorical variable (factor) affects dependent variable; Ttests for 1 or 2 groups and ANOVA for 3 or more.

Quantitative

Quantitative or Dummy

Regression Analysis

Test establishes a regression model; used to explain, predict or control dependent variable

Categorical >2 Categories

Quantitative or Dummy

Discriminant Analysis

Test establishes a discriminant model; used to explain, predict or control dependent variable

Binary Quantitative or Dummy

Logistic Analysis

Test establishes a logistic model; used to explain, predict or control dependent variable

MODULE 3: Predictive Modeling

Page 92: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 3: Predictive Modeling – Discriminant Analysis

What if we need to create a model with a CATEGORICAL dependent variable?

Regression Analysis is not a viable option. ANOVA should be executed first. Followed by Discriminant Analysis.

Page 93: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

Proc Discrim data=Comp1 method=normal

Pool=Yes List Crossvalidate;

Class BuySit;

Var Speed PriceLv PriceFlx Man_Imag Service

Sal_Imag Quality Size Satisf SpecBuy Procure IndType Usage;

Run;

MODULE 3: Predictive Modeling – Discriminant Analysis

Here is the necessary code to execute Discriminant Analysis:

Page 94: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 3: Predictive Modeling – Discriminant Analysis

The objective in Discriminant Analysis is to minimize the errors in the “Hit Matrix”:

Page 95: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

Dependent Variable

Independent (predictor) Variable

Hypothesis Test

Comments

Categorical (Qualitative)

Categorical (Qualitative)

Chi-Square Tests if variables are statistically independent (i.e. are they related or not?)

Quantitative Categorical (Qualitative)

T-TESTANOVA

Determines if categorical variable (factor) affects dependent variable; Ttests for 1 or 2 groups and ANOVA for 3 or more.

Quantitative

Quantitative or Dummy

Regression Analysis

Test establishes a regression model; used to explain, predict or control dependent variable

Categorical >2 Categories

Quantitative or Dummy

Discriminant Analysis

Test establishes a discriminant model; used to explain, predict or control dependent variable

Binary Quantitative or Dummy

Logistic Analysis

Test establishes a logistic model; used to explain, predict or control dependent variable

MODULE 3: Predictive Modeling

Page 96: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 3: Predictive Modeling – Logistic Analysis

Logistic Analysis is a particularly handy modeling technique when the dependent variable can assume two (and only two) values.

Common Applications of Logistic Analysis:

• Will an individual qualify for a loan?• Does this person have a particular disease?• Will this person respond to a marketing solicitation?• Will applicant A be accepted to college?• Will the product fail in a particular situation?

Page 97: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 3: Predictive Modeling – Logistic Analysis

The general form of the Logistic Equation is:

Prob (event)= eB0 + B1X1+B2X2…+BnXn

Prob (no event)

The objective in Logistic Analysis is to correctly predict the presence of an event. The accuracy of the model, is then evaluated using several metrics, including a classification table.

Page 98: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

Proc Logistic descending data=Comp1;

model SpecBuy = Speed PriceLv PriceFlx Man_Imag Service Sal_Imag Quality/

Selection=Stepwise

CTable PPROB = (0 to 1 by .1)

Lackfit

RISKLIMITS;

run;

quit;

MODULE 3: Predictive Modeling – Logistic Analysis

Here is the code to execute Logistic Analysis:

Page 99: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 3: Predictive Modeling – Logistic Analysis

Page 100: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 3: Predictive Modeling – Logistic Analysis

Page 101: INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

General Questions on Predictive Modeling?