part one - islamic university of...

Doing Data Analysis With SPSS Dr. Nafez M. Barakat


2012-2013



How to run the SPSS program:

Click the Start button, select Programs, and choose SPSS 14.0 for Windows from the list. Wait a few moments until the program is ready for use.

To Show or Hide a Toolbar: From the menus choose View > Toolbars.

[Figure: the SPSS Data Editor window, labeled with the menu bar, toolbar, variables (columns), cases (rows), cell editor, row number, variable name, and active cell.]


Opening a Data File: From the menus choose File > Open. In the Open File dialog box, select the file you want to open, and click Open.

Saving Data Files: From the menus choose File > Save As. Select a file type from the drop-down list (SPSS (*.sav)). Enter a file name for the new data file.

Basic Steps for Data Analysis: Analyzing data with SPSS is easy. All you need to do is:

1. Get data into SPSS.
2. Select a procedure.
3. Select the variables for the analysis.
4. Run the procedure and look at the results.

Entering Data into the Data Editor: Many of the features of the Data Editor are similar to those found in spreadsheet applications. There are, however, several important distinctions:

Rows are cases: each row represents a case or observation. For example, each individual respondent to a questionnaire is a case.

Columns are variables: each column represents a variable or characteristic being measured. For example, each item on a questionnaire is a variable.

Cells contain values: each cell contains a single value of a variable for a case. The cell is the intersection of the case and the variable.

The data file is rectangular: the dimensions of the data file are determined by the number of cases and variables.

Example: suppose a questionnaire contains the following questions:

Gender: male / female

Job category: clerical / custodial / manager

Salary: $ ……………..

Enter the data for the questions above in the SPSS Data Editor:

At the bottom of the Data Editor, click the Variable View tab. A different grid appears, with these column headings:

Under Name, enter the variable name gender for the first question, jobcat for the second question, and salary for the third question.

Rules for variable names: the name must begin with a letter; the remaining characters can be any letter, any digit, a period, or the symbols @, #, _, $. Variable names cannot end with a period. Blanks and special characters (for example, !, ?, ', and *) cannot be used. Each variable name must be unique; duplication is not allowed. Variable names are not case sensitive: the names gender, GENDER, and Gender are all identical in SPSS.

Reserved keywords in SPSS, such as NOT, AND, OR, …, cannot be used as variable names.
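The naming rules above can be checked mechanically. The following Python sketch is only an illustration of those rules as stated here; the function name `is_valid_name` and the three-word reserved list are assumptions made for the example (the full SPSS keyword list is longer), and it is not part of SPSS.

```python
import re

# Reserved keywords named in the text; the real list is longer, so this
# set is an illustrative assumption.
RESERVED = {"NOT", "AND", "OR"}

# First character a letter; later characters letters, digits, period, @, #, _, $.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9.@#_$]*")

def is_valid_name(name, existing=()):
    """Return True if `name` satisfies the naming rules listed above."""
    if not NAME_PATTERN.fullmatch(name):
        return False                      # bad first character, blank, or special character
    if name.endswith("."):
        return False                      # names cannot end with a period
    if name.upper() in RESERVED:
        return False                      # reserved keyword
    # Names are not case sensitive, so compare everything in upper case.
    return name.upper() not in {n.upper() for n in existing}

print(is_valid_name("jobcat"))              # True
print(is_valid_name("job cat"))             # False (contains a blank)
print(is_valid_name("GENDER", ["gender"]))  # False (duplicate, ignoring case)
```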

Define variable type: Click the small gray button marked with three dots in the Type column; you will see this dialog box.

The available data types are: numeric, comma, dot, scientific notation, date, dollar, custom currency, and string.

The custom currency formats CCA, CCB, CCC, CCD, and CCE are defined in the Currency tab of the Options dialog box, accessed from the Edit menu.

We select Numeric with a width of 8 digits and 0 decimal places for the variables jobcat and salary, and String for the variable gender.

You can change the width and decimal places in the columns named Width and Decimals.

Define Labels:

Labels provide descriptive names for variables and can be up to 250 characters long; these descriptive labels are displayed in the output. We enter gender, employment category, and current salary for our three variables.

Coding Variables:

Click the small gray button marked with three dots in the Values column; you will see this dialog box.

For the variable gender, type f in the Value box and female in the Value Label box, then click Add. Then type m in the Value box and male in the Value Label box, and click Add.

For the variable jobcat, type 1 in the Value box and clerical in the Value Label box, then click Add. Then type 2 in the Value box and custodial in the Value Label box, and click Add. Finally, type 3 in the Value box and manager in the Value Label box, and click Add.

The salary variable is a quantitative variable, so no value labels are defined for it.

Define missing values:

Click the small gray button marked with three dots in the Missing column; you will see this dialog box.

Define Missing Values marks specified data values as user-missing; these missing values are excluded from the calculations.

You can enter up to three discrete (individual) missing values, a range of missing values, or a range plus one discrete value.

Ranges can only be specified for numeric variables. You cannot define missing values for long string variables.
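To make the effect of user-missing values concrete, here is a small Python sketch. The salary figures and the code 9999 (standing for "no answer") are hypothetical; the point is only that values declared missing are set aside before a statistic is computed.

```python
from statistics import mean

# Hypothetical salary responses; 9999 is an assumed user-missing code.
salaries = [21000, 9999, 35000, 28000, 9999, 40000]
USER_MISSING = {9999}

# Keep only the valid (non-missing) values, as a statistical procedure would.
valid = [s for s in salaries if s not in USER_MISSING]

print(len(valid), len(salaries) - len(valid))  # 4 valid cases, 2 missing cases
print(mean(valid))                             # mean over the valid cases only
```

Had 9999 not been declared missing, it would have been averaged in as if it were a real salary.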

Define Column Format:

You can define the width of a column in the column named Columns. We set the column width for the three variables to 8, then click the Align column and choose Center.

Measurement of the variables:

Click the column named Measure, and choose Nominal for the variable gender, Ordinal for the variable jobcat, and Scale for the variable salary.

You can now click Data View and enter the data, as shown below:

Inserting new cases:

To insert a new case between existing cases: select any cell in the case (row) below the position where you want to insert the new case. From the menus choose: Data > Insert Cases.

A new row is inserted for the case, and all variables receive the system-missing value.

Inserting new variable:

To insert a new variable between existing variables: select any cell in the variable (column) to the right of the position where you want to insert the new variable. From the menus choose: Data > Insert Variable.

A new variable is inserted with the system-missing value for all cases.

Moving or removing Variables:

Click the name of the variable at the top of its column. To remove the variable, choose from the menus: Edit > Cut. To move the variable, cut it, select the column where you want it to appear, and choose: Edit > Paste.

Go To Case:

To go to a case in the Data Editor:

Make the Data Editor the active window. From the menus choose: Data > Go to Case. Enter the Data Editor row number for the case and click OK.

Search for Data:

To find a data value in the Data Editor:

Select any cell in the column of the variable you want to search. From the menus choose: Edit > Find. Enter the data value you want to find. Click Find Next.

Opening a Data File:

Choose File > Open > Data. A dialog box like the one shown below appears.

Select the appropriate directory for your system and you will see a list of available files. Select the one named Employee data, then click Open.

Displaying Distributions with Graphs:

Frequency tables, bar charts, and pie charts are used for qualitative data or for small data sets. Stem-plots, histograms, and time plots are used for quantitative variables.

Frequency Tables:

To create a frequency table for a categorical variable, follow these steps. Click Analyze > Descriptive Statistics > Frequencies; the Frequencies dialog box appears. Click gender in the left rectangle and move it to the right rectangle named Variable(s):

Click the Charts button; the Charts dialog box appears as below. Click Bar charts, and under Chart Values click Frequencies; finally click Continue.

We return to the Frequencies dialog box. Click OK, and the resulting SPSS for Windows output appears.

Frequencies

[DataSet1] D:\Program Files\SPSSEval\Employee data.sav

This table summarizes how many observations we have in the dataset; here there are 474 observations, all with valid data values, and there are no missing data.

Statistics

Gender
N    Valid      474
     Missing      0


In this table, 216 (45.6%) of the employees are female and 258 (54.4%) are male. The cumulative percent is the percentage for the current category plus the percentages of the categories above it.

Q. What is the difference between percent and the valid percent ?

Gender

                 Frequency   Percent   Valid Percent   Cumulative Percent
Valid   Female      216        45.6        45.6               45.6
        Male        258        54.4        54.4              100.0
        Total       474       100.0       100.0
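As a cross-check on the frequency table, the three percentage columns can be recomputed from the counts. This Python sketch assumes the usual definitions (percent uses all cases in the denominator, valid percent uses only non-missing cases); the variable names are made up for the example.

```python
counts = {"Female": 216, "Male": 258}   # valid frequencies from the table
n_missing = 0                           # this file has no missing cases

n_valid = sum(counts.values())
n_total = n_valid + n_missing

cumulative = 0.0
for category, freq in counts.items():
    percent = 100 * freq / n_total        # share of all cases
    valid_percent = 100 * freq / n_valid  # share of non-missing cases
    cumulative += valid_percent           # running total of valid percent
    print(f"{category:<8}{freq:>5}{percent:>8.1f}{valid_percent:>8.1f}{cumulative:>8.1f}")
```

With no missing cases the two denominators coincide, which is why the Percent and Valid Percent columns are identical here.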

The graph below shows the bar chart for gender; the height of each bar represents the number of employees in that category.

[Figure: bar chart of Gender (Male, Female) on the horizontal axis against Frequency (0 to 300) on the vertical axis.]

Editing bar charts:

Double-click the bar chart in the SPSS output. The Chart Editor opens.

Choose Elements > Show Data Labels, as illustrated below.

The dialog box appears as shown below. Click Text Style, type 14 in the Preferred Size box, and then click Apply.

Choose File > Close to close the Chart Editor; the following graph then appears.

[Figure: bar chart of Gender against Frequency (0 to 300), with data labels shown on the bars: Male 258, Female 216.]

Comparing the salaries of females and males using bar charts:

Choose Graphs > Interactive > Bar... The following dialog box appears.

Complete the dialog box as shown below, and click OK.

[Figure: interactive bar chart, bars show means of Current Salary by Gender: Female $26,032, Male $41,442.]

Q. From the bar chart above: do males or females have the higher salary? Why?

Another method for comparing the mean salary of males and females:

Choose Graphs > Bar. The following dialog box appears.

Choose Simple and Summaries for groups of cases, then click the Define button.

Complete the dialog box as shown below. If you want to include a title or a footnote on the chart, click Titles, type the desired information, and click Continue to return to the original dialog box. Click OK.

A new window appears, containing the bar chart.

[Figure: bar chart of Mean Current Salary ($0 to $50,000) by Gender: Male $41,442, Female $26,032.]

One variable Descriptive Statistics:

The Frequencies procedure provides statistics and a histogram for quantitative variables, as follows:

Choose Analyze > Descriptive Statistics > Frequencies. The following dialog box appears. Move Current Salary to the rectangle named Variable(s).

Click the Statistics button. The following dialog box appears; complete it as shown below, then click Continue to return to the original dialog box.

Percentile Values: values of salary (a quantitative variable) that divide the ordered data into groups so that a certain percentage is above and another percentage is below. Quartiles (the 25th, 50th, and 75th percentiles) divide the observations into four groups of equal size.

If you want an equal number of groups other than four, select Cut points for n equal groups. You can also specify individual percentiles (for example, the 77th percentile, the value below which 77% of the observations fall).

Central Tendency: statistics that describe the location of the distribution. You can select the mean, median, and mode, or the sum of all the values.

Dispersion: statistics that measure the amount of variation or spread in the data. You can select the Std. deviation (standard deviation), variance, range, minimum, maximum, or S.E. mean (standard error of the mean).

Distribution: statistics that describe the shape and symmetry of the distribution. You can select skewness or kurtosis. These statistics are displayed with their standard errors.
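The statistics described above can be reproduced by hand on a small data set. The Python sketch below uses nine hypothetical salaries (in thousands of dollars) and only the standard library. It assumes that the `exclusive` method of `statistics.quantiles`, which interpolates at position (n + 1)p, is a reasonable stand-in for the weighted-average percentile definition reported in the SPSS output; treat that correspondence as an assumption rather than a guarantee.

```python
from math import sqrt
from statistics import mean, median, stdev, variance, quantiles

# Hypothetical salaries in thousands of dollars (not from the employee file).
salary = [21, 24, 27, 29, 30, 31, 35, 40, 57]
n = len(salary)

# Central tendency
print("mean  ", round(mean(salary), 2))
print("median", median(salary))

# Dispersion
print("std. deviation", round(stdev(salary), 3))     # sample standard deviation
print("variance      ", round(variance(salary), 3))
print("range         ", max(salary) - min(salary))
print("S.E. mean     ", round(stdev(salary) / sqrt(n), 3))

# Quartiles: the 25th, 50th, and 75th percentiles
q1, q2, q3 = quantiles(salary, n=4, method="exclusive")
print("quartiles", q1, q2, q3)   # 25.5 30.0 37.5
print("IQR      ", q3 - q1)      # 12.0
```

The single large value (57) pulls the mean above the median, the same pattern that a right-skewed salary distribution produces.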

Click the Charts button, then click Histograms and check the box With normal curve. Click Continue to return to the original dialog box.


A histogram also has bars, but they are plotted along an equal interval scale. The height of each bar is the count of values of quantitative variables falling within the interval. The histogram shows the shape, center, and spread of the distribution. A normal curve superimposed on a histogram helps you judge whether the data are normally distributed.

Click OK to get the following results.

Frequencies

This table shows the following results:

Statistics

Current Salary
N                         Valid      474
                          Missing    0
Mean                                 $34,419.57
Std. Error of Mean                   $784.311
Median                               $28,875.00
Mode                                 $30,750
Std. Deviation                       $17,075.661
Variance                             291578214
Skewness                             2.125
Std. Error of Skewness               .112
Kurtosis                             5.378
Std. Error of Kurtosis               .224
Range                                $119,250
Minimum                              $15,750
Maximum                              $135,000
Percentiles               10         $21,000.00
                          20         $22,950.00
                          25         $24,000.00
                          30         $24,825.00
                          40         $26,700.00
                          50         $28,875.00
                          60         $30,750.00
                          70         $34,500.00
                          75         $37,162.50
                          77         $38,850.00
                          80         $41,100.00
                          84         $47,550.00
                          90         $59,700.00


Q1. Is the distribution symmetric, skewed to the right, or skewed to the left? Why?

Q2. Find the IQR (interquartile range = Q3 – Q1 = P75 – P25).

Q3. In this example, would you prefer the range or the IQR to describe the dispersion of the data, and why?

[Figure: histogram of Current Salary ($0 to $125,000) against Frequency (0 to 120), with normal curve; Mean = $34,419.57, Std. Dev. = $17,075.661, N = 474.]

Q4. How would you describe the shape of this distribution?

Descriptives:

The Descriptives procedure displays univariate summary statistics for several variables in a single table and calculates standardized values (z scores). For each variable, Descriptives can compute the mean, standard deviation, minimum, maximum, variance, range, standard error of the mean, and skewness and kurtosis with their standard errors. The median, mode, quartiles, and percentiles are not available in Descriptives, but they can be obtained using the Frequencies procedure. Variables can be ordered by the size of their means (in ascending or descending order), alphabetically, or by the order in which you select the variables.

When z scores are saved, they are added to the data in the Data Editor and are available for SPSS charts, data listings, and analyses. When variables are recorded in different units (for example, salary, education, and experience), a z-score transformation places the variables on a common scale for easier visual comparison.
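A z score is simply (value − mean) / standard deviation. The Python sketch below standardizes two hypothetical variables recorded in different units; the helper name `z_scores` and the data are invented for the example, and the sample standard deviation (n − 1 in the denominator) is assumed.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize values: subtract the mean, divide by the sample std. deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

salary = [21000, 24000, 27000, 30000, 48000]   # hypothetical, in dollars
education = [8, 12, 12, 15, 19]                # hypothetical, in years

for name, data in (("salary", salary), ("education", education)):
    print(name, [round(z, 2) for z in z_scores(data)])
```

After the transformation both variables have mean 0 and standard deviation 1, so a z score of 2 means "two standard deviations above average" whether the original unit was dollars or years.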

Example: open Employee data.sav, find the descriptives for the current salary, and discuss the results.

To obtain descriptive statistics:

From the menus choose: Analyze > Descriptive Statistics > Descriptives... The Descriptives dialog box appears. Click Current Salary to move it into the rectangle named Variable(s), and check the box "Save standardized values as variables", as shown below.

Optionally, you can click Options for optional statistics and display order, as shown in the Descriptives: Options dialog box below.

Click Continue to return to the Descriptives dialog box, then click OK to get these results.


Descriptives

Descriptive Statistics

                             Current Salary    Valid N (listwise)
N             Statistic      474               474
Range         Statistic      $119,250
Minimum       Statistic      $15,750
Maximum       Statistic      $135,000
Sum           Statistic      $16,314,875
Mean          Statistic      $34,419.57
              Std. Error     $784.311
Std. Deviation Statistic     $17,075.661
Skewness      Statistic      2.125
              Std. Error     .112
Kurtosis      Statistic      5.378
              Std. Error     .224

Note that the table above was pivoted: we dragged an icon from the column tray into the row tray, and an icon from the row tray into the column tray, to obtain the layout shown.

EXPLORE:

The Explore procedure produces summary statistics either for all of your cases or separately for groups of cases. You can obtain:

Graphical displays, including boxplots, stem-and-leaf plots, and histograms, with outliers identified.

Frequency tables, percentiles, and other descriptive statistics.

Tests for normality, including probability plots and the Shapiro-Wilk and Lilliefors tests.

Levene's test for assessing equality of variances.

Robust estimates of location (M-estimators).

Reasons for using the Explore procedure:

There are many reasons for using the Explore procedure: data screening, outlier identification, description, assumption checking, and characterizing differences among subpopulations (groups of cases). Data screening may show that you have unusual values, extreme values, gaps in the data, or other peculiarities. Exploring the data may also indicate whether or not the distribution of the data is normal.

Example: open the file named Employee data.sav.

From the menus choose: Analyze > Descriptive Statistics > Explore...

The Explore dialog box appears. Move the salary variable (a quantitative variable) to the rectangle named Dependent List.

Click Statistics to request robust estimators, outliers, percentiles, descriptives, and the 95% confidence interval for the mean, then click Continue.

Click Plots to request histograms, stem-and-leaf plots, and normality plots with tests, then click Continue.

Click OK to obtain the following results.

Explore

This table shows that we have 474 valid observations and no missing values.

Case Processing Summary

                              Cases
                   Valid            Missing          Total
                   N     Percent    N     Percent    N     Percent
Current Salary     474   100.0%     0     .0%        474   100.0%

The Descriptives table shows several statistics.

95% confidence interval for mean (lower and upper bound): a confidence interval is a range used to estimate a population mean.

5% trimmed mean: the 5% trimmed sample mean, computed by omitting the highest and lowest 5% of the sample data.

We discussed the other statistics in the previous sections.
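To show how a confidence interval for the mean is built from the mean and its standard error, the Python sketch below takes those two numbers from the output for Current Salary and forms mean ± quantile × standard error. One assumption is stated up front: the standard library has no t distribution, so the normal quantile is used as a large-sample approximation to the t quantile with 473 degrees of freedom. The bounds therefore land a few dollars away from the SPSS values ($32,878.40 and $35,960.73).

```python
from statistics import NormalDist

mean = 34419.57        # Mean of Current Salary from the SPSS output
std_error = 784.311    # Std. Error of the mean from the same output

# Normal quantile as a large-sample stand-in for the t quantile
# with n - 1 = 473 degrees of freedom (an approximation).
z = NormalDist().inv_cdf(0.975)

lower = mean - z * std_error
upper = mean + z * std_error
print(f"approximate 95% CI: {lower:,.2f} to {upper:,.2f}")
```

The small gap from the SPSS bounds comes entirely from using about 1.960 in place of the slightly larger t quantile.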

Descriptives

Current Salary                                       Statistic       Std. Error
Mean                                                 $34,419.57      $784.311
95% Confidence Interval for Mean    Lower Bound      $32,878.40
                                    Upper Bound      $35,960.73
5% Trimmed Mean                                      $32,455.19
Median                                               $28,875.00
Variance                                             291578214
Std. Deviation                                       $17,075.661
Minimum                                              $15,750
Maximum                                              $135,000
Range                                                $119,250
Interquartile Range                                  $13,163
Skewness                                             2.125           .112
Kurtosis                                             5.378           .224

The M-Estimators table shows alternatives to the sample mean for estimating the center of location. The estimators differ in the weights they apply to cases. Huber's M-estimator, Tukey's biweight, Hampel's M-estimator, and Andrews' wave estimator are displayed.

M-Estimators

                   Huber's           Tukey's         Hampel's          Andrews'
                   M-Estimator(a)    Biweight(b)     M-Estimator(c)    Wave(d)
Current Salary     $29,434.84        $27,613.71      $28,739.16        $27,599.33

a. The weighting constant is 1.339.
b. The weighting constant is 4.685.
c. The weighting constants are 1.700, 3.400, and 8.500.
d. The weighting constant is 1.340*pi.

The Percentiles table displays the values of the 5th, 10th, 25th, 50th, 75th, 90th, and 95th percentiles.

Percentiles

                                                  Percentiles
                                   5            10           25           50           75           90           95
Weighted Average (Definition 1)
  Current Salary                   $19,200.00   $21,000.00   $24,000.00   $28,875.00   $37,162.50   $59,700.00   $70,218.75
Tukey's Hinges
  Current Salary                                             $24,000.00   $28,875.00   $37,050.00

The Extreme Values table displays the five largest values and the five smallest values, with case labels.

Extreme Values

Current Salary          Case Number    Value
Highest     1           29             $135,000
            2           32             $110,625
            3           18             $103,750
            4           343            $103,500
            5           446            $100,000
Lowest      1           378            $15,750
            2           338            $15,900
            3           411            $16,200
            4           224            $16,200
            5           90             $16,200

The Tests of Normality table accompanies the normal probability and detrended normal probability plots. It displays the Kolmogorov-Smirnov statistic, with a Lilliefors significance level for testing normality, together with the Shapiro-Wilk statistic. Both significance levels are reported as .000, which is less than 0.05, so we conclude that the distribution of the data is not normal.

Tests of Normality

                   Kolmogorov-Smirnov(a)          Shapiro-Wilk
                   Statistic   df    Sig.         Statistic   df    Sig.
Current Salary     .208        474   .000         .771        474   .000

a. Lilliefors Significance Correction

We have two plots, a histogram and a stem-and-leaf plot. We discussed the histogram previously; now we discuss the stem-and-leaf plot.

[Figure: histogram of Current Salary ($25,000 to $125,000) against Frequency (0 to 120); Std. Dev. = $17,075.661, N = 474.]


In this output there are three columns of information, representing frequencies, stems, and leaves. The stem width equals 10000, each leaf represents 3 cases, and the extreme values are those greater than or equal to $56,750. For example, look at the first row, which represents 33 values from 10000 to less than 20000; the values are (15000, 15000, 15000, 16000, 16000, 16000, …, 19000).

Q. How would you describe the shape of this distribution? Compare the histogram and the stem-and-leaf plot: what important differences, if any, do you see?

Current Salary Stem-and-Leaf Plot

 Frequency    Stem &  Leaf

    33.00        1 .  56667789999
   110.00        2 .  00001111111222222222333334444444444
   115.00        2 .  555555556666666667777777778888889999999
    80.00        3 .  000000000001111112233333444
    32.00        3 .  55556677889
    20.00        4 .  0001233&
    12.00        4 .  5678&
    12.00        5 .  0124&
     7.00        5 .  556
    53.00 Extremes    (>=56750)

 Stem width:   10000
 Each leaf:    3 case(s)

This plot is called a normal quantile plot (normal Q-Q plot). Data that follow a normal distribution produce points close to a straight line on the normal quantile plot. Systematic deviations from a straight line indicate a nonnormal distribution. Outliers appear as points that are far away from the overall pattern of the plot. We note that most points lie far from the straight line, indicating a nonnormal distribution.

[Figure: Normal Q-Q Plot of Current Salary; Observed Value (0 to 120,000) against Expected Normal (-3 to 3).]

[Figure: Detrended Normal Q-Q Plot of Current Salary; Observed Value (0 to 125,000) against Dev from Normal (-2 to 4).]

This plot is called a box plot. It illustrates the minimum value, first quartile (Q1 = P25), second quartile (Q2 = P50), third quartile (Q3 = P75), maximum value, and extreme values (outliers). We must distinguish between minor and major outliers: minor outliers, denoted by o in the plot, are observations more than 1.5 × IQR outside the central box; major outliers, denoted by * in the plot, are observations more than 3 × IQR outside the central box.

Notes:

If the line representing Q2 (the median) lies in the middle of the box, the distribution is roughly symmetric, as a normal distribution would be.

If the line representing Q2 (the median) lies near Q1, the distribution is skewed to the right.

If the line representing Q2 (the median) lies near Q3, the distribution is skewed to the left.
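The 1.5 × IQR and 3 × IQR rules can be written out as a short calculation. The Python sketch below takes the box edges from the Tukey's Hinges row of the Percentiles table ($24,000 and $37,050); using the hinges for the box is an assumption of this example, and the function name `classify` is invented for illustration.

```python
q1, q3 = 24000.0, 37050.0     # Tukey's hinges from the Percentiles table
iqr = q3 - q1                 # width of the central box

def classify(value):
    """Label a value as 'inside', 'minor outlier', or 'major outlier'."""
    distance = max(q1 - value, value - q3, 0.0)   # distance outside the box
    if distance > 3 * iqr:
        return "major outlier"
    if distance > 1.5 * iqr:
        return "minor outlier"
    return "inside"

# Salaries taken from the Extreme Values table and the stem-and-leaf plot.
for salary in (15750, 56750, 103750, 135000):
    print(salary, classify(salary))
```

Under these assumptions the upper minor-outlier limit is 37,050 + 1.5 × 13,050 = 56,625, which agrees with the stem-and-leaf plot listing values of 56,750 and above as extremes.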

We can compare two groups of data using the Explore procedure as follows:

Choose Analyze > Descriptive Statistics > Explore. Move the salary variable into the Dependent List rectangle, and move gender into the Factor List rectangle, as shown in the Explore dialog box.

[Figure: box plot of Current Salary ($0 to $125,000), annotated with the minimum value, Q1 = P25, Q2 = P50 = median, Q3 = P75, maximum value, minor outliers, and major outliers; outlying cases are labeled by case number.]


Click the Statistics and Plots buttons and choose any statistics and plots you want, to get the following results.

Explore

Gender

Case Processing Summary

                                         Cases
                              Valid            Missing          Total
                   Gender     N     Percent    N     Percent    N     Percent
Current Salary     Female     216   100.0%     0     .0%        216   100.0%
                   Male       258   100.0%     0     .0%        258   100.0%

Descriptives

Current Salary, Gender = Female                      Statistic       Std. Error
Mean                                                 $26,031.92      $514.258
95% Confidence Interval for Mean    Lower Bound      $25,018.29
                                    Upper Bound      $27,045.55
5% Trimmed Mean                                      $25,248.30
Median                                               $24,300.00
Variance                                             57123688
Std. Deviation                                       $7,558.021
Minimum                                              $15,750
Maximum                                              $58,125
Range                                                $42,375
Interquartile Range                                  $7,013
Skewness                                             1.863           .166
Kurtosis                                             4.641           .330

Current Salary, Gender = Male                        Statistic       Std. Error
Mean                                                 $41,441.78      $1,213.968
95% Confidence Interval for Mean    Lower Bound      $39,051.19
                                    Upper Bound      $43,832.37
5% Trimmed Mean                                      $39,445.87
Median                                               $32,850.00
Variance                                             380219336
Std. Deviation                                       $19,499.214
Minimum                                              $19,650
Maximum                                              $135,000
Range                                                $115,350
Interquartile Range                                  $22,675
Skewness                                             1.639           .152
Kurtosis                                             2.780           .302


M-Estimators

                              Huber's           Tukey's         Hampel's          Andrews'
                   Gender     M-Estimator(a)    Biweight(b)     M-Estimator(c)    Wave(d)
Current Salary     Female     $24,606.10        $24,015.98      $24,419.25        $24,005.82
                   Male       $34,820.15        $31,779.76      $34,020.57        $31,732.27

a. The weighting constant is 1.339.
b. The weighting constant is 4.685.
c. The weighting constants are 1.700, 3.400, and 8.500.
d. The weighting constant is 1.340*pi.

Percentiles

Weighted Average (Definition 1), Current Salary

                       Percentiles
Gender     5            10           25           50           75           90           95
Female     $16,950.00   $18,660.00   $21,487.50   $24,300.00   $28,500.00   $34,890.00   $40,912.50
Male       $23,212.50   $25,500.00   $28,050.00   $32,850.00   $50,725.00   $69,325.00   $81,312.50


Extreme Values

Current Salary
Gender                        Case Number    Value
Female    Highest     1       371            $58,125
                      2       348            $56,750
                      3       468            $55,750
                      4       240            $54,375
                      5       72             $54,000
          Lowest      1       378            $15,750
                      2       338            $15,900
                      3       411            $16,200
                      4       224            $16,200
                      5       90             $16,200
Male      Highest     1       29             $135,000
                      2       32             $110,625
                      3       18             $103,750
                      4       343            $103,500
                      5       446            $100,000
          Lowest      1       192            $19,650
                      2       372            $21,300
                      3       258            $21,300
                      4       22             $21,750
                      5       65             $21,900

Tests of Normality

                              Kolmogorov-Smirnov             Shapiro-Wilk
                   Gender     Statistic   df    Sig.         Statistic   df    Sig.
Current Salary     Female     .146        216   .000         .842        216   .000
                   Male       .208        258   .000         .813        258   .000


Current Salary

Histograms

[Figure: histogram of Current Salary for gender = Female ($20,000 to $60,000) against Frequency (0 to 40); Mean = $26,031.92, Std. Dev. = $7,558.021, N = 216.]

[Figure: histogram of Current Salary for gender = Male ($25,000 to $125,000) against Frequency (0 to 100); Std. Dev. = $19,499.214, N = 258.]


Stem-and-Leaf Plots

Current Salary Stem-and-Leaf Plot for gender = Female

     2.00        1 .  55
    16.00        1 .  6666666666777777
    14.00        1 .  88889999999999
    31.00        2 .  0000000000000111111111111111111
    35.00        2 .  22222222222222222222233333333333333
    38.00        2 .  44444444444444444444444444555555555555
    22.00        2 .  6666666666677777777777
    17.00        2 .  88888899999999999
     7.00        3 .  0001111
     8.00        3 .  22233333
     8.00        3 .  44444555
     5.00        3 .  66777
     2.00        3 .  88
    11.00 Extremes    (>=40800)

Current Salary Stem-and-Leaf Plot for gender = Male

     1.00        1 .  &
    18.00        2 .  11222344
    64.00        2 .  555556666666677777777888889999
    60.00        3 .  0000000000000011111112333344
    22.00        3 .  5555667899
    16.00        4 .  000023&
    11.00        4 .  55678&
     9.00        5 .  0124&
    10.00        5 .  5569&
     8.00        6 .  001&
    14.00        6 .  56688&
     6.00        7 .  03&
     5.00        7 .  58
     4.00        8 .  &&
    10.00 Extremes    (>=86250)

& denotes fractional leaves.


Normal Q-Q Plots

[Figure: Normal Q-Q plot of Current Salary for gender = Female; Observed Value (0 to 60,000) against Expected Normal (-3 to 3).]

[Figure: Normal Q-Q plot of Current Salary for gender = Male; Observed Value (0 to 120,000) against Expected Normal (-3 to 3).]

Detrended Normal Q-Q Plots

[Figure: Detrended Normal Q-Q plot of Current Salary for gender = Female; Observed Value (20,000 to 60,000) against Dev from Normal (-0.5 to 2.0).]

[Figure: Detrended Normal Q-Q plot of Current Salary for gender = Male; Observed Value (0 to 125,000) against Dev from Normal (-1.0 to 2.5).]


[Figure: box plots of Current Salary ($0 to $125,000) by Gender (Male, Female), with outlying cases labeled by case number.]

Q. Compare male and female salaries, using each table or graph above, on the following statistics:

Mean, median, mode, skewness, normality of the distributions, outliers, and confidence intervals for the means.

Bivariate Correlations

The Bivariate Correlations procedure computes Pearson's correlation coefficient, Spearman's rho, and Kendall's tau-b, with their significance levels. Before calculating a correlation coefficient, screen your data for outliers (which can cause misleading results) and for evidence of a linear relationship. Pearson's correlation coefficient is a measure of linear association. Two variables can be perfectly related, but if the relationship is not linear, Pearson's correlation coefficient is not an appropriate statistic for measuring their association.

Notes: For quantitative, normally distributed variables, use Pearson's correlation coefficient. If your data are not normally distributed or have ordered categories, use Spearman's rho or Kendall's tau-b, which measure the association between rank orders.

Correlation coefficients range from -1 (a perfect negative relationship) to +1 (a perfect positive relationship). A value of 0 indicates no linear relationship.
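The difference between a linear measure and a rank-based measure can be seen on a tiny example. The Python sketch below implements Pearson's coefficient directly and obtains Spearman's rho by applying the same formula to average ranks; the data (y equal to the cube of x) are made up so that the relationship is perfect but not linear. The function names are illustrative, not SPSS functions.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho: Pearson's coefficient computed on the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]          # perfectly related to x, but not linearly

print("Pearson :", round(pearson(x, y), 3))
print("Spearman:", round(spearman(x, y), 3))
```

Spearman's rho equals 1 because the rank orders agree exactly, while Pearson's coefficient stays below 1 because the points do not fall on a straight line.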

You can choose two-tailed or one-tailed probabilities. If the direction of association is known in advance, choose one-tailed. Otherwise, choose two-tailed.

Correlation coefficients significant at the 0.05 level are identified with a single asterisk, and those significant at the 0.01 level are identified with two asterisks.

Example: open the file named Employee data.sav.

From the menus choose: Analyze > Correlate > Bivariate.

The Bivariate Correlations dialog box appears. Move the variables (salary, educ, jobtime, and previous experience in months) to the rectangle named Variables.

Click Pearson, Kendall's tau-b, and Spearman, then click OK.

The following results appear.

Note that the variables are not normally distributed, so Spearman's coefficient is the one to interpret.


Correlations

                                                       Current    Previous      Months       Educational
                                                       Salary     Experience    since Hire   Level (years)
                                                                  (months)
Current Salary                 Pearson Correlation     1          -.097*        .084         .661**
                               Sig. (2-tailed)                    .034          .067         .000
                               N                       474        474           474          474
Previous Experience (months)   Pearson Correlation     -.097*     1             .003         -.252**
                               Sig. (2-tailed)         .034                     .948         .000
                               N                       474        474           474          474
Months since Hire              Pearson Correlation     .084       .003          1            .047
                               Sig. (2-tailed)         .067       .948                       .303
                               N                       474        474           474          474
Educational Level (years)      Pearson Correlation     .661**     -.252**       .047         1
                               Sig. (2-tailed)         .000       .000          .303
                               N                       474        474           474          474

*. Correlation is significant at the 0.05 level (2-tailed).
**. Correlation is significant at the 0.01 level (2-tailed).

Nonparametric Correlations

Correlations (Spearman's rho)

                                                         Current    Previous      Months       Educational
                                                         Salary     Experience    since Hire   Level (years)
                                                                    (months)
Current Salary                 Correlation Coefficient   1.000      -.023         .105*        .688**
                               Sig. (2-tailed)           .          .625          .023         .000
                               N                         474        474           474          474
Previous Experience (months)   Correlation Coefficient   -.023      1.000         .008         -.121**
                               Sig. (2-tailed)           .625       .             .856         .008
                               N                         474        474           474          474
Months since Hire              Correlation Coefficient   .105*      .008          1.000        .051
                               Sig. (2-tailed)           .023       .856          .            .273
                               N                         474        474           474          474
Educational Level (years)      Correlation Coefficient   .688**     -.121**       .051         1.000
                               Sig. (2-tailed)           .000       .008          .273         .
                               N                         474        474           474          474

*. Correlation is significant at the 0.05 level (2-tailed).
**. Correlation is significant at the 0.01 level (2-tailed).

Q1. Is there a relationship between the education and salary variables?

Q2. Is there a relationship between the education and months since hire variables?

Q3. Is there a relationship between the education and previous experience variables?

Note: we can draw a scatterplot of any two variables for two different groups as follows:

Choose Graphs > Interactive > Scatterplot...


Complete the Create Scatterplot dialog box as shown below.

Click Fit and choose Regression as the method.

Click OK; the results follow.


Interactive Graph

[Figure: scatterplots of Current Salary ($40,000 to $160,000) against Beginning Salary ($20,000 to $80,000) in two panels, Female and Male, each with a fitted linear regression line. Fitted lines as printed on the chart: Current Salary = 438.51 + 1.95 * salbegin, R-Square = 0.58; Current Salary = 4083.08 + 1.84 * salbegin, R-Square = 0.74.]


Selecting cases

In this section we demonstrate how SPSS can be used to select n cases from a finite population of interest, either by a condition or by simple random sampling.

Example: select the cases for male students.

Choose from the menus:

Data > Select Cases...

Select If condition is satisfied.

Click If.

Select gender so that it is pasted into the expression area.

Select "=" on the calculator pad.

To complete the expression, type 1.

Click Continue.

Click OK in the Select Cases dialog box.

The figure below shows the 10 cases.
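The condition gender = 1 acts as a filter on the rows of the data file. As a rough analogy in Python, with hypothetical cases and an assumed coding of 1 = male and 2 = female:

```python
# Hypothetical cases; gender is coded 1 = male, 2 = female (an assumed coding).
cases = [
    {"id": 1, "gender": 1},
    {"id": 2, "gender": 2},
    {"id": 3, "gender": 1},
    {"id": 4, "gender": 2},
]

# Keep only the cases that satisfy the condition gender = 1.
selected = [case for case in cases if case["gender"] == 1]

print([case["id"] for case in selected])   # [1, 3]
```

Subsequent analyses would then use only the selected cases.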

Remarks:

1. To select cases at random, follow this procedure:

Select Random sample of cases and click Sample.

If you want to select 50% of the cases at random, choose Approximately and type 50 in the box.

If you want to select 5 cases from the first 10 cases, choose Exactly, type 5 in the first box, and type 10 in the second box.

Click OK in the Select Cases dialog box.

2. You can also select the cases that fall within an inclusive case (row) range or a date/time range. Date and time ranges are only available for time-series data with defined date variables (Data menu, Define Dates). All values must be positive integers.



Select Based on time or case range and click Range.

If you want to select from the third case to the tenth case, type 3 in the box "First Case" and 10 in the box "Last Case".

Click Continue.

Click OK in the Select Cases dialog box.
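The three selection modes in the remarks (approximately 50%, exactly 5 of the first 10, and a case range) can be imitated in a few lines of Python. This is only an analogy on hypothetical case numbers 1 to 20; the seed value is arbitrary and is there so that the sketch gives the same sample each time it is run.

```python
import random

cases = list(range(1, 21))      # hypothetical case (row) numbers 1..20
rng = random.Random(2012)       # arbitrary fixed seed, for repeatability

# Approximately 50%: each case is kept independently with probability 0.5,
# so the number selected varies around half of the cases.
approx = [c for c in cases if rng.random() < 0.5]

# Exactly 5 cases drawn at random from the first 10 cases.
exact = sorted(rng.sample(cases[:10], 5))

# Case range: the third through the tenth case, inclusive.
in_range = cases[2:10]

print(len(approx), exact, in_range)
```

The "approximately" option fixes a selection probability rather than a count, which is why only the "exactly" option guarantees the sample size.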

Inference for Distributions

Hypothesis Test for One Population Mean

One sample t test for population Mean

Definition: The One-Sample T Test compares the mean score of a sample to a known value. Usually, the known value is a population mean.

Definition: Null hypothesis and alternative hypothesis

Null hypothesis: a hypothesis to be tested. We use the symbol H0 to represent the null hypothesis.

Alternative hypothesis: a hypothesis to be considered as an alternative to the null hypothesis. We use the symbol Ha to represent the alternative hypothesis.

Hypotheses:

Null: There is no significant difference between the sample mean and the population mean.

Alternative: There is a significant difference between the sample mean and the population mean.

47


We present two step-by-step procedures for performing a one-sample t test. Procedure (I) covers the critical-value approach, and Procedure (II) covers the p-value approach.

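One sample t test for population mean (critical-value approach)

Assumptions

1. Normal population or large sample

2. $\sigma$ unknown

Step 1: The null hypothesis is $H_0: \mu = \mu_0$ and the alternative hypothesis is $H_a: \mu \neq \mu_0$ (two-tailed), $H_a: \mu < \mu_0$ (left-tailed), or $H_a: \mu > \mu_0$ (right-tailed).

Step 2: Decide on the significance level, $\alpha$.

Step 3: Compute the value of the test statistic

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

Step 4: The critical value(s) are $\pm t_{\alpha/2}$ (two-tailed), $-t_{\alpha}$ (left-tailed), or $t_{\alpha}$ (right-tailed), with degrees of freedom df = n - 1.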


Step 5: If the value of the test statistic falls in the rejection region, reject H0; otherwise, fail to reject H0.

Step 6: Interpret the results of the hypothesis test.

Example: The table below shows the pH levels for 15 lakes. Test whether the mean pH of the lakes is greater than 6 at the 5% significance level (use the critical-value approach).

7.2 7.3 6.1 6.9 6.6 7.3 6.3 5.5
6.3 6.5 5.7 6.9 6.7 7.9 5.8

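Solution:

Step 1: State the null and alternative hypotheses.

$H_0: \mu = 6$ (mean pH level is not greater than 6)

$H_a: \mu > 6$ (mean pH level is greater than 6)

Step 2: Decide on the significance level: $\alpha = 0.05$.

Step 3: Compute the value of the test statistic. With $\bar{x} = 6.6$, $s = 0.6719$ and $n = 15$,

$$t = \frac{6.6 - 6}{0.6719 / \sqrt{15}} = 3.458$$

Step 4: The critical value for a right-tailed test is $t_{0.05} = 1.761$ (from the table) with df = 15 - 1 = 14.

Step 5: The value of the test statistic found in step 3, t = 3.458, falls in the rejection region. Consequently, we reject H0.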
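As a cross-check of the lakes example, the one-sample t statistic, t = (mean - 6) / (s / sqrt(n)), can be computed directly from the 15 pH values. This is a minimal sketch in Python rather than SPSS; the critical value 1.761 is simply copied from a t table (df = 14, right tail area 0.05) and is not computed by the code.

```python
import math
import statistics

# pH levels for the 15 lakes, as listed in the example.
ph = [7.2, 7.3, 6.1, 6.9, 6.6, 7.3, 6.3, 5.5,
      6.3, 6.5, 5.7, 6.9, 6.7, 7.9, 5.8]

mu0 = 6.0                         # hypothesized mean
n = len(ph)
mean = statistics.mean(ph)        # sample mean
sd = statistics.stdev(ph)         # sample standard deviation (n - 1 divisor)
t_stat = (mean - mu0) / (sd / math.sqrt(n))

t_crit = 1.761                    # table value, df = 14, alpha = 0.05 (right-tailed)
reject_h0 = t_stat > t_crit

print(f"mean={mean:.3f} sd={sd:.4f} t={t_stat:.3f} reject H0: {reject_h0}")
```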

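One sample t test for population mean (p-value approach)

Assumptions

1. Normal population or large sample

2. $\sigma$ unknown

Steps 1 to 3 are the same as in the critical-value approach: state $H_0$ and $H_a$, decide on the significance level $\alpha$, and compute $t = (\bar{x} - \mu_0) / (s / \sqrt{n})$.

Step 4: Find the p-value by using the t table with degrees of freedom df = n - 1.

Step 5: If the p-value is less than or equal to the significance level ($p \le \alpha$), reject H0; otherwise, fail to reject H0.

Step 6: Interpret the results of the hypothesis test.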


Example: Using the pH levels for the 15 lakes in the previous example, test whether the mean pH of the lakes is greater than 6 at the 5% significance level (use the p-value approach).

Solution:

Step 1: State the null and alternative hypotheses.

$H_0: \mu = 6$ (mean pH level is not greater than 6)

$H_a: \mu > 6$ (mean pH level is greater than 6)

Step 2: Decide on the significance level: $\alpha = 0.05$.

Step 3: Compute the value of the test statistic: t = 3.458.

Step 4: The p-value is P(t >= 3.458) = 0.00192 (with df = 15 - 1 = 14).

Step 5: The p-value is less than 0.05, so we reject H0.

Interval Estimation

Interval Estimation of a Population Mean: $\sigma$ Unknown

Interval estimate:

$$\bar{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}}$$

where

$1 - \alpha$ = the confidence coefficient

$t_{\alpha/2}$ = the t value providing an area of $\alpha/2$ in the upper tail of a t distribution with n - 1 degrees of freedom

s = the sample standard deviation

n = the sample size

Example: Suppose that a sample of employee salaries gives n = 10, mean = $550 and standard deviation = $60. We want a 95% confidence interval for the population mean, assuming the population is normally distributed.

Solution: At 95% confidence, 1 - α = .95, α = .05, and α/2 = .025.

The value t.025 is based on n - 1 = 10 - 1 = 9 degrees of freedom.

In the t distribution table we see that t.025 = 2.262.

$$\bar{x} \pm t_{.025} \frac{s}{\sqrt{n}} = 550 \pm 2.262 \frac{60}{\sqrt{10}} = 550 \pm 42.92$$

or $507.08 to $592.92.

We are 95% confident that the mean salary of the population is between $507.08 and $592.92.
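The arithmetic of the salary example can be checked with a few lines of Python. This is only an illustrative sketch: the value 2.262 is copied from the t table (9 degrees of freedom, upper tail area .025) rather than computed.

```python
import math

n, mean, sd = 10, 550.0, 60.0   # sample size, sample mean, sample standard deviation
t_crit = 2.262                  # table value t(.025) with 9 degrees of freedom

margin = t_crit * sd / math.sqrt(n)        # margin of error
lower, upper = mean - margin, mean + margin

print(f"margin = {margin:.2f}, interval = [{lower:.2f}, {upper:.2f}]")
```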

Use SPSS program

Example 1: Use the SPSS program to perform the hypothesis test in the previous example.

Step 1: Enter the data as shown below.


Step 3: The results are shown below.

One-Sample Statistics

      N    Mean    Std. Deviation   Std. Error Mean
PH    15   6.600   .6719            .1735

One-Sample Test (Test Value = 6)

                                                       95% Confidence Interval of the Difference
      t       df   Sig. (2-tailed)   Mean Difference   Lower   Upper
PH    3.459   14   .004              .600              .228    .972

Example 2: Use the SPSS file called training to test whether the mean training time equals 60 days, and find a 95% confidence interval for the population mean.

Solution:

Step 1: State the null and alternative hypotheses.

$H_0: \mu = 60$ (mean training time equals 60 days)

$H_a: \mu \neq 60$ (mean training time does not equal 60 days)

Step 2: Decide on the significance level: $\alpha = 0.05$.

Step 3: Compute the value of the test statistic. From the output, t = -3.482.

Step 4: The p-value is 2 P(t >= 3.482) = 0.004 (with df = 15 - 1 = 14).

Step 5: The value of the test statistic found in step 3, t = -3.482, falls in the rejection region because it lies outside the interval (-2.14, 2.14). Consequently, we reject H0. Equivalently, the p-value = 0.004 < 0.05, so we reject H0.


SPSS output:

One-Sample Statistics

        N    Mean    Std. Deviation   Std. Error Mean
TIME    15   53.87   6.823            1.762

One-Sample Test (Test Value = 60)

                                                          95% Confidence Interval of the Difference
        t        df   Sig. (2-tailed)   Mean Difference   Lower   Upper
TIME    -3.482   14   .004              -6.13             -9.91   -2.35

95% confidence interval for the population mean


SPSS output

95% Confidence Interval for Mean: Lower Bound 50.09, Upper Bound 57.65

The interval is [50.09, 57.65]. Note that the test value 60 is not included in the confidence interval, so we reject the null hypothesis.

NONPARAMETRIC TEST

Use Sign Test (Binomial Test)


[Figure: histogram of Hemoglobin, with frequency on the vertical axis. Std. Dev = 2.15, Mean = 8.9, N = 15.00]

Tests of Normality

              Kolmogorov-Smirnov          Shapiro-Wilk
              Statistic   df   Sig.       Statistic   df   Sig.
Hemoglobin    .236        15   .024       .858        15   .023


Binomial Test

                        Category   N    Observed Prop.   Test Prop.   Exact Sig. (2-tailed)
Hemoglobin   Group 1    <= 8       4    .29              .50          .180
             Group 2    > 8        10   .71
             Total                 14   1.00
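The exact two-tailed significance in the Binomial Test table can be reproduced from the counts alone (4 values in Group 1 out of 14, test proportion .50). The sketch below is one way to do this in Python; doubling the lower-tail probability is appropriate here because the test proportion is .50, which makes the binomial distribution symmetric.

```python
from math import comb

n = 14   # total number of cases in the two groups
k = 4    # cases in the smaller group (Hemoglobin <= 8)

# P(X <= k) for X ~ Binomial(n, 0.5)
p_lower = sum(comb(n, i) for i in range(k + 1)) / 2 ** n

# Two-tailed exact p-value for a symmetric binomial distribution.
p_two_sided = min(1.0, 2 * p_lower)

print(f"exact two-tailed p-value = {p_two_sided:.3f}")
```

The result agrees with the value .180 reported by SPSS, so at the 5% level the null hypothesis is not rejected.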



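Inference for Two Population Means

The pooled t test for two population means (critical-value approach)

Assumptions

1. Independent samples

2. Normal populations or large samples

3. Equal population standard deviations

Step 1: The null hypothesis is $H_0: \mu_1 = \mu_2$ and the alternative hypothesis is $H_a: \mu_1 \neq \mu_2$ (two-tailed), $H_a: \mu_1 < \mu_2$ (left-tailed), or $H_a: \mu_1 > \mu_2$ (right-tailed).

Step 2: Decide on the significance level, $\alpha$.

Step 3: Compute the value of the test statistic

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

where

$$s_p = \sqrt{\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}}$$

Step 4: The critical value(s) are $\pm t_{\alpha/2}$ (two-tailed), $-t_{\alpha}$ (left-tailed), or $t_{\alpha}$ (right-tailed), with degrees of freedom df = n1 + n2 - 2.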



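The pooled t test for two population means (p-value approach)

Assumptions

1. Independent samples

2. Normal populations or large samples

3. Equal population standard deviations

Steps 1 to 3 are the same as in the critical-value approach, with test statistic

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

where

$$s_p = \sqrt{\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}}$$

Step 4: The t statistic has df = n1 + n2 - 2. Use a table to estimate the p-value, or obtain it exactly by using technology.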


Step 5: If the p-value is less than or equal to the significance level ($p \le \alpha$), reject H0; otherwise, fail to reject H0.

Step 6: Interpret the results of the hypothesis test.

Example: We perform a hypothesis test to decide whether there is a difference between the mean salaries of faculty in public and private institutions. Independent random samples of 30 faculty members in public institutions and 35 faculty members in private institutions yield the data in the table below. At the 5% significance level, do the data provide sufficient evidence to conclude that the mean salaries for faculty in public and private institutions differ?

Annual salary ($1000s) for 30 faculty members in public institutions and 35 faculty members in private institutions

Sample 1 (public institutions)

34.2    56.8    58.2    29.2    60.2
90.0    41.4    76.8    15.8    88.2
100.4   35.0    84.2    33.8    44.6
24.6    54.2    79.4    40.2    64.4
107.4   24.4    42.2    51.2    74.0
63.6    56.0    81.8    41.2    71.0

Sample 2 (private institutions)

92.9    62.9    45.2    66.3    47.2    71.0
52.0    53.8    76.0    31.1    59.3    97.3
63.1    101.0   56.1    71.1    97.5    92.6
118.5   68.6    77.6    73.5    27.2    56.0
37.7    51.5    61.6    67.6    81.2    62.3
102.2   46.4    78.3    52.4    24.8

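Solution:

Summary statistics for the samples

                        n    Mean     Std. Deviation
Public institutions     30   57.480   23.9528
Private institutions    35   66.394   22.2611

Step 1: State the null hypothesis and the alternative hypothesis.

$H_0: \mu_1 = \mu_2$ (mean salaries are the same)

$H_a: \mu_1 \neq \mu_2$ (mean salaries are different)

Step 2: Decide on the significance level: $\alpha = 0.05$.

Step 3: Compute the value of the test statistic

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} = \frac{57.480 - 66.394}{23.055 \sqrt{\frac{1}{30} + \frac{1}{35}}} = -1.554$$

where

$$s_p = \sqrt{\frac{29 (23.9528)^2 + 34 (22.2611)^2}{63}} = 23.055$$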


Critical-value approach

Step 4: The critical values are $\pm t_{\alpha/2}$ with degrees of freedom df = n1 + n2 - 2. From a table, the critical values with df = 30 + 35 - 2 = 63 are $\pm t_{0.025} = \pm 1.998$.

Step 5: If the value of the test statistic falls in the rejection region, reject H0; otherwise, fail to reject H0. From step 3, the value of the test statistic is t = -1.554, which does not fall in the rejection region, so we do not reject H0.

Step 6: Interpret the results of the hypothesis test. At the 5% significance level, the data do not provide sufficient evidence to conclude that a difference exists between the mean salaries of faculty in public and private institutions.

p-value approach

Step 4: From a table (with df = 63), the two-tailed p-value is greater than 0.10 and less than 0.20 (0.10 < p < 0.20). Using technology, we obtain the p-value = 2 P(t >= 1.554) = 0.125 (with df = 63).


Step 5: The p-value = 0.125 > 0.05, so we do not reject H0.

Step 6: At the 5% significance level, the data do not provide sufficient evidence to conclude that a difference exists between the mean salaries of faculty in public and private institutions.

Use SPSS program

Use the SPSS program to perform the hypothesis test in the previous example.

Step 1: Enter the data as shown below.

Step 3: The results are shown below.


Tests of Normality

                  Kolmogorov-Smirnov          Shapiro-Wilk
         TYPE     Statistic   df   Sig.       Statistic   df   Sig.
SALARY   PRIV     .080        35   .200*      .980        35   .755
         PUBL     .105        30   .200*      .975        30   .680

*. This is a lower bound of the true significance.


Group Statistics

         TYPE   N    Mean     Std. Deviation   Std. Error Mean
SALARY   PUBL   30   57.480   23.9528          4.3732
         PRIV   35   66.394   22.2611          3.7628

Independent Samples Test (SALARY)

                                              Equal variances   Equal variances
                                              assumed           not assumed
Levene's Test for Equality of Variances
  F                                           .458
  Sig.                                        .501
t-test for Equality of Means
  t                                           -1.554            -1.545
  df                                          63                59.853
  Sig. (2-tailed)                             .125              .128
  Mean Difference                             -8.914            -8.914
  Std. Error Difference                       5.7363            5.7692
  95% Confidence Interval of the Difference
    Lower                                     -20.3774          -20.4549
    Upper                                     2.5488            2.6264
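The "equal variances assumed" figures in the output can be checked by hand from the Group Statistics alone. The following Python sketch applies the pooled-variance formula to the sample sizes, means and standard deviations reported by SPSS; it does not compute the p-value.

```python
import math

# Summary statistics taken from the Group Statistics table.
n1, mean1, s1 = 30, 57.480, 23.9528   # public institutions
n2, mean2, s2 = 35, 66.394, 22.2611   # private institutions

df = n1 + n2 - 2
# Pooled standard deviation.
sp = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df)
# Standard error of the difference between the two sample means.
se_diff = sp * math.sqrt(1 / n1 + 1 / n2)
t_stat = (mean1 - mean2) / se_diff

print(f"df={df} sp={sp:.3f} se={se_diff:.4f} t={t_stat:.3f}")
```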

Mann-Whitney-Wilcoxon Test

This test is another nonparametric method for determining whether there is a difference between two populations.

This test, unlike the Wilcoxon signed-rank test, is not based on a matched sample.

This test does not require interval data or the assumption that both populations are normally distributed.

The only requirement is that the measurement scale for the data is at least ordinal



Instead of testing for the difference between the means of two populations, this method tests to determine whether the two populations are identical.

The hypotheses are:

H0: The two populations are identical

Ha: The two populations are not identical

Example: Consider the following two independent groups:

Rank the data from lowest to highest, as in the following table.


So U = 7.

Making the decision

We reject H0 when the value of U is less than or equal to the critical value U0. From the Mann-Whitney table, U0 = 3. Since U = 7 > 3, we fail to reject the null hypothesis; that is, the data do not provide evidence that the two populations differ.

Notes:1.When the alternative hypothesis we reject HO IF U1 <UO

2.When the alternative hypothesis we reject HO IF U2 <UO

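3. When the size of one or both samples is large (> 20), we use the standard normal approximation:

$$Z = \frac{U - \mu_U}{\sigma_U}$$

where $\mu_U = \frac{n_1 n_2}{2}$ and $\sigma_U = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}$.

We reject H0 when the absolute value of the calculated Z is greater than the critical (tabulated) value of Z.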
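The value U = 7 and the Z statistic can be recovered from the rank sums of the two groups. The sketch below uses the sample sizes and rank sums that SPSS reports for this example (group E: n = 4, sum of ranks 17; group C: n = 5, sum of ranks 28). It uses the standard relation U = n1 n2 + n(n + 1)/2 - R for each group and no correction for ties, which is an assumption about how the hand calculation was done.

```python
import math

n1, r1 = 4, 17.0   # group E: sample size and sum of ranks
n2, r2 = 5, 28.0   # group C: sample size and sum of ranks

# U statistic for each group, computed from its rank sum.
u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
u = min(u1, u2)

# Normal approximation (shown for illustration; the samples here are small).
mu_u = n1 * n2 / 2
sigma_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
z = (u - mu_u) / sigma_u

print(f"U1={u1} U2={u2} U={u} Z={z:.3f}")
```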

Use SPSS program

Use the SPSS program to perform the hypothesis test in the previous example.

Step 1: Enter the data as shown below.


Step 2: The results are shown below.

Mann-Whitney Test

Ranks

         TYPE      N   Mean Rank   Sum of Ranks
SCORES   group E   4   4.25        17.00
         group C   5   5.60        28.00
         Total     9

Test Statistics (b)

                                   SCORES
Mann-Whitney U                     7.000
Wilcoxon W                         17.000
Z                                  -.735
Asymp. Sig. (2-tailed)             .462
Exact Sig. [2*(1-tailed Sig.)]     .556 (a)

a. Not corrected for ties.

b. Grouping Variable: TYPE


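Inference for Two Population Means

The paired t test for two population means (critical-value approach)

Assumptions

1. Paired samples

2. Normal populations or large samples

Step 1: The null hypothesis is $H_0: \mu_1 = \mu_2$ and the alternative hypothesis is $H_a: \mu_1 \neq \mu_2$ (two-tailed), $H_a: \mu_1 < \mu_2$ (left-tailed), or $H_a: \mu_1 > \mu_2$ (right-tailed).

Step 2: Decide on the significance level, $\alpha$.

Step 3: Compute the value of the test statistic

$$t = \frac{\bar{d}}{s_d / \sqrt{n}}$$

where d = paired difference, and $\bar{d}$ and $s_d$ are the mean and standard deviation of the paired differences.

Step 4: The critical value(s) are $\pm t_{\alpha/2}$ (two-tailed), $-t_{\alpha}$ (left-tailed), or $t_{\alpha}$ (right-tailed), with degrees of freedom df = n - 1.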



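The paired t test for two population means (p-value approach)

Assumptions

1. Paired samples

2. Normal populations or large samples

Steps 1 to 3 are the same as in the critical-value approach, with test statistic

$$t = \frac{\bar{d}}{s_d / \sqrt{n}}$$

where d = paired difference, and $\bar{d}$ and $s_d$ are the mean and standard deviation of the paired differences.

Step 4: The t statistic has df = n - 1. Use a table to estimate the p-value, or obtain it exactly by using technology.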


Step 5: If the p-value is less than or equal to the significance level ($p \le \alpha$), reject H0; otherwise, fail to reject H0.

Step 6: Interpret the results of the hypothesis test.

Example: The gas mileages of 10 randomly selected cars, both with and without a new gasoline additive, are displayed in the second and third columns of the table below. At the 5% significance level, do the data provide sufficient evidence to conclude that, on average, the gasoline additive improves gas mileage?

Car   Gas mileage with additive   Gas mileage without additive   Paired difference
1     25.7                        24.9                           0.8
2     20.0                        18.8                           1.2
3     28.4                        27.7                           0.7
4     13.7                        13.0                           0.7
5     18.8                        17.0                           1.8
6     12.5                        11.3                           1.2
7     28.4                        27.8                           0.6
8     8.1                         8.2                            -0.1
9     23.1                        23.1                           0
10    10.4                        9.9                            0.5


Solution:

Step 1: State the null hypothesis and the alternative hypothesis. Let $\mu_1$ denote the mean gas mileage when the additive is used, and let $\mu_2$ denote the mean gas mileage when the additive is not used.

$H_0: \mu_1 = \mu_2$ (mean gas mileage with additive is not greater)

$H_a: \mu_1 > \mu_2$ (mean gas mileage with additive is greater)

Note that the hypothesis test is right-tailed because a greater-than sign (>) appears in the alternative hypothesis.

Step 2: Decide on the significance level: $\alpha = 0.05$.

Step 3: Compute the value of the test statistic. With $\bar{d} = 0.74$, $s_d = 0.566$ and n = 10,

$$t = \frac{\bar{d}}{s_d / \sqrt{n}} = \frac{0.74}{0.566 / \sqrt{10}} = 4.134$$

Critical-value approach

Step 4: The critical value is $t_{\alpha}$ with degrees of freedom df = n - 1. From a table, the critical value with df = 10 - 1 = 9 is $t_{0.05} = 1.833$.

Step 5: If the value of the test statistic falls in the rejection region, reject H0; otherwise, fail to reject H0.


From step 3, the value of the test statistic is t = 4.134, which falls in the rejection region, so we reject H0.

Step 6: Interpret the results of the hypothesis test. At the 5% significance level, the data provide sufficient evidence to conclude that the gasoline additive improves gas mileage.

p-value approach

Step 4: From a table (with df = 10 - 1 = 9), the right-tailed p-value is the probability of observing a value of t of 4.134 or greater. We find that p < 0.005 (using technology, we obtain p = .0013).

Step 5: The p-value is less than 0.05, so we reject H0.

Step 6: At the 5% significance level, the data provide sufficient evidence to conclude that the gasoline additive improves gas mileage.

Use SPSS program

Use the SPSS program to perform the hypothesis test in the previous example.

Step 1: Enter the data as shown below.



T-Test

Paired Samples Statistics

                    Mean      N    Std. Deviation   Std. Error Mean
Pair 1   ADDITI     18.9100   10   7.47209          2.36288
         WITH.ADD   18.1700   10   7.42848          2.34909

Paired Samples Correlations

                              N    Correlation   Sig.
Pair 1   ADDITI & WITH.ADD    10   .997          .000

Paired Samples Test

                              Paired Differences
                              Mean    Std. Deviation   t       df   Sig. (2-tailed)
Pair 1   ADDITI - WITH.ADD    .7400   .56608           4.134   9    .003
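The paired t statistic for the gas mileage example can be reproduced directly from the ten pairs of values. This is a minimal Python sketch, not SPSS output; the critical value 1.833 is copied from a t table (df = 9, right tail area 0.05).

```python
import math
import statistics

with_additive = [25.7, 20.0, 28.4, 13.7, 18.8, 12.5, 28.4, 8.1, 23.1, 10.4]
without_additive = [24.9, 18.8, 27.7, 13.0, 17.0, 11.3, 27.8, 8.2, 23.1, 9.9]

# Paired differences (with additive minus without additive).
d = [a - b for a, b in zip(with_additive, without_additive)]
n = len(d)
d_bar = statistics.mean(d)      # mean of the paired differences
s_d = statistics.stdev(d)       # standard deviation of the paired differences
t_stat = d_bar / (s_d / math.sqrt(n))

t_crit = 1.833                  # table value, df = 9, alpha = 0.05 (right-tailed)
reject_h0 = t_stat > t_crit

print(f"d_bar={d_bar:.2f} s_d={s_d:.5f} t={t_stat:.3f} reject H0: {reject_h0}")
```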


Example:



From the Wilcoxon signed-ranks table, the critical value at N = 14 and $\alpha = 0.05$ is 21. Since W = 6 < 21, we reject the null hypothesis.

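Note: If n > 15, we can use the normal distribution for the Wilcoxon signed-ranks test, where the Z statistic is

$$Z = \frac{W - \mu_W}{\sigma_W}$$

where $\mu_W = \frac{n(n+1)}{4}$ and $\sigma_W = \sqrt{\frac{n(n+1)(2n+1)}{24}}$.

We reject H0 if the absolute value of the calculated Z is greater than the critical (tabulated) value of Z.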
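To illustrate the normal approximation, the sketch below computes Z for W = 7 and n = 12, the values that appear in the DRUG_B - DRUG_A output of the SPSS example (SPSS reports the approximation even for this small n). It uses the standard mean n(n + 1)/4 and variance n(n + 1)(2n + 1)/24 without a tie correction, so a small difference from the SPSS value of -2.511 is to be expected.

```python
import math

n = 12    # number of non-zero differences
w = 7.0   # smaller sum of ranks (sum of the negative ranks)

mu_w = n * (n + 1) / 4
sigma_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
z = (w - mu_w) / sigma_w

z_crit = 1.96   # two-tailed critical value at the 5% level
reject_h0 = abs(z) > z_crit

print(f"mu_W={mu_w} sigma_W={sigma_w:.3f} Z={z:.3f} reject H0: {reject_h0}")
```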

Example using SPSS program

Wilcoxon Signed Ranks Test

Ranks

                                    N       Mean Rank   Sum of Ranks
DRUG_B - DRUG_A   Negative Ranks    3 (a)   2.33        7.00
                  Positive Ranks    9 (b)   7.89        71.00
                  Ties              0 (c)
                  Total             12

a. DRUG_B < DRUG_A

b. DRUG_B > DRUG_A

c. DRUG_A = DRUG_B

Test Statistics (b)

                          DRUG_B - DRUG_A
Z                         -2.511 (a)
Asymp. Sig. (2-tailed)    .012

a. Based on negative ranks.

b. Wilcoxon Signed Ranks Test


SPSS output

Wilcoxon Signed Ranks Test

Ranks

Cases per 100,000 population, 1993 - Cases per 100,000 population, 1992

                   N        Mean Rank   Sum of Ranks
Negative Ranks     56 (a)   79.35       4443.50
Positive Ranks     89 (b)   69.01       6141.50
Ties               62 (c)
Total              207

a. Cases per 100,000 population, 1993 < Cases per 100,000 population, 1992

b. Cases per 100,000 population, 1993 > Cases per 100,000 population, 1992

c. Cases per 100,000 population, 1992 = Cases per 100,000 population, 1993

Test Statistics (b)

Cases per 100,000 population, 1993 - Cases per 100,000 population, 1992

Z     -1.678 (a)

a. Based on negative ranks.


Wilcoxon Signed Ranks Test

Ranks

                                        N        Mean Rank   Sum of Ranks
Pronethalol - Placebo   Negative Ranks  11 (a)   6.09        67.00
                        Positive Ranks  1 (b)    11.00       11.00
                        Ties            0 (c)
                        Total           12

a. Pronethalol < Placebo

b. Pronethalol > Placebo

c. Placebo = Pronethalol

Test Statistics (b)

Pronethalol - Placebo

Z     -2.201 (a)

a. Based on positive ranks.


Analysis of Variance (ANOVA)

Analysis of Variance (ANOVA) can be used to test for the equality of three or more population means using data obtained from observational or experimental studies.

We want to use the sample results to test the following hypotheses.

$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k$

$H_a$: Not all population means are equal

If H0 is rejected, we cannot conclude that all population means are different.

Rejecting H0 means that at least two population means have different values.

Assumptions for Analysis of Variance

1. For each population, the response variable is normally distributed.

2. The variance of the response variable, denoted $\sigma^2$, is the same for all of the populations.

3. The observations must be independent

Between-Samples Estimate of Population Variance

A between-samples estimate of $\sigma^2$ is called the mean square between (MSB):

$$MSB = \frac{\sum_{j=1}^{k} n_j (\bar{x}_j - \bar{\bar{x}})^2}{k - 1}$$

The numerator of MSB is called the sum of squares between (SSB).

The denominator of MSB represents the degrees of freedom associated with SSB.

Within-Samples Estimate of Population Variance

The estimate of $\sigma^2$ based on the variation of the sample observations within each sample is called the mean square within (MSW):

$$MSW = \frac{\sum_{j=1}^{k} (n_j - 1) s_j^2}{n_T - k}$$

The numerator of MSW is called the sum of squares within (SSW). The denominator of MSW represents the degrees of freedom associated with SSW.

Comparing the Variance Estimates: The F Test

If the null hypothesis is true and the ANOVA assumptions are valid, the sampling distribution of MSB/MSW is an F distribution with MSB d.f. equal to k - 1 and MSW d.f. equal to nT - k.

If the means of the k populations are not equal, the value of MSB/MSW will be inflated because MSB overestimates $\sigma^2$.

Hence, we will reject H0 if the resulting value of MSB/MSW appears to be too large to have been selected at random from the appropriate F distribution.

Test for the Equality of k Population Means

Hypotheses

$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k$

$H_a$: Not all population means are equal

Test Statistic

F = MSB/MSW

Rejection Rule

Reject H0 if $F > F_{\alpha}$, where the value of $F_{\alpha}$ is based on an F distribution with k - 1 numerator degrees of freedom and nT - k denominator degrees of freedom.

Sampling Distribution of MSB/MSW

The figure below shows the rejection region associated with a level of significance equal to $\alpha$, where $F_{\alpha}$ denotes the critical value.

The ANOVA Table

SST divided by its degrees of freedom nT - 1 is simply the overall sample variance that would be obtained if we treated the entire nT observations as one data set.



Example: Reed Manufacturing

We would like to know if the mean number of hours worked per week is the same for the department managers at the three manufacturing plants (Buffalo, Pittsburgh, and Detroit).

A simple random sample of 5 managers from each of the three plants was taken, and the number of hours worked by each manager in the previous week is given in the file ONE WAY ANOVA.

Hypotheses

$H_0: \mu_1 = \mu_2 = \mu_3$

$H_a$: Not all the means are equal

where:

$\mu_1$ = mean number of hours worked per week by the managers at Plant 1

$\mu_2$ = mean number of hours worked per week by the managers at Plant 2

$\mu_3$ = mean number of hours worked per week by the managers at Plant 3

• Mean Square Between

Since the sample sizes are all equal, $\bar{\bar{x}} = (55 + 68 + 57)/3 = 60$.

$SSB = 5(55 - 60)^2 + 5(68 - 60)^2 + 5(57 - 60)^2 = 490$

MSB = 490/(3 - 1) = 245

• Mean Square Within

SSW = 4(26.0) + 4(26.5) + 4(24.5) = 308

MSW = 308/(15 - 3) = 25.667

• F Test

If H0 is true, the ratio MSB/MSW should be near 1, since both MSB and MSW are estimating $\sigma^2$. If Ha is true, the ratio should be significantly larger than 1, since MSB tends to overestimate $\sigma^2$.

• Rejection Rule

Assuming $\alpha = .05$, $F_{.05} = 3.89$ (2 d.f. numerator, 12 d.f. denominator). Reject H0 if F > 3.89.

• Test Statistic

F = MSB/MSW = 245/25.667 = 9.55

• Conclusion

F = 9.55 > F.05 = 3.89, and the p-value = P(F > 9.55) = 0.0033 < 0.05, so we reject H0. The mean number of hours worked per week by department managers is not the same at each plant.
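The Reed Manufacturing calculation can be retraced from the summary statistics used above: three samples of 5 managers with sample means 55, 68, 57 and sample variances 26.0, 26.5, 24.5. The Python sketch below follows the same steps (SSB, MSB, SSW, MSW, F); the critical value 3.89 is taken from the F table, not computed.

```python
# Summary statistics per plant: (sample size, sample mean, sample variance).
plants = {
    "Buffalo": (5, 55.0, 26.0),
    "Pittsburgh": (5, 68.0, 26.5),
    "Detroit": (5, 57.0, 24.5),
}

k = len(plants)
n_total = sum(n for n, _, _ in plants.values())
# Overall mean, weighting each sample mean by its sample size.
grand_mean = sum(n * m for n, m, _ in plants.values()) / n_total

ssb = sum(n * (m - grand_mean) ** 2 for n, m, _ in plants.values())
ssw = sum((n - 1) * v for n, _, v in plants.values())
msb = ssb / (k - 1)
msw = ssw / (n_total - k)
f_stat = msb / msw

f_crit = 3.89   # table value F(.05) with 2 and 12 degrees of freedom
print(f"SSB={ssb} MSB={msb} SSW={ssw} MSW={msw:.3f} F={f_stat:.2f}")
print("reject H0" if f_stat > f_crit else "do not reject H0")
```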

• ANOVA Table

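Multiple Comparison Procedures

Suppose that analysis of variance has provided statistical evidence to reject the null hypothesis of equal population means. Fisher’s least significant difference (LSD) procedure can be used to determine where the differences occur.

Hypotheses

$H_0: \mu_i = \mu_j$

$H_a: \mu_i \neq \mu_j$

Test Statistic

$$t = \frac{\bar{x}_i - \bar{x}_j}{\sqrt{MSW \left( \frac{1}{n_i} + \frac{1}{n_j} \right)}}$$

Rejection Rule

Reject H0 if $t < -t_{\alpha/2}$ or $t > t_{\alpha/2}$, where the value of $t_{\alpha/2}$ is based on a t distribution with nT - k degrees of freedom.

Fisher’s LSD Procedure Based on the Test Statistic $\bar{x}_i - \bar{x}_j$

• Hypotheses

$H_0: \mu_i = \mu_j$

$H_a: \mu_i \neq \mu_j$

• Test Statistic

$\bar{x}_i - \bar{x}_j$

• Rejection Rule

Reject H0 if $|\bar{x}_i - \bar{x}_j| > LSD$, where

$$LSD = t_{\alpha/2} \sqrt{MSW \left( \frac{1}{n_i} + \frac{1}{n_j} \right)}$$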

Example: Reed Manufacturing

Fisher’s LSD

Assuming $\alpha = .05$, $t_{.025} = 2.179$ (12 degrees of freedom) and

$$LSD = 2.179 \sqrt{25.667 \left( \frac{1}{5} + \frac{1}{5} \right)} = 6.98$$

• Hypotheses (A)

$H_0: \mu_1 = \mu_2$

$H_a: \mu_1 \neq \mu_2$

• Test Statistic

$|\bar{x}_1 - \bar{x}_2| = |55 - 68| = 13$

• Conclusion

Since 13 > 6.98, the mean number of hours worked at Plant 1 is not equal to the mean number worked at Plant 2.

• Hypotheses (B)

$H_0: \mu_1 = \mu_3$

$H_a: \mu_1 \neq \mu_3$

• Test Statistic

$|\bar{x}_1 - \bar{x}_3| = |55 - 57| = 2$

• Conclusion

Since 2 < 6.98, there is no significant difference between the mean number of hours worked at Plant 1 and the mean number of hours worked at Plant 3.

• Hypotheses (C)

$H_0: \mu_2 = \mu_3$

$H_a: \mu_2 \neq \mu_3$

• Test Statistic

$|\bar{x}_2 - \bar{x}_3| = |68 - 57| = 11$

• Conclusion

Since 11 > 6.98, the mean number of hours worked at Plant 2 is not equal to the mean number worked at Plant 3.
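The three pairwise comparisons can be carried out in one pass. The sketch below assumes the usual form of Fisher's LSD, LSD = t(α/2) · sqrt(MSW (1/ni + 1/nj)), with MSW = 308/12 from the ANOVA and t(.025) = 2.179 copied from a t table with 12 degrees of freedom; the plant labels are only for display.

```python
import math
from itertools import combinations

means = {"Plant 1": 55.0, "Plant 2": 68.0, "Plant 3": 57.0}
n_per_plant = 5
msw = 308 / 12        # mean square within from the ANOVA
t_crit = 2.179        # table value t(.025), 12 degrees of freedom

# Least significant difference for two samples of equal size.
lsd = t_crit * math.sqrt(msw * (1 / n_per_plant + 1 / n_per_plant))

# A pair differs significantly when |mean_i - mean_j| exceeds LSD.
significant = {
    (a, b): abs(means[a] - means[b]) > lsd
    for a, b in combinations(means, 2)
}

print(f"LSD = {lsd:.2f}")
for pair, flag in significant.items():
    print(pair, "significant" if flag else "not significant")
```

The outcome matches the LSD rows of the SPSS Multiple Comparisons table: Plants 1 and 2 differ, Plants 2 and 3 differ, and Plants 1 and 3 do not.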

SPSS procedures: Open the file named One WAY ANOVA.


SPSS output

Descriptives

SCORES

                                                        95% Confidence Interval for Mean
             N    Mean    Std. Deviation   Std. Error   Lower Bound   Upper Bound   Minimum   Maximum
buffalo      5    55.00   5.099            2.280        48.67         61.33         48        62
PITTSBURGH   5    68.00   5.148            2.302        61.61         74.39         63        74
detroit      5    57.00   4.950            2.214        50.85         63.15         51        63
Total        15   60.00   7.550            1.949        55.82         64.18         48        74

ANOVA

SCORES

                 Sum of Squares   df   Mean Square   F       Sig.
Between Groups   490.000          2    245.000       9.545   .003
Within Groups    308.000          12   25.667
Total            798.000          14

Post Hoc Tests

Multiple Comparisons

Dependent Variable: SCORES

LSD

                                                                      95% Confidence Interval
(I) GROUP    (J) GROUP    Mean Difference (I-J)   Std. Error   Sig.   Lower Bound   Upper Bound
buffalo      PITTSBURGH   -13.00*                 3.204        .002   -19.98        -6.02
             detroit      -2.00                   3.204        .544   -8.98         4.98
PITTSBURGH   buffalo      13.00*                  3.204        .002   6.02          19.98
             detroit      11.00*                  3.204        .005   4.02          17.98
detroit      buffalo      2.00                    3.204        .544   -4.98         8.98
             PITTSBURGH   -11.00*                 3.204        .005   -17.98        -4.02

*. The mean difference is significant at the .05 level.

Example: energy consumption

Independent random samples of households in four regions yielded the data on last year's energy consumption shown in the table below. At the 5% significance level, do the data provide sufficient evidence to conclude that a difference exists in last year's mean energy consumption by households among the four regions?

Northeast   Midwest   South   West
15          17        11      10
10          12        7       12
13          18        9       8
14          13        13      7
13          15                9
            12

solution :

1. State the null and alternative hypotheses.

$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$ (mean energy consumptions are equal)

Ha: Not all the means are equal.

2. Decide on the significance level.

We perform the hypothesis test at the 5% significance level; consequently, $\alpha = 0.05$.

3. Compute the F statistic and the critical value.

From the results, the F statistic = 6.32, and from the table the critical value is F = 3.24 with df = (k - 1, n - k) = (3, 16), where k = number of populations and n = total number of observations.

4. Conclusion: If the value of the F statistic falls in the rejection region, or the p-value is less than or equal to $\alpha$, reject H0; otherwise, do not reject H0.

The F statistic = 6.32 > 3.24 (it falls in the rejection region), and the p-value = 0.00495 < 0.05, so we reject H0.

5. Interpret the results.

At the 5% significance level, the data provide sufficient evidence to conclude that a difference exists in last year's mean energy consumption by households among the four regions.

Use SPSS program

Use the SPSS program to perform the hypothesis test in the previous example.

Step 1: Enter the data as shown below.



Descriptives

ENERGY

                                                       95% Confidence Interval for Mean
            N    Mean    Std. Deviation   Std. Error   Lower Bound   Upper Bound
northeast   5    13.00   1.871            .837         10.68         15.32
midwest     6    14.50   2.588            1.057        11.78         17.22
south       4    10.00   2.582            1.291        5.89          14.11
west        5    9.20    1.924            .860         6.81          11.59
Total       20   11.90   3.076            .688         10.46         13.34

Test of Homogeneity of Variances

ENERGY

Levene Statistic   df1   df2   Sig.
.844               3     16    .490

ANOVA

ENERGY

                 Sum of Squares   df   Mean Square   F       Sig.
Between Groups   97.500           3    32.500        6.318   .005
Within Groups    82.300           16   5.144
Total            179.800          19
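The sums of squares and the F statistic in the ANOVA table can be recomputed from the raw energy consumption data. The following is a sketch in plain Python (no statistics package beyond the standard library); it reproduces the table entries but not the significance value.

```python
import statistics

# Last year's energy consumption by region, as given in the example.
energy = {
    "northeast": [15, 10, 13, 14, 13],
    "midwest": [17, 12, 18, 13, 15, 12],
    "south": [11, 7, 9, 13],
    "west": [10, 12, 8, 7, 9],
}

all_values = [x for group in energy.values() for x in group]
n_total = len(all_values)
k = len(energy)
grand_mean = statistics.mean(all_values)

# Between-groups and within-groups sums of squares.
ssb = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in energy.values())
ssw = sum(sum((x - statistics.mean(g)) ** 2 for x in g) for g in energy.values())

msb = ssb / (k - 1)
msw = ssw / (n_total - k)
f_stat = msb / msw

print(f"SSB={ssb:.1f} SSW={ssw:.1f} MSB={msb:.3f} MSW={msw:.3f} F={f_stat:.3f}")
```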


Post Hoc Tests

Multiple Comparisons

Dependent Variable: ENERGY

          (I) REGION   (J) REGION   Mean Difference (I-J)   Std. Error   Sig.
Scheffe   northeast    midwest      -1.50                   1.373        .757
                       south        3.00                    1.521        .310
                       west         3.80                    1.434        .112
          midwest      northeast    1.50                    1.373        .757
                       south        4.50                    1.464        .054
                       west         5.30*                   1.373        .013
          south        northeast    -3.00                   1.521        .310
                       midwest      -4.50                   1.464        .054
                       west         .80                     1.521        .963
          west         northeast    -3.80                   1.434        .112
                       midwest      -5.30*                  1.373        .013
                       south        -.80                    1.521        .963
Tamhane   northeast    midwest      -1.50                   1.348        .877
                       south        3.00                    1.538        .486
                       west         3.80                    1.200        .077
          midwest      northeast    1.50                    1.348        .877
                       south        4.50                    1.668        .180
                       west         5.30*                   1.363        .022
          south        northeast    -3.00                   1.538        .486
                       midwest      -4.50                   1.668        .180
                       west         .80                     1.551        .997
          west         northeast    -3.80                   1.200        .077
                       midwest      -5.30*                  1.363        .022
                       south        -.80                    1.551        .997

*. The mean difference is significant at the .05 level.


SPSS procedures:

1. Enter the data.


SPSS output

Kruskal-Wallis Test

Ranks

         GROUP   N    Mean Rank
SCORES   A       5    8.00
         B       5    11.10
         C       6    16.17
         D       6    10.08
         Total   22

Test Statistics (a, b)

              SCORES
Chi-Square    4.861
df            3
Asymp. Sig.   .182

a. Kruskal Wallis Test

b. Grouping Variable: GROUP


P-value = P($\chi^2$ > 4.861) = 0.1823 > 0.05, so we fail to reject H0.


SPSS output

Kruskal-Wallis Test

Ranks

                                     WHO Region              N     Mean Rank
Cases per 100,000 population, 1993   Africa                  46    138.82
                                     Americas                45    142.31
                                     Eastern Mediterranean   22    71.50
                                     Europe                  48    89.67
                                     South-East Asia         11    54.09
                                     Western Pacific         35    64.76
                                     Total                   207

Test Statistics (a, b)

              Cases per 100,000 population, 1993
Chi-Square    67.252
df            5
Asymp. Sig.   .000

a. Kruskal Wallis Test

b. Grouping Variable: WHO Region


Critical value = 5.99

p-value = P($\chi^2$ > 5.22) = .0735 > 0.05

So we fail to reject H0.

Contingency tables association (chi-square test)

The chi-square independence test is used to decide, based on sample data, whether two variables of a population are statistically related.

Assumptions:

1. All expected frequencies are 1 or greater.

2. At most 20% of all expected frequencies are less than 5.

3. The variables must be categorical.

Step 1: the null and alternative hypotheses are

H0 : The two variables under consideration are not associated

Ha : The two variables under consideration are associated

Step 2: Calculate the expected frequencies by using the formula

$$E = \frac{R \cdot C}{n}$$

where R = row total, C = column total, and n = sample size. Place each expected frequency below its corresponding observed frequency in the contingency table.

Step 3: Determine whether the expected frequencies satisfy assumptions 1, 2, and 3. If they do not, this procedure should not be used.

Step 4: Decide on the significance level, $\alpha$.

Step 5: Compute the value of the test statistic

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where E and O represent expected and observed frequencies, respectively.

Step 6: The critical value is $\chi^2_{\alpha}$ with df = (r - 1)(c - 1), where r and c are the numbers of categories of the two variables. Use the table to find the critical value.

Step 7: If the value of the test statistic falls in the rejection region, or the p-value is less than or equal to $\alpha$, reject H0; otherwise, do not reject H0.

Step 8: Interpret the results of the hypothesis test.

Example: A random sample of 1772 U.S. adults yielded the data on marital status and alcohol consumption displayed in the table below. At the 5% significance level, do the data provide sufficient evidence to conclude that an association exists between marital status and alcohol consumption?

Cross tabulation

                  Drinks per month
Marital status    Abstain   1-60   Over 60   Total
Single            67        213    74        354
Married           411       633    129       1173
Widowed           85        51     7         143
Divorced          27        60     15        102
Total             590       957    225       1772

Solution:

Step 1: The null and alternative hypotheses are

H0: Marital status and alcohol consumption are not associated

Ha: Marital status and alcohol consumption are associated

Step 2: Calculate the expected frequencies by using the formula

$$E = \frac{R \cdot C}{n}$$

where R = row total, C = column total, and n = sample size. Place each expected frequency below its corresponding observed frequency in the contingency table.

Cross tabulation with observed (O) and expected (E) frequencies

                      Drinks per month
Marital status        Abstain   1-60    Over 60   Total
Single       O        67        213     74        354
             E        117.9     191.2   44.9
Married      O        411       633     129       1173
             E        390.6     633.5   148.9
Widowed      O        85        51      7         143
             E        47.6      77.2    18.2
Divorced     O        27        60      15        102
             E        34.0      55.1    13.0
Total                 590       957     225       1772

Step 3: Determine whether the expected frequencies satisfy assumptions 1, 2, and 3. If they do not, this procedure should not be used.

1. All expected frequencies are greater than 1.

2. None of the expected frequencies is less than 5 (0% of cells have an expected frequency less than 5).

3. The two variables are categorical.

Step 4: Decide on the significance level: $\alpha = 0.05$.

Step 5: Compute the value of the test statistic

$$\chi^2 = \sum \frac{(O - E)^2}{E} = 94.269$$

where E and O represent expected and observed frequencies, respectively.

Step 6: The critical value is $\chi^2_{\alpha}$ with df = (r - 1)(c - 1), where r and c are the numbers of categories of the two variables. Use the table to find the critical value.


With r = 4 and c = 3, df = (4 - 1)(3 - 1) = 6, and from the table $\chi^2_{0.05} = 12.592$.

Step 7: If the value of the test statistic falls in the rejection region, or the p-value is less than or equal to $\alpha$, reject H0; otherwise, do not reject H0.

The test statistic $\chi^2 = 94.269 > 12.592$ falls in the rejection region, so we reject H0. The p-value is also less than 0.05, so we reject H0.

Step 8: Interpret the results of the hypothesis test.

At the 5% significance level, the data provide sufficient evidence to conclude that there is an association between marital status and alcohol consumption.
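The expected frequencies and the chi-square statistic for the marital status example can be computed from the observed counts alone. The following Python sketch applies E = (row total × column total) / n to every cell and then sums (O - E)² / E; the critical value 12.592 is copied from a chi-square table (df = 6, α = 0.05).

```python
# Observed counts: rows = Single, Married, Widowed, Divorced;
# columns = Abstain, 1-60, Over 60 drinks per month.
observed = [
    [67, 213, 74],
    [411, 633, 129],
    [85, 51, 7],
    [27, 60, 15],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency for each cell under independence.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

chi2_crit = 12.592   # table value, df = 6, alpha = 0.05
print(f"chi-square = {chi2:.3f}, df = {df}, reject H0: {chi2 > chi2_crit}")
```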

Use SPSS program

Use the SPSS program to perform the hypothesis test in the previous example.

Step 1: Enter the data as shown below.

SPSS output

Step 3: The results are shown below.


MARTIAL * DRINK Crosstabulation

                              DRINK
MARTIAL                       abstain   1-60    over 60   Total
single     Count              67        213     74        354
           Expected Count     117.9     191.2   44.9      354.0
married    Count              411       633     129       1173
           Expected Count     390.6     633.5   148.9     1173.0
widowed    Count              85        51      7         143
           Expected Count     47.6      77.2    18.2      143.0
divorced   Count              27        60      15        102
           Expected Count     34.0      55.1    13.0      102.0
Total      Count              590       957     225       1772
           Expected Count     590.0     957.0   225.0     1772.0

Chi-Square Tests

                               Value        df   Asymp. Sig. (2-sided)
Pearson Chi-Square             94.269 (a)   6    .000
Likelihood Ratio               93.096       6    .000
Linear-by-Linear Association   32.265       1    .000
N of Valid Cases               1772

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 12.95.

[Figure: clustered bar chart of Count by MARITAL (Divorced, Widowed, Married, Single), with one bar per DRINKS category: Abstain, 1-60, Over 60]


Goodness of Fit Test (A Multinomial Population)

1. Set up the null and alternative hypotheses.

2. Select a random sample and record the observed frequency, $f_i$, for each of the k categories.

3. Assuming H0 is true, compute the expected frequency, $e_i$, in each category by multiplying the category probability by the sample size.

4. Compute the value of the test statistic.

$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$$

5. Reject H0 if $\chi^2 > \chi^2_{\alpha}$ (where $\alpha$ is the significance level and there are k - 1 degrees of freedom).

Example: Finger Lakes Homes (A)

Finger Lakes Homes manufactures four models of prefabricated homes: a two-story colonial, a ranch, a split-level, and an A-frame. To help in production planning, management would like to determine if previous customer purchases indicate that there is a preference in the style selected.

The number of homes sold of each model for 100 sales over the past two years is shown below.

Model     Colonial   Ranch   Split-Level   A-Frame
# Sold    30         20      35            15

• Notation

pC = population proportion that purchase a colonial

pR = population proportion that purchase a ranch

pS = population proportion that purchase a split-level

pA = population proportion that purchase an A-frame

• Hypotheses

H0: pC = pR = pS = pA = .25

Ha: The population proportions are not pC = .25, pR = .25, pS = .25, and pA = .25

• Expected Frequencies

e1 = .25(100) = 25, e2 = .25(100) = 25, e3 = .25(100) = 25, e4 = .25(100) = 25

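• Test Statistic

$$\chi^2 = \frac{(30 - 25)^2}{25} + \frac{(20 - 25)^2}{25} + \frac{(35 - 25)^2}{25} + \frac{(15 - 25)^2}{25} = 1 + 1 + 4 + 4 = 10$$

• Conclusion

With $\alpha = .05$ and k - 1 = 3 degrees of freedom, $\chi^2_{.05} = 7.81$. Since $\chi^2 = 10 > 7.81$, we reject the assumption that there is no home style preference at the .05 level of significance.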
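The goodness-of-fit calculation for the Finger Lakes Homes data is short enough to verify in a few lines. The sketch below computes the statistic from the observed sales and, as an extra, the p-value using a closed form for the chi-square upper tail that holds only for 3 degrees of freedom; the critical value 7.81 is taken from the table.

```python
import math

observed = {"Colonial": 30, "Ranch": 20, "Split-Level": 35, "A-Frame": 15}
n = sum(observed.values())
p0 = 0.25                       # hypothesized proportion for every model
expected = n * p0               # 25 homes expected per model under H0

chi2 = sum((f - expected) ** 2 / expected for f in observed.values())
df = len(observed) - 1

# Upper-tail probability of a chi-square variable with exactly 3 degrees of freedom:
# P(X > x) = erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2)
p_value = math.erfc(math.sqrt(chi2 / 2)) + math.sqrt(2 * chi2 / math.pi) * math.exp(-chi2 / 2)

chi2_crit = 7.81                # table value, df = 3, alpha = .05
print(f"chi-square = {chi2:.1f}, df = {df}, p-value = {p_value:.4f}")
print("reject H0" if chi2 > chi2_crit else "do not reject H0")
```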

Use SPSS program

1. Enter the data as shown below

Click OK.

SPSS output

Chi-Square TestMODEL

30 25.0 5.020 25.0 -5.035 25.0 10.015 25.0 -10.0

100

ColonialRanchSplit-LevelA-FrameTotal

Observed N Expected N Residual

Test Statistics

10.0003

.019

Chi-Square a

dfAsymp. Sig.

MODEL

0 cells (.0%) have expected frequencies less than5. The minimum expected cell frequency is 25.0.

a.

P-Value = p( >10)= 0.0186 < 0.05 so we reject Ho
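The hand calculation and the SPSS p-value for the Finger Lakes Homes example can be checked with a short script. This is only a sketch: the variable names are illustrative, the critical value 7.81 is taken from the text, and the p-value uses the closed-form upper-tail probability of a chi-square distribution that holds specifically for 3 degrees of freedom (it is not a general routine).

```python
import math

# Observed sales per model (from the example above)
observed = {"Colonial": 30, "Ranch": 20, "Split-Level": 35, "A-Frame": 15}

n = sum(observed.values())          # sample size: 100
p0 = 0.25                           # category probability under H0
expected = {model: p0 * n for model in observed}

# Test statistic: sum over categories of (f_i - e_i)^2 / e_i
chi_square = sum(
    (observed[m] - expected[m]) ** 2 / expected[m] for m in observed
)

df = len(observed) - 1              # k - 1 = 3
critical_value = 7.81               # chi-square table value at alpha = .05, df = 3 (from the text)
reject_h0 = chi_square > critical_value

# Upper-tail probability, valid ONLY for df = 3 (closed form; assumption of this sketch)
p_value = math.erfc(math.sqrt(chi_square / 2)) + math.sqrt(
    2 * chi_square / math.pi
) * math.exp(-chi_square / 2)

print(f"chi-square = {chi_square:.3f}, df = {df}, p-value = {p_value:.4f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

The statistic and p-value agree with the Chi-Square and Asymp. Sig. entries of the SPSS Test Statistics table.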

Inferential methods in regression and correlation



* Correlations

Correlation Coefficient, r

The quantity r, called the linear correlation coefficient, measures the strength and the direction of a linear relationship between two variables. The linear correlation coefficient is sometimes referred to as the Pearson product moment correlation coefficient in honor of its developer, Karl Pearson. The mathematical formula for computing r is:

$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$

where n is the number of pairs of data.

The value of r is such that $-1 \le r \le +1$. The + and - signs are used for positive linear correlations and negative linear correlations, respectively.

Positive correlation: If x and y have a strong positive linear correlation, r is close to +1. An r value of exactly +1 indicates a perfect positive fit. Positive values indicate a relationship between the x and y variables such that as values for x increase, values for y also increase.

Negative correlation: If x and y have a strong negative linear correlation, r is close to -1. An r value of exactly -1 indicates a perfect negative fit. Negative values indicate a relationship between x and y such that as values for x increase, values for y decrease.

No correlation: If there is no linear correlation or a weak linear correlation, r is close to 0. A value near zero means that there is little or no linear relationship between the two variables; the relationship may be random or nonlinear.

Note that r is a dimensionless quantity; that is, it does not depend on the units employed. A perfect correlation of ± 1 occurs only when the data points all lie exactly on a straight line. If r = +1, the slope of this line is positive. If r = -1, the slope of this line is negative.

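Pearson correlation test

The test uses the t-distribution for a correlation test:

$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$

with df = n - 2. The null hypothesis $H_0: \rho = 0$ is tested against one of the alternatives

$H_a: \rho \ne 0$ or $H_a: \rho < 0$ or $H_a: \rho > 0$

EXAMPLE: The table below shows the age and price data for a sample of 11 Orions. At the 5% significance level, do the data provide sufficient evidence to conclude that the age and price of Orions are negatively linearly correlated?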

car   Age   Price        car   Age   Price
 1     5     85           7     6     66
 2     4    103           8     6     95
 3     6     70           9     2    169
 4     5     82          10     7     70
 5     5     89          11     7     48
 6     5     98

Solution

First: by hand

1. The null and alternative hypotheses

$H_0: \rho = 0$ versus $H_a: \rho < 0$

2. Calculate the data as in the table below

age (x)   price (y)   x-square   y-square    x*y
   5         85          25        7225      425
   4        103          16       10609      412
   6         70          36        4900      420
   5         82          25        6724      410
   5         89          25        7921      445
   5         98          25        9604      490
   6         66          36        4356      396
   6         95          36        9025      570
   2        169           4       28561      338
   7         70          49        4900      490
   7         48          49        2304      336
Tot=58    Tot=975     Tot=326   Tot=96129  Tot=4732

3. Substitute in the formula

$$r = \frac{11(4732) - (58)(975)}{\sqrt{\left[11(326) - 58^2\right]\left[11(96129) - 975^2\right]}} = \frac{-4498}{\sqrt{(222)(106794)}} = -0.924$$

4. Calculate the test statistic

$$t = \frac{-0.924}{\sqrt{\dfrac{1 - (-0.924)^2}{11 - 2}}} = -7.249$$

5. Find the critical value from the t distribution table at $\alpha = 0.05$ and degrees of freedom = 11 - 2 = 9 in one tail (left tail)

t critical = -1.833

6. Decision: the test statistic lies in the rejection region (t = -7.249 < t critical = -1.833), and the p-value = P(t < -7.249) = 0.00002 < 0.05

Interpret results

So we reject the null hypothesis: at the 5% significance level, the data provide sufficient evidence to conclude that the age and price of Orions are negatively linearly correlated.
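The column totals, r and the t statistic from the hand calculation can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative. Because it carries r at full precision rather than rounding to -0.924 first, the t value comes out as about -7.237 (the figure SPSS reports) instead of the hand-rounded -7.249.

```python
import math

# Age and price for the 11 Orions (from the table above)
age = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]
price = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]

n = len(age)
sum_x = sum(age)
sum_y = sum(price)
sum_x2 = sum(x * x for x in age)
sum_y2 = sum(y * y for y in price)
sum_xy = sum(x * y for x, y in zip(age, price))

# Computational formula for the linear correlation coefficient
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
t = r / math.sqrt((1 - r ** 2) / (n - 2))

t_critical = -1.833                 # left-tail table value, alpha = 0.05, df = 9
reject_h0 = t < t_critical

print(f"sums: {sum_x}, {sum_y}, {sum_x2}, {sum_y2}, {sum_xy}")
print(f"r = {r:.3f}, t = {t:.3f}, reject H0: {reject_h0}")
```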

Second: by using the SPSS procedure

1. Enter the data and draw a scatter plot


[Figure: Linear Regression scatter plot of price against age, with fitted line price = 195.47 - 20.26 * age, R-Square = 0.85]

2. Find the Pearson correlation by using SPSS as follows


SPSS Outputs

Correlations

                                 AGE        PRICE
AGE     Pearson Correlation      1          -.924**
        Sig. (2-tailed)          .           .000
        N                        11          11
PRICE   Pearson Correlation      -.924**     1
        Sig. (2-tailed)          .000        .
        N                        11          11

**. Correlation is significant at the 0.01 level (2-tailed).

Critical value of t = -1.833

The value of the test statistic falls in the rejection region, and the p-value = 0.0000244 < 0.05, so we reject H0.

Interpret results

At the 5% significance level, the data provide sufficient evidence to conclude that the age and price of Orions are negatively linearly correlated. Prices for Orions tend to decrease linearly with increasing age.

Spearman correlation

Spearman's Rank Order Correlation using SPSS

Objectives

The Spearman Rank Order Correlation coefficient is a non-parametric measure of the strength and direction of the association that exists between two variables measured on at least an ordinal scale. It is denoted by the symbol rs (or the Greek letter $\rho$, pronounced rho). The test is used either for ordinal variables or for interval data that have failed the assumptions necessary for conducting Pearson's product-moment correlation.

Assumptions

• Variables are measured on an ordinal, interval or ratio scale.
• Variables need NOT be normally distributed.
• This type of correlation is NOT very sensitive to outliers.

Example

A teacher is interested in whether the students who do best in English are also the best performers in Maths (assessed by exam). She records the scores of her 10 students as they performed in end-of-year examinations for both English and Maths.

English   56   75   45   71   62   64   58   80   76   61
Maths     66   70   40   60   65   56   59   77   67   63

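Hypotheses:

$H_0: \rho_s = 0$ (there is no association between the English and Maths ranks) versus $H_a: \rho_s \ne 0$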

First, create a table with the following columns:

English (mark)   Maths (mark)   Rank (English)   Rank (Maths)    d    d²
      56              66               9               4          5    25
      75              70               3               2          1     1
      45              40              10              10          0     0
      71              60               4               7          3     9
      62              65               6               5          1     1
      64              56               5               9          4    16
      58              59               8               8          0     0
      80              77               1               1          0     0
      76              67               2               3          1     1
      61              63               7               6          1     1

Where d = difference between ranks (shown without sign) and d² = difference squared.

We then calculate the following:

$$\sum d_i^2 = 25 + 1 + 0 + 9 + 1 + 16 + 0 + 0 + 1 + 1 = 54$$

We then substitute this into the main equation with the other information as follows:

$$r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 54}{10(10^2 - 1)} = 1 - \frac{324}{990} = 0.67$$

as n = 10. Hence, we have an rs of 0.67. This indicates a strong positive relationship between the ranks individuals obtained in the maths and English exams. That is, the higher you ranked in maths, the higher you ranked in English also, and vice versa.
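The ranking and the rank-difference formula can be sketched in Python as follows. The marks are those listed in the ranking table (with 62 for the fifth student's English mark); the helper `rank_descending` is a hypothetical name introduced here for illustration, and it assigns average ranks to tied values. Note that the simple formula 1 - 6Σd²/(n(n² - 1)) is exact only when there are no ties, which is the case for these marks.

```python
# Marks as listed in the ranking table (assumption: English mark 62 for student 5)
english = [56, 75, 45, 71, 62, 64, 58, 80, 76, 61]
maths = [66, 70, 40, 60, 65, 56, 59, 77, 67, 63]


def rank_descending(values):
    """Rank values so the highest gets rank 1; ties share the average rank."""
    ordered = sorted(values, reverse=True)
    ranks = []
    for v in values:
        first_position = ordered.index(v) + 1   # best rank held by this value
        ties = ordered.count(v)                 # how many share the value
        ranks.append(first_position + (ties - 1) / 2)
    return ranks


rank_english = rank_descending(english)
rank_maths = rank_descending(maths)

# Squared rank differences and their sum
d_squared = [(re - rm) ** 2 for re, rm in zip(rank_english, rank_maths)]
sum_d2 = sum(d_squared)

n = len(english)
r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print("Rank (English):", rank_english)
print("Rank (Maths):  ", rank_maths)
print(f"sum of d^2 = {sum_d2}, r_s = {r_s:.3f}")
```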

How do you report a Spearman's correlation?

How you report a Spearman's correlation coefficient depends on whether or not you have determined the statistical significance of the coefficient. If you have simply run the Spearman correlation without any statistical significance tests, then you are able to simply state the value of the coefficient as shown below:

rs = 0.67

However, if you have also run statistical significance tests, then you need to include some more information:

the critical value of rs at the chosen significance level, where N = number of pairwise cases, from the Spearman rank table

Decision

rs calculated (= 0.67) > rs critical (the critical value from the table)

So we reject H0

Conclusion:

There is a positive relationship between the ranks individuals obtained in the maths and English exams. That is, the higher you ranked in maths, the higher you ranked in English also, and vice versa.

Note: when the number of pairwise cases is large (> 30), we can use the z distribution, and the test statistic is:

$$z = r_s\sqrt{n - 1}$$

Applying this formula to the example for illustration:

z = 0.67 * sqrt(10 - 1) = 2.01

z critical = 1.96 from the z-table

z calculated = 2.01 > z critical = 1.96, so we reject H0

This again indicates a positive relationship between the ranks individuals obtained in the maths and English exams.

Test Procedure in SPSS

1. Click Analyze > Correlate > Bivariate... on the menu system.

Published with written permission from SPSS Inc, an IBM Company.

2. Transfer the variables "English_Mark" and "Maths_Mark" into the "Variables" box by dragging and dropping them or by clicking the arrow button.

3. Make sure that you uncheck the Pearson tick box (it is selected by default in SPSS) and check the Spearman tick box under the "Correlation Coefficients" group.

4. Click the OK button.

Output SPSS

You will be presented with a table in the output viewer under the title "Correlations" as below:

Correlations

                                                        ENGLISH   MATH
Spearman's rho   ENGLISH   Correlation Coefficient      1.000     .673*
                           Sig. (2-tailed)              .         .033
                           N                            10        10
                 MATH      Correlation Coefficient      .673*     1.000
                           Sig. (2-tailed)              .033      .
                           N                            10        10

*. Correlation is significant at the .05 level (2-tailed).


The results are presented in a matrix such that, as can be seen, the correlations are replicated. Nevertheless, the table presents Spearman's Rank Order Correlation, its significance value and the sample size that the calculation was based on. In this example, we can see that Spearman's correlation coefficient, rs, is 0.673 and that this is statistically significant (P = 0.033).

Reporting the Output

In our example you might present the results as follows: A Spearman's Rank Order correlation was run to determine the relationship between 10 students' English and maths exam marks. There was a strong, positive correlation between English and maths marks, which was statistically significant (rs = .673, P = .033).

* Regression inference:

Assumptions for regression inferences

1- Population regression line: there are constants $\beta_0$ and $\beta_1$ such that, for each value x of the predictor variable, the conditional mean of the response variable is $\beta_0 + \beta_1 x$.

2- Equal standard deviations (homoscedasticity): the conditional standard deviations of the response variable are the same for all values of the predictor variable.

3- Normal distributions: for each value of the predictor variable, the conditional distribution of the response variable is a normal distribution.

4- Independent observations: the observations of the response variable are independent of one another.

Hypothesis test for the slope of the population regression line

Example: For the data in the table below, at the 5% significance level, do the data provide sufficient evidence to conclude that age is useful as a linear predictor of price for Orions?

car   Age   Price        car   Age   Price
 1     5     85           7     6     66
 2     4    103           8     6     95
 3     6     70           9     2    169
 4     5     82          10     7     70
 5     5     89          11     7     48
 6     5     98

Solution:

$H_0: \beta_1 = 0$ (age is not useful as a linear predictor of price for Orions)

$H_a: \beta_1 \ne 0$ (age is useful as a linear predictor of price for Orions)

Age: independent (explanatory) variable
Price: dependent (response) variable

Test statistic

$$t = \frac{b_1}{s_e / \sqrt{S_{xx}}}, \qquad df = n - 2$$

(where $s_e$ is the Std. Error of the Estimate), where $s_e = \sqrt{SSE/(n - 2)}$, SSE is the sum of squares of errors, and $S_{xx} = \sum (x_i - \bar{x})^2$.

The critical values

$\pm t_{\alpha/2} = \pm t_{0.025} = \pm 2.262$ with df = 11 - 2 = 9

The value of the test statistic (t = -7.237) falls in the rejection region, and the p-value = 0.0000488 < 0.05, so we reject H0.

Interpret the result of the hypothesis test: at the 5% significance level, the data provide sufficient evidence to conclude that the slope of the population regression line is not 0 and hence that age is useful as a linear predictor of price for Orions.
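The slope test can be worked through numerically with the sketch below, which uses only the Python standard library. The variable names are illustrative and the critical value 2.262 is the t-table value for 9 degrees of freedom at a two-tailed 5% level. The computed slope, intercept, standard error of the estimate and t value should agree with the B, Std. Error of the Estimate and t entries in the SPSS regression output.

```python
import math

# Age (predictor) and price (response) for the 11 Orions
age = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]
price = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(price) / n

# Sums of squares and cross-products about the means
s_xx = sum((x - x_bar) ** 2 for x in age)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, price))

# Least-squares slope and intercept
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Sum of squared errors and standard error of the estimate
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(age, price))
s_e = math.sqrt(sse / (n - 2))

# Test statistic for H0: beta_1 = 0
std_error_b1 = s_e / math.sqrt(s_xx)
t = b1 / std_error_b1

t_critical = 2.262                  # two-tailed, alpha = 0.05, df = 9
reject_h0 = abs(t) > t_critical

print(f"price = {b0:.3f} + ({b1:.3f}) * age")
print(f"s_e = {s_e:.3f}, SE(b1) = {std_error_b1:.3f}, t = {t:.3f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```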

SPSS procedure:



Regression



Variables Entered/Removed (b)

Model   Variables Entered   Variables Removed   Method
1       AGE (a)             .                   Enter

a. All requested variables entered.
b. Dependent Variable: PRICE

Model Summary

Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .924 (a)   .853       .837                12.577

a. Predictors: (Constant), AGE

ANOVA (b)

Model 1       Sum of Squares   df   Mean Square   F        Sig.
Regression    8285.014          1   8285.014      52.380   .000 (a)
Residual      1423.532          9    158.170
Total         9708.545         10

a. Predictors: (Constant), AGE
b. Dependent Variable: PRICE

Coefficients (a)

              Unstandardized Coefficients   Standardized Coefficients
Model 1       B          Std. Error         Beta                        t        Sig.
(Constant)    195.468    15.240                                         12.826   .000
AGE           -20.261     2.800             -.924                       -7.237   .000

a. Dependent Variable: PRICE

Determine coefficients

REGRESSION LINE: PRICE = 195.468 - 20.261 * AGE

COEFFICIENT OF DETERMINATION = 0.853
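The coefficient of determination, the adjusted R Square and the F ratio all follow from the sums of squares in the ANOVA table. The sketch below simply recomputes them from the SPSS figures quoted above; the variable names are illustrative.

```python
# Sums of squares and sample size taken from the SPSS ANOVA table
ss_regression = 8285.014
ss_residual = 1423.532
ss_total = 9708.545
n = 11                      # number of cars
k = 1                       # number of predictors (AGE)

# Coefficient of determination: share of variation in PRICE explained by AGE
r_square = ss_regression / ss_total

# Adjusted R Square penalises for the number of predictors
adjusted_r_square = 1 - (1 - r_square) * (n - 1) / (n - k - 1)

# F ratio = mean square regression / mean square residual
ms_regression = ss_regression / k
ms_residual = ss_residual / (n - k - 1)
f_ratio = ms_regression / ms_residual

print(f"R Square = {r_square:.3f}")
print(f"Adjusted R Square = {adjusted_r_square:.3f}")
print(f"F = {f_ratio:.3f}")
```

With a single predictor, F equals the square of the slope's t statistic, which is why the ANOVA test and the slope test lead to the same decision here.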



EXAMPLE

SOLUTION

[Figures: scatter plots of PRICE (vertical axis, 100 to 280) against AGE (horizontal axis, 0 to 7)]

Correlations

                                 AGE        PRICE
AGE     Pearson Correlation      1          -.968**
        Sig. (2-tailed)          .           .000
        N                        10          10
PRICE   Pearson Correlation      -.968**     1
        Sig. (2-tailed)          .000        .
        N                        10          10

**. Correlation is significant at the 0.01 level (2-tailed).

Coefficients (a)

Model 1       B          Std. Error   Beta     t         Sig.
(Constant)    291.602    11.433                25.506    .000
AGE           -27.903     2.563       -.968    -10.887   .000


[Figure: scatter plot of PRICE (vertical axis, 100 to 280) against AGE (horizontal axis, 0 to 7)]


Model Summary

Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .968 (a)   .937       .929                14.247


[Figure: Standardized Residual (vertical axis, -1.5 to 2.0) plotted against AGE (horizontal axis, 0 to 7)]

[Figure: Unstandardized Residual (vertical axis, -30 to 30) plotted against AGE (horizontal axis, 0 to 7)]


[Figure: Normal P-P Plot of Regression Standardized Residual; Dependent Variable: PRICE; Expected Cum Prob (vertical axis) against Observed Cum Prob (horizontal axis)]

