part one - islamic university of...
Doing Data Analysis With SPSS Dr. Nafez M. Barakat
Doing Data Analysis
With SPSS
Dr: Nafez M. Barakat
2012-2013
How to run the SPSS program:
Click the Start button with the left mouse button, select Programs, and choose SPSS 14.0 for Windows from the list. It may take a little while before the program is ready for use.
To show or hide a toolbar: from the menu choose View > Toolbars.
[Figure: the SPSS Data Editor window, with labels marking the menu bar, toolbar, variables, cases, cell editor, row number, variable name, and active cell.]
Opening a Data File:
From the menu choose File > Open. In the Open File dialog box, select the file you want to open and click Open.
Saving Data Files:
From the menus choose File > Save As. Select a file type from the drop-down list (SPSS (*.sav)) and enter a file name for the new data file.
Basic Steps for Data Analysis:
Analyzing data with SPSS is easy. All you need to do is:
1. Get data into SPSS.
2. Select a procedure.
3. Select the variables for the analysis.
4. Run the procedure and look at the results.
Entering Data into the Data Editor:
Many of the features of the Data Editor are similar to those found in spreadsheet applications. There are, however, several important distinctions:
Rows are cases: each row represents a case or observation. For example, each individual respondent to a questionnaire is a case.
Columns are variables: each column represents a variable or characteristic being measured. For example, each item on a questionnaire is a variable.
Cells contain values: each cell contains a single value of a variable for a case. The cell is the intersection of the case and the variable.
The data file is rectangular: the dimensions of the data file are determined by the number of cases and the number of variables.
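The layout described above can be pictured outside SPSS as a small grid. The following Python sketch is only an illustration: the item names (item1, item2) and the answers are hypothetical, not taken from any SPSS file.

```python
# Hypothetical answers from three respondents to a two-item questionnaire.
variables = ["item1", "item2"]          # columns: one per variable
cases = [                               # rows: one per case
    [4, 2],
    [5, 3],
    [3, 3],
]

# The file is rectangular: every case holds one value for every variable.
assert all(len(case) == len(variables) for case in cases)

n_cases, n_variables = len(cases), len(variables)
cell = cases[1][variables.index("item2")]   # value of item2 for the second case
print(n_cases, n_variables, cell)
```

The dimensions of the grid are simply the number of cases by the number of variables, and a cell is addressed by naming one of each.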
Example: suppose a questionnaire contains the following questions:
Gender: male / female
Job category: clerical / custodial / manager
Salary: $ ……………..
Enter the data for the questions above in the SPSS Data Editor:
At the bottom of the Data Editor, click the Variable View tab. A different grid appears, with these column headings: Name, Type, Width, Decimals, Label, Values, Missing, Columns, Align, and Measure.
Under Name, enter the variable name gender for the first question, jobcat for the second question, and salary for the third question.
Rules for variable names: the name must begin with a letter; the remaining characters can be any letter, any digit, a period, or the symbols @, #, _, and $. Variable names cannot end with a period. Blanks and special characters (for example !, ?, ', and *) cannot be used. Each variable name must be unique; duplication is not allowed. Variable names are not case sensitive: the names gender, GENDER, and Gender are all identical in SPSS.
Reserved words in SPSS, such as not, and, or, …, are not allowed as variable names.
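The naming rules above can be written as a short check. This is a minimal Python sketch, not part of SPSS: the function name is_valid_name is hypothetical, letters are assumed to be plain A to Z, and the reserved-word set contains only the three words named in the text, so it is incomplete.

```python
import re

# Reserved words named in the text; the source list is partial ("not, and, or, ...").
RESERVED_WORDS = {"not", "and", "or"}

# First character: a letter. Remaining characters: letters, digits, period, @, #, _, $.
_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9.@#_$]*")


def is_valid_name(name, existing_names=()):
    """Return True if `name` satisfies the naming rules listed above."""
    if not _NAME_PATTERN.fullmatch(name):
        return False                      # bad first character, blank, or special character
    if name.endswith("."):
        return False                      # names cannot end with a period
    if name.lower() in RESERVED_WORDS:
        return False                      # reserved word
    # Names are not case sensitive, so compare in lower case to detect duplicates.
    taken = {n.lower() for n in existing_names}
    return name.lower() not in taken


print(is_valid_name("jobcat"))                 # follows every rule
print(is_valid_name("2salary"))                # starts with a digit
print(is_valid_name("GENDER", ["gender"]))     # duplicate, ignoring case
```

Any length limit that SPSS applies to names is not checked here, since the text does not state one.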
Define variable type:
Click the small gray button marked with three dots in the Type column to open the Variable Type dialog box.
The available data types are: numeric, comma, dot, scientific notation, date, dollar, custom currency, and string.
The custom currency formats CCA, CCB, CCC, CCD, and CCE are defined in the Currency tab of the Options dialog box, accessed from the Edit menu.
We select Numeric with a width of 8 digits and 0 decimal places for the variables jobcat and salary, and String for the variable gender.
You can also change the width and the number of decimal places in the columns named Width and Decimals.
Define Labels:
The Label column holds a descriptive variable label, which can be up to 250 characters long; these descriptive labels are displayed in the output. We enter Gender, Employment Category, and Current Salary for our three variables.
Coding Variables:
Click the small gray button marked with three dots in the Values column to open the Value Labels dialog box.
For the variable gender, type f in the Value box, type female in the Value Label box, and click Add. Then type m in the Value box, type male in the Value Label box, and click Add.
For the variable jobcat, type 1 in the Value box, type clerical in the Value Label box, and click Add. Then type 2 and custodial and click Add, and finally type 3 and manager and click Add.
The salary variable is quantitative, so it needs no value labels.
Define missing values:
Click the small gray button marked with three dots in the Missing column to open the Missing Values dialog box.
Define Missing Values marks the specified data values as user-missing; user-missing values are excluded from calculations.
You can enter up to three discrete (individual) missing values, a range of missing values, or a range plus one discrete value.
Ranges can only be specified for numeric variables. You cannot define missing values for long string variables.
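To see what "excluded from calculations" means in practice, here is a small Python sketch. The salary values and the missing-value code 9999 are hypothetical, chosen only for illustration.

```python
from statistics import mean

# Hypothetical salary column in which 9999 was chosen as a user-missing code.
salary = [27000, 9999, 31500, 24000, 9999, 45000]
user_missing = {9999}          # the dialog box allows up to three discrete values

# Keep only the valid values, then compute the statistic on them.
valid = [x for x in salary if x not in user_missing]

print(len(valid), "valid,", len(salary) - len(valid), "missing")
print(mean(valid))
```

Without the filter, the code 9999 would be treated as a real salary and would distort the mean.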
Define Column Format:
You can set the width of a column in the column named Columns; we choose a column width of 8 for the three variables. Then click in the Align column and choose Center.
Measurement of the variables:
Click in the column named Measure, and choose Nominal for the variable gender, Ordinal for the variable jobcat, and Scale for the variable salary.
You can now click the Data View tab and enter the data.
Inserting new cases:
To insert a new case between existing cases, select any cell in the case (row) below the position where you want to insert the new case. From the menu choose Data > Insert Cases.
A new row is inserted for the case, and all variables receive the system-missing value.
Inserting new variables:
To insert a new variable between existing variables, select any cell in the variable (column) to the right of the position where you want to insert the new variable. From the menu choose Data > Insert Variable.
A new variable is inserted with the system-missing value for all cases.
Moving or Removing Variables:
To remove a variable, click the variable name at the top of its column and from the menu choose Edit > Cut. To move the variable instead, cut it, select the column where you want it to appear, and choose Edit > Paste.
Go To Case:
To go to a case in the Data Editor, make the Data Editor the active window. From the menu choose Data > Go to Case, enter the Data Editor row number for the case, and click OK.
Search for Data:
To find a data value in the Data Editor, select any cell in the column of the variable you want to search. From the menu choose Edit > Find, enter the data value you want to find, and click Find Next.
Opening a Data File:
Choose File > Open > Data. The Open File dialog box appears.
Select the appropriate directory for your system and you will see a list of the available data files. Select the one named Employee data, then click Open.
Displaying Distributions with Graphs:
Frequency tables, bar charts, and pie charts are used for qualitative data or for small data sets. Stem plots, histograms, and time plots are used for quantitative variables.
Frequency Tables:
To create a frequency table for a categorical variable, follow these steps. Click Analyze > Descriptive Statistics > Frequencies; the Frequencies dialog box appears. Click gender in the left-hand list and move it to the right-hand list named Variable(s).
Click the Charts button; the Charts dialog box appears. Click Bar charts, under Chart Values click Frequencies, and finally click Continue.
We return to the Frequencies dialog box. Click OK, and the resulting SPSS for Windows output appears.
Frequencies
[DataSet1] D:\Program Files\SPSSEval\Employee data.sav
This table summarizes how many observations we have in the dataset; here there are 474 observations, all with valid data values, and there are no missing data.
Statistics
Gender
N    Valid      474
     Missing      0
In this table, 216 employees (45.6%) are female and 258 employees (54.4%) are male. The cumulative percent is the percentage for the current category plus the percentages of the categories above it.
Gender
                  Frequency   Percent   Valid Percent   Cumulative Percent
Valid   Female          216      45.6            45.6                 45.6
        Male            258      54.4            54.4                100.0
        Total           474     100.0           100.0
Q. What is the difference between the percent and the valid percent?
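The percent and cumulative percent columns can be reproduced by hand. The sketch below uses the two counts reported in the Gender table and assumes, as in this output, that there are no missing values; it is a plain Python illustration, not SPSS syntax.

```python
# Category counts as reported in the Gender frequency table.
counts = {"Female": 216, "Male": 258}
total = sum(counts.values())

cumulative = 0.0
rows = []
for category, frequency in counts.items():
    percent = 100 * frequency / total      # share of all cases
    cumulative += percent                  # running total of the percentages
    rows.append((category, frequency, round(percent, 1), round(cumulative, 1)))

for row in rows:
    print(row)
```

Each row holds the category, its frequency, its percent, and the cumulative percent, in the same order as the columns of the table.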
The graph below shows the bar chart for gender; the height of each bar represents the frequency of employees in that category.
[Figure: bar chart of Gender. Horizontal axis: Gender (Female, Male); vertical axis: Frequency, from 0 to 300.]
Editing bar charts:
Double-click the bar chart in the SPSS output to open the Chart Editor.
Choose Elements > Show Data Labels.
A dialog box appears. Click Text Style, enter 14 in the Preferred Size box, and then click Apply.
Click File > Close to close the Chart Editor; the following graph appears.
[Figure: bar chart of Gender with data labels. Horizontal axis: Gender (Female, Male); vertical axis: Frequency, from 0 to 300; bar labels: Female 216, Male 258.]
Comparing the salaries of females and males using bar charts:
Choose Graphs > Interactive > Bar...; the following dialog box appears.
Complete the dialog box as shown below, and click OK.
[Figure: interactive bar chart, bars show means. Horizontal axis: Gender (Female, Male); vertical axis: Current Salary; bar labels: Female $26,032, Male $41,442.]
Q. From the bar chart above: do males or females have the higher salary? Why?
Another method for comparing the mean salary of males and females:
Choose Graphs > Bar; the following dialog box appears.
Choose Simple and Summaries for groups of cases, then click the Define button.
Complete the dialog box as shown below. If you want to include a title or a footnote on the chart, click Titles, type the desired information, and click Continue to return to the original dialog box. Click OK.
A new window appears, containing the bar chart.
[Figure: bar chart of mean current salary by gender. Horizontal axis: Gender (Female, Male); vertical axis: Mean Current Salary, from $0 to $50,000; bar labels: Female $26,032, Male $41,442.]
One-Variable Descriptive Statistics:
The Frequencies procedure also provides statistics and a histogram for quantitative variables, as follows:
Choose Analyze > Descriptive Statistics > Frequencies; the following dialog box appears. Move Current Salary to the list named Variable(s).
Click the Statistics button; the following dialog box appears. Complete the dialog box as shown below, and click Continue to return to the original dialog box.
Percentile Values: values of salary (a quantitative variable) that divide the ordered data into groups so that a certain percentage is above and another percentage is below. Quartiles (the 25th, 50th, and 75th percentiles) divide the observations into four groups of equal size.
If you want an equal number of groups other than four, select Cut points for n equal groups. You can also specify individual percentiles (for example, the 77th percentile, the value below which 77% of the observations fall).
Central Tendency: statistics that describe the location of the distribution. You can select the mean, median, and mode, or the sum of all the values.
Dispersion: statistics that measure the amount of variation or spread in the data. You can select the Std. deviation (standard deviation), variance, range, minimum, maximum, or S.E. mean (standard error of the mean).
Distribution: statistics that describe the shape and symmetry of the distribution. You can select skewness or kurtosis. These statistics are displayed with their standard errors.
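The statistics listed above can be tried on a small sample with Python's standard statistics module. This is only a sketch: the seven values are hypothetical (not taken from Employee data.sav), and treating the "exclusive" quantile method as the counterpart of the percentile definition used by SPSS is an assumption, not something the handout states.

```python
import math
import statistics

# Hypothetical salaries (in thousands of dollars), not taken from Employee data.sav.
data = [21, 24, 24, 27, 30, 36, 57]
n = len(data)

# Central tendency
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Dispersion (sample statistics, dividing by n - 1)
variance = statistics.variance(data)
std_dev = statistics.stdev(data)
se_mean = std_dev / math.sqrt(n)          # standard error of the mean
value_range = max(data) - min(data)

# Quartiles: the "exclusive" method interpolates at position (n + 1) * p.
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

print(mean, median, mode)
print(variance, std_dev, se_mean, value_range)
print(q1, q2, q3, iqr)
```

Note how the single large value (57) pulls the mean above the median, while the median and the quartiles are unaffected by how large that value is.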
Click the Charts button, click Histograms, and check the box With normal curve. Click Continue to return to the original dialog box.
A histogram also has bars, but they are plotted along an equal-interval scale. The height of each bar is the count of values of the quantitative variable falling within the interval. The histogram shows the shape, center, and spread of the distribution. A normal curve superimposed on a histogram helps you judge whether the data are normally distributed.
Click OK to get the following results.
Frequencies
This table shows the following results:
Statistics
Current Salary
N                         Valid      474
                          Missing    0
Mean                                 $34,419.57
Std. Error of Mean                   $784.311
Median                               $28,875.00
Mode                                 $30,750
Std. Deviation                       $17,075.661
Variance                             291578214
Skewness                             2.125
Std. Error of Skewness               .112
Kurtosis                             5.378
Std. Error of Kurtosis               .224
Range                                $119,250
Minimum                              $15,750
Maximum                              $135,000
Percentiles               10         $21,000.00
                          20         $22,950.00
                          25         $24,000.00
                          30         $24,825.00
                          40         $26,700.00
                          50         $28,875.00
                          60         $30,750.00
                          70         $34,500.00
                          75         $37,162.50
                          77         $38,850.00
                          80         $41,100.00
                          84         $47,550.00
                          90         $59,700.00
Q1. Is the distribution symmetric, skewed to the right, or skewed to the left? Why?
Q2. Find the IQR (interquartile range = Q3 - Q1 = P75 - P25).
Q3. Would you prefer to use the range or the IQR to describe the dispersion of the data in this example, and why?
[Figure: histogram of Current Salary with normal curve. Horizontal axis: Current Salary, from $0 to $125,000; vertical axis: Frequency, from 0 to 120. Mean = $34,419.57, Std. Dev. = $17,075.661, N = 474.]
Q4. How would you describe the shape of this distribution?
Descriptives:
The Descriptives procedure displays univariate summary statistics for several variables in a single table and calculates standardized values (z scores). For each variable, Descriptives can compute the mean, standard deviation, minimum, maximum, variance, range, standard error of the mean, and skewness and kurtosis with their standard errors. The median, mode, quartiles, and percentiles are not available in Descriptives, but they can be obtained using the Frequencies procedure. Variables can be ordered by the size of their means (in ascending or descending order), alphabetically, or by the order in which you select the variables.
When z scores are saved, they are added to the data in the Data Editor and are available for SPSS charts, data listings, and analyses. When variables are recorded in different units (for example, salary, education, and experience), a z-score transformation places the variables on a common scale for easier visual comparison.
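A z score is the distance of a value from the mean, measured in standard deviations: z = (x - mean) / standard deviation. The Python sketch below applies this to the smallest and largest salaries, using the mean and standard deviation of Current Salary reported in this handout; the function name z_score is hypothetical.

```python
# Mean and standard deviation of Current Salary as reported in this handout.
MEAN_SALARY = 34419.57
SD_SALARY = 17075.661


def z_score(value, mean=MEAN_SALARY, sd=SD_SALARY):
    """Standardize a value: how many standard deviations it lies from the mean."""
    return (value - mean) / sd


# The largest and smallest salaries in the file, expressed as z scores.
z_max = z_score(135000)
z_min = z_score(15750)
print(round(z_max, 2), round(z_min, 2))
```

A value equal to the mean has a z score of 0, and the sign shows whether a value lies above or below the mean.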
Example: Open Employee data.sav, find the descriptives for current salary, and discuss the results.
To obtain descriptive statistics:
From the menus choose Analyze > Descriptive Statistics > Descriptives... The Descriptives dialog box appears. Move Current Salary to the list named Variable(s), and check the box "Save standardized values as variables", as shown below.
Optionally, you can click Options to choose additional statistics and the display order, as shown in the Descriptives: Options dialog box below.
Click Continue to return to the Descriptives dialog box, then click OK to get these results.
Descriptives
[DataSet1] D:\Program Files\SPSSEval\Employee data.sav
Descriptive Statistics
                                  Current Salary   Valid N (listwise)
N                Statistic        474              474
Range            Statistic        $119,250
Minimum          Statistic        $15,750
Maximum          Statistic        $135,000
Sum              Statistic        $16,314,875
Mean             Statistic        $34,419.57
                 Std. Error       $784.311
Std. Deviation   Statistic        $17,075.661
Skewness         Statistic        2.125
                 Std. Error       .112
Kurtosis         Statistic        5.378
                 Std. Error       .224
Note that, to obtain the layout shown above, we pivoted the table by dragging an icon from the column tray into the row tray and an icon from the row tray into the column tray.
EXPLORE:
The Explore procedure produces summary statistics either for all of your cases or separately for groups of cases. You can obtain:
Graphical displays, including boxplots, stem-and-leaf plots, and histograms, with outliers identified.
Frequency tables, percentiles, and other descriptive statistics.
Tests for normality, including probability plots and the Shapiro-Wilk and Lilliefors tests.
Levene's test for assessing equality of variances.
Robust estimates of location (M-estimators).
Reasons for using the Explore procedure:
There are many reasons for using the Explore procedure: data screening, outlier identification, description, assumption checking, and characterizing differences among subpopulations (groups of cases). Data screening may show that you have unusual values, extreme values, gaps in the data, or other peculiarities. Exploring the data may also indicate whether or not the distribution of the data is normal.
Example: open the file named Employee data.sav.
From the menus choose Analyze > Descriptive Statistics > Explore...
The Explore dialog box appears. Move the salary variable (a quantitative variable) to the list named Dependent List.
Click Statistics and select the robust estimators (M-estimators), outliers, percentiles, and descriptives with a 95% confidence interval for the mean; click Continue.
Click Plots and select histogram, stem-and-leaf, and normality plots with tests; click Continue.
Click OK to obtain the following results.
Explore
[DataSet1] D:\Program Files\SPSSEval\Employee data.sav
This table shows that we have 474 valid observations and no missing values.
Case Processing Summary
                                      Cases
                  Valid            Missing          Total
                  N     Percent    N    Percent     N     Percent
Current Salary    474   100.0%     0    .0%         474   100.0%
The Descriptives table shows several statistics.
95% confidence interval for the mean (lower and upper bound): a confidence interval is a range used to estimate the population mean.
5% trimmed mean: the sample mean computed after omitting the highest 5% and the lowest 5% of the sample data.
The other statistics were discussed in the previous sections.
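Both statistics can be sketched in a few lines of Python. Two simplifications are assumed here and are not taken from the handout: the normal critical value (about 1.96) stands in for the t value that SPSS presumably uses, so the bounds differ slightly from the SPSS output, and the trimmed mean drops whole cases from each tail, whereas SPSS may weight the cases at the trimming boundary. The function name trimmed_mean and the 20-value sample are hypothetical.

```python
from statistics import NormalDist, mean

# Values reported in this handout for Current Salary.
sample_mean = 34419.57
std_error = 784.311

# Critical value for a 95% interval (normal approximation, reasonable for n = 474).
critical = NormalDist().inv_cdf(0.975)
lower = sample_mean - critical * std_error
upper = sample_mean + critical * std_error
print(round(lower, 2), round(upper, 2))


def trimmed_mean(values, percent=5):
    """Mean after dropping the lowest and highest `percent` percent of the values."""
    ordered = sorted(values)
    k = len(ordered) * percent // 100          # whole cases dropped from each tail
    return mean(ordered[k:len(ordered) - k])


# Hypothetical sample of 20 values containing one extreme salary.
sample = list(range(20, 39)) + [135]
print(mean(sample), trimmed_mean(sample))
```

In the hypothetical sample the ordinary mean is pulled upward by the single extreme value, while the trimmed mean is not, which is why the trimmed mean is reported alongside the mean.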
Descriptives
                                                       Statistic      Std. Error
Current Salary   Mean                                  $34,419.57     $784.311
                 95% Confidence       Lower Bound      $32,878.40
                 Interval for Mean    Upper Bound      $35,960.73
                 5% Trimmed Mean                       $32,455.19
                 Median                                $28,875.00
                 Variance                              291578214
                 Std. Deviation                        $17,075.661
                 Minimum                               $15,750
                 Maximum                               $135,000
                 Range                                 $119,250
                 Interquartile Range                   $13,163
                 Skewness                              2.125          .112
                 Kurtosis                              5.378          .224
The M-Estimators table shows alternatives to the sample mean for estimating the center of location. The estimators differ in the weights they apply to cases. Huber's M-estimator, Tukey's biweight, Hampel's M-estimator, and Andrews' wave estimator are displayed.
M-Estimators
                  Huber's           Tukey's        Hampel's          Andrews'
                  M-Estimator(a)    Biweight(b)    M-Estimator(c)    Wave(d)
Current Salary    $29,434.84        $27,613.71     $28,739.16        $27,599.33
a. The weighting constant is 1.339.
b. The weighting constant is 4.685.
c. The weighting constants are 1.700, 3.400, and 8.500.
d. The weighting constant is 1.340*pi.
The Percentiles table displays the values of the 5th, 10th, 25th, 50th, 75th, 90th, and 95th percentiles.
Percentiles (Current Salary)
Percentile   Weighted Average (Definition 1)   Tukey's Hinges
5            $19,200.00
10           $21,000.00
25           $24,000.00                        $24,000.00
50           $28,875.00                        $28,875.00
75           $37,162.50                        $37,050.00
90           $59,700.00
95           $70,218.75
The Extreme Values table displays the five smallest values and the five largest values, with case labels.
Extreme Values (Current Salary)
                Case Number    Value
Highest   1     29             $135,000
          2     32             $110,625
          3     18             $103,750
          4     343            $103,500
          5     446            $100,000
Lowest    1     378            $15,750
          2     338            $15,900
          3     411            $16,200
          4     224            $16,200
          5     90             $16,200
The Tests of Normality table is produced together with the normal probability and detrended normal probability plots. It displays the Kolmogorov-Smirnov statistic, with a Lilliefors significance level for testing normality, and the Shapiro-Wilk statistic (computed here for all 474 cases). The significance value is reported as .000, which is less than 0.05, so we conclude that the distribution of the data is not normal.
Tests of Normality
                  Kolmogorov-Smirnov(a)         Shapiro-Wilk
                  Statistic   df    Sig.        Statistic   df    Sig.
Current Salary    .208        474   .000        .771        474   .000
a. Lilliefors Significance Correction
We have two plots, a histogram and a stem-and-leaf plot. We discussed the histogram previously, and now we discuss the stem-and-leaf plot.
[Figure: histogram of Current Salary. Horizontal axis: Current Salary, from $25,000 to $125,000; vertical axis: Frequency, from 0 to 120. Mean = $34,419.57, Std. Dev. = $17,075.661, N = 474.]
In this output there are three columns of information, representing the frequency, the stems, and the leaves. The stem width is 10000, each leaf represents 3 cases, and values greater than or equal to $56,750 are listed as extremes. For example, the first row represents 33 values from 10000 up to, but not including, 20000: its 11 leaves each stand for 3 cases, giving approximately (15000, 15000, 15000, 16000, 16000, 16000, …, 19000).
Q. How would you describe the shape of this distribution? Compare the histogram and the stem-and-leaf plot. What important differences, if any, do you see?
Current Salary Stem-and-Leaf Plot
 Frequency    Stem &  Leaf
    33.00        1 .  56667789999
   110.00        2 .  00001111111222222222333334444444444
   115.00        2 .  555555556666666667777777778888889999999
    80.00        3 .  000000000001111112233333444
    32.00        3 .  55556677889
    20.00        4 .  0001233&
    12.00        4 .  5678&
    12.00        5 .  0124&
     7.00        5 .  556
    53.00 Extremes    (>=56750)
 Stem width:   10000
 Each leaf:    3 case(s)
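How a stem-and-leaf display is built can be shown with a short Python function. This is a simplified sketch, not the SPSS algorithm: the function name stem_and_leaf is hypothetical, every case gets its own leaf, each stem appears on a single row (SPSS splits stems and lets one leaf stand for several cases when the file is large), and no extremes are separated out. The six salaries are illustrative values.

```python
from collections import defaultdict


def stem_and_leaf(values, stem_width=10000):
    """Return one text row per stem: frequency, stem, and the leaf digits."""
    leaf_unit = stem_width // 10
    rows = defaultdict(list)
    for value in sorted(values):
        stem = value // stem_width                  # e.g. 28875 -> stem 2
        leaf = (value % stem_width) // leaf_unit    # e.g. 28875 -> leaf 8
        rows[stem].append(str(leaf))
    return [
        f"{len(leaves):>5} {stem:>3} . {''.join(leaves)}"
        for stem, leaves in sorted(rows.items())
    ]


# A few illustrative salaries.
salaries = [15750, 16200, 21000, 24000, 28875, 30750]
plot = stem_and_leaf(salaries)
for row in plot:
    print(row)
```

Unlike a histogram, the display keeps the leading digits of every value, so the approximate data can be read back from the plot.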
The next plot is called a normal quantile plot (normal Q-Q plot). Data that follow a normal distribution produce a straight line on the normal quantile plot. Systematic deviations from a straight line indicate a non-normal distribution. Outliers appear as points that are far away from the overall pattern of the plot. We note that many points lie far from the straight line, indicating a non-normal distribution.
[Figure: Normal Q-Q Plot of Current Salary. Horizontal axis: Observed Value, from 0 to 120,000; vertical axis: Expected Normal, from -3 to 3.]
[Figure: Detrended Normal Q-Q Plot of Current Salary. Horizontal axis: Observed Value, from 0 to 125,000; vertical axis: Dev from Normal, from -2 to 4.]
The next plot is called a boxplot. It illustrates the minimum value, the first quartile (Q1 = P25), the second quartile (Q2 = P50, the median), the third quartile (Q3 = P75), the maximum value, and the extreme values (outliers). We must distinguish between minor outliers and major outliers: minor outliers, denoted by o in the plot, are observations more than 1.5 × IQR outside the central box; major outliers, denoted by * in the plot, are observations more than 3 × IQR outside the central box.
Notes:
If the line representing Q2 (the median) lies in the middle of the box, the distribution is roughly symmetric, as a normal distribution would be.
If the line representing Q2 (the median) lies nearer to Q1, the distribution is skewed to the right.
If the line representing Q2 (the median) lies nearer to Q3, the distribution is skewed to the left.
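The 1.5 × IQR and 3 × IQR rules can be sketched as follows. The quartiles are Tukey's hinges for Current Salary as reported in the Percentiles table of this handout; using the hinges as the edges of the box is an assumption, and the function name classify is hypothetical.

```python
# Tukey's hinges for Current Salary, as reported in the Percentiles table of this handout.
Q1 = 24000.00
Q3 = 37050.00
IQR = Q3 - Q1


def classify(value, q1=Q1, q3=Q3):
    """Label a value as 'major', 'minor', or 'none' using the 1.5 and 3 IQR rules."""
    iqr = q3 - q1
    distance = max(q1 - value, value - q3, 0)   # distance outside the central box
    if distance > 3 * iqr:
        return "major"
    if distance > 1.5 * iqr:
        return "minor"
    return "none"


print(IQR)
print(classify(135000), classify(60000), classify(30000))
```

With these hinges, values above 37,050 + 1.5 × 13,050 = 56,625 are flagged at least as minor outliers, and values above 37,050 + 3 × 13,050 = 76,200 as major outliers.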
We can compare two groups of data using the Explore procedure as follows:
Choose Analyze > Descriptive Statistics > Explore. Move the salary variable to the Dependent List and move gender to the Factor List, as shown in the Explore dialog box.
[Figure: boxplot of Current Salary, vertical axis from $0 to $125,000, annotated with the minimum value, Q1 = P25, Q2 = P50 = median, Q3 = P75, maximum value, minor outliers, and major outliers; outlying points are labelled with their case numbers.]
Click the Statistics and Plots buttons, choose the statistics and plots you want, and click OK to get the following results.
Explore
[DataSet1] D:\Program Files\SPSSEval\Employee data.sav
Gender
Case Processing Summary
                                               Cases
                           Valid            Missing          Total
                 Gender    N     Percent    N    Percent     N     Percent
Current Salary   Female    216   100.0%     0    .0%         216   100.0%
                 Male      258   100.0%     0    .0%         258   100.0%
Descriptives (Current Salary)
Gender   Statistic                                   Value          Std. Error
Female   Mean                                        $26,031.92     $514.258
         95% Confidence       Lower Bound            $25,018.29
         Interval for Mean    Upper Bound            $27,045.55
         5% Trimmed Mean                             $25,248.30
         Median                                      $24,300.00
         Variance                                    57123688
         Std. Deviation                              $7,558.021
         Minimum                                     $15,750
         Maximum                                     $58,125
         Range                                       $42,375
         Interquartile Range                         $7,013
         Skewness                                    1.863          .166
         Kurtosis                                    4.641          .330
Male     Mean                                        $41,441.78     $1,213.968
         95% Confidence       Lower Bound            $39,051.19
         Interval for Mean    Upper Bound            $43,832.37
         5% Trimmed Mean                             $39,445.87
         Median                                      $32,850.00
         Variance                                    380219336
         Std. Deviation                              $19,499.214
         Minimum                                     $19,650
         Maximum                                     $135,000
         Range                                       $115,350
         Interquartile Range                         $22,675
         Skewness                                    1.639          .152
         Kurtosis                                    2.780          .302
M-Estimators (Current Salary)
          Huber's           Tukey's        Hampel's          Andrews'
Gender    M-Estimator(a)    Biweight(b)    M-Estimator(c)    Wave(d)
Female    $24,606.10        $24,015.98     $24,419.25        $24,005.82
Male      $34,820.15        $31,779.76     $34,020.57        $31,732.27
a. The weighting constant is 1.339.
b. The weighting constant is 4.685.
c. The weighting constants are 1.700, 3.400, and 8.500.
d. The weighting constant is 1.340*pi.
Percentiles (Current Salary), Weighted Average (Definition 1)
Percentile    Female         Male
5             $16,950.00     $23,212.50
10            $18,660.00     $25,500.00
25            $21,487.50     $28,050.00
50            $24,300.00     $32,850.00
75            $28,500.00     $50,725.00
90            $34,890.00     $69,325.00
95            $40,912.50     $81,312.50
Extreme Values (Current Salary)
Gender                    Case Number    Value
Female   Highest   1      371            $58,125
                   2      348            $56,750
                   3      468            $55,750
                   4      240            $54,375
                   5      72             $54,000
         Lowest    1      378            $15,750
                   2      338            $15,900
                   3      411            $16,200
                   4      224            $16,200
                   5      90             $16,200
Male     Highest   1      29             $135,000
                   2      32             $110,625
                   3      18             $103,750
                   4      343            $103,500
                   5      446            $100,000
         Lowest    1      192            $19,650
                   2      372            $21,300
                   3      258            $21,300
                   4      22             $21,750
                   5      65             $21,900
Tests of Normality (Current Salary)
          Kolmogorov-Smirnov(a)         Shapiro-Wilk
Gender    Statistic   df    Sig.        Statistic   df    Sig.
Female    .146        216   .000        .842        216   .000
Male      .208        258   .000        .813        258   .000
a. Lilliefors Significance Correction
Current Salary
Histograms
[Figure: histogram of Current Salary for gender = Female. Horizontal axis: Current Salary, from $20,000 to $60,000; vertical axis: Frequency, from 0 to 40. Mean = $26,031.92, Std. Dev. = $7,558.021, N = 216.]
[Figure: histogram of Current Salary for gender = Male. Horizontal axis: Current Salary, from $25,000 to $125,000; vertical axis: Frequency, from 0 to 100. Mean = $41,441.78, Std. Dev. = $19,499.214, N = 258.]
Stem-and-Leaf Plots
Current Salary Stem-and-Leaf Plot for gender = Female
 Frequency    Stem &  Leaf
     2.00        1 .  55
    16.00        1 .  6666666666777777
    14.00        1 .  88889999999999
    31.00        2 .  0000000000000111111111111111111
    35.00        2 .  22222222222222222222233333333333333
    38.00        2 .  44444444444444444444444444555555555555
    22.00        2 .  6666666666677777777777
    17.00        2 .  88888899999999999
     7.00        3 .  0001111
     8.00        3 .  22233333
     8.00        3 .  44444555
     5.00        3 .  66777
     2.00        3 .  88
    11.00 Extremes    (>=40800)
 Stem width:   10000
 Each leaf:    1 case(s)
Current Salary Stem-and-Leaf Plot for gender = Male
 Frequency    Stem &  Leaf
     1.00        1 .  &
    18.00        2 .  11222344
    64.00        2 .  555556666666677777777888889999
    60.00        3 .  0000000000000011111112333344
    22.00        3 .  5555667899
    16.00        4 .  000023&
    11.00        4 .  55678&
     9.00        5 .  0124&
    10.00        5 .  5569&
     8.00        6 .  001&
    14.00        6 .  56688&
     6.00        7 .  03&
     5.00        7 .  58
     4.00        8 .  &&
    10.00 Extremes    (>=86250)
 Stem width:   10000
 Each leaf:    2 case(s)
 & denotes fractional leaves.
Normal Q-Q Plots
[Figure: Normal Q-Q Plot of Current Salary for gender = Female. Horizontal axis: Observed Value, from 0 to 60,000; vertical axis: Expected Normal, from -3 to 3.]
[Figure: Normal Q-Q Plot of Current Salary for gender = Male. Horizontal axis: Observed Value, from 0 to 120,000; vertical axis: Expected Normal, from -3 to 3.]
Detrended Normal Q-Q Plots
[Figure: Detrended Normal Q-Q Plot of Current Salary for gender = Female. Horizontal axis: Observed Value, from 20,000 to 60,000; vertical axis: Dev from Normal, from -0.5 to 2.0.]
[Figure: Detrended Normal Q-Q Plot of Current Salary for gender = Male. Horizontal axis: Observed Value, from 0 to 125,000; vertical axis: Dev from Normal, from -1.0 to 2.5.]
[Figure: boxplots of Current Salary by Gender (Female, Male). Vertical axis: Current Salary, from $0 to $125,000; outlying points are labelled with their case numbers.]
Q. Compare male and female salaries in each table or graph above with respect to the following statistics: mean, median, mode, skewness, normality of the distributions, outliers, and confidence interval for the mean.
Bivariate Correlations
The Bivariate Correlations procedure computes Pearson's correlation coefficient, Spearman's rho, and Kendall's tau-b, with their significance levels.
Before calculating a correlation coefficient, screen your data for outliers (which can cause misleading results) and for evidence of a linear relationship. Pearson's correlation coefficient is a measure of linear association. Two variables can be perfectly related, but if the relationship is not linear, Pearson's correlation coefficient is not an appropriate statistic for measuring their association.
Notes:
For quantitative, normally distributed variables, use Pearson's correlation coefficient.
If your data are not normally distributed or have ordered categories, use Spearman's rho or Kendall's tau-b, which measure the association between rank orders.
Correlation coefficients range from -1 (a perfect negative relationship) to +1 (a perfect positive relationship). A value of 0 indicates no linear relationship.
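The difference between a linear measure and a rank-order measure can be seen in a small Python sketch. The functions pearson, ranks, and spearman are written out here for illustration and are not SPSS routines; the data are hypothetical, with y equal to x squared so that the two variables are perfectly related but not linearly.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient, a measure of linear association."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                                  # extend over a run of ties
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: Pearson's coefficient computed on the rank orders."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: y = x squared is perfectly related to x, but not linearly.
x = [1, 2, 3, 4, 5]
y = [v ** 2 for v in x]
print(round(pearson(x, y), 3), spearman(x, y))
```

Because the rank orders of x and y agree exactly, Spearman's rho reaches its maximum, while Pearson's coefficient falls short of it for the same data.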
You can choose two-tailed or one-tailed probabilities. If the direction of association is known in advance, choose one-tailed; otherwise, choose two-tailed.
Correlation coefficients significant at the 0.05 level are identified with a single asterisk, and those significant at the 0.01 level are identified with two asterisks.
Example: open the file named Employee data.sav.
From the menus choose Analyze > Correlate > Bivariate.
The Bivariate Correlations dialog box appears. Move the variables (salary, prevexp, jobtime, educ) to the list named Variables.
Click Pearson, Kendall's tau-b, and Spearman, and click OK.
The following results appear.
Note that the variables are not normally distributed, so we should rely on the Spearman coefficient.
Correlations
[DataSet1] D:\Program Files\SPSSEval\Employee data.sav
Correlations
                                                              Previous
                                                 Current      Experience    Months        Educational
                                                 Salary       (months)      since Hire    Level (years)
Current Salary          Pearson Correlation      1            -.097*        .084          .661**
                        Sig. (2-tailed)                       .034          .067          .000
                        N                        474          474           474           474
Previous Experience     Pearson Correlation      -.097*       1             .003          -.252**
(months)                Sig. (2-tailed)          .034                       .948          .000
                        N                        474          474           474           474
Months since Hire       Pearson Correlation      .084         .003          1             .047
                        Sig. (2-tailed)          .067         .948                        .303
                        N                        474          474           474           474
Educational Level       Pearson Correlation      .661**       -.252**       .047          1
(years)                 Sig. (2-tailed)          .000         .000          .303
                        N                        474          474           474           474
*. Correlation is significant at the 0.05 level (2-tailed).
**. Correlation is significant at the 0.01 level (2-tailed).
Nonparametric Correlations
[DataSet1] D:\Program Files\SPSSEval\Employee data.sav
Correlations (Spearman's rho)
                                                              Previous
                                                 Current      Experience    Months        Educational
                                                 Salary       (months)      since Hire    Level (years)
Current Salary          Correlation Coefficient  1.000        -.023         .105*         .688**
                        Sig. (2-tailed)          .            .625          .023          .000
                        N                        474          474           474           474
Previous Experience     Correlation Coefficient  -.023        1.000         .008          -.121**
(months)                Sig. (2-tailed)          .625         .             .856          .008
                        N                        474          474           474           474
Months since Hire       Correlation Coefficient  .105*        .008          1.000         .051
                        Sig. (2-tailed)          .023         .856          .             .273
                        N                        474          474           474           474
Educational Level       Correlation Coefficient  .688**       -.121**       .051          1.000
(years)                 Sig. (2-tailed)          .000         .008          .273          .
                        N                        474          474           474           474
*. Correlation is significant at the 0.05 level (2-tailed).
**. Correlation is significant at the 0.01 level (2-tailed).
Q1. Is there a relationship between education and salary?
Q2. Is there a relationship between education and months since hire?
Q3. Is there a relationship between education and previous experience?
Note: we can draw a scatterplot of any two variables for two different groups as follows:
Choose Graphs > Interactive > Scatterplot...
Complete the Create Scatterplot dialog box as shown below.
Click Fit and choose Regression as the method.
Click OK. The results follow.
Interactive Graph
[DataSet1] D:\Program Files\SPSSEval\Employee data.sav
[Figure: Linear Regression. Scatterplots of Current Salary (vertical axis, $40,000 to $160,000) against Beginning Salary (horizontal axis, $20,000 to $80,000) in two panels, Female and Male. Fitted lines printed in the panels: Current Salary = 438.51 + 1.95 * salbegin, R-Square = 0.58; Current Salary = 4083.08 + 1.84 * salbegin, R-Square = 0.74.]
Selecting cases
This section demonstrates how SPSS can be used to select n cases from a finite population of interest using simple random sampling.
Example: Select the cases related to male students.
Choose from the menu:
Data > Select Cases...
Select If condition is satisfied.
Click If.
Select gender to paste it into the Expression area.
Select "=" on the calculator pad.
To complete the expression, type 1.
Click Continue.
Click OK in the Select Cases dialog box.
The figure below shows the 10 selected cases.
Remarks:
1. To select cases at random, follow this procedure:
Choose from the menu:
Data > Select Cases...
Random sample of cases
Sample
To select approximately 50% of the cases at random, choose Approximately and type 50 in the box.
To select exactly 5 cases from the first 10 cases, choose Exactly, type 5 in the first box, and type 10 in the second box.
Click OK in the Select Cases dialog box.
2. To select cases that fall within an inclusive case (row) range or date/time range, use the range option. Date and time ranges are only available for time-series data with defined date variables (Data menu, Define Dates). All values must be positive integers.
Choose from the menu:
Data > Select Cases...
Based on time or case range
Range
To select from the third case to the tenth case, type 3 in the box "First Case" and 10 in the box "Last Case".
Click Continue.
Click OK in the Select Cases dialog box.
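The three kinds of selection described in this section (by condition, by random sample, and by case range) can also be sketched outside SPSS. The following Python sketch is only an illustration under stated assumptions: the data set `cases`, the coding 1 = male, and the helper names are hypothetical and do not come from SPSS.

```python
import random

# Hypothetical data file: each dict is one case (row).
# Assumed coding for illustration only: gender 1 = male, 2 = female.
cases = [{"id": i, "gender": 1 if i % 2 else 2} for i in range(1, 21)]


def select_if(rows, condition):
    """Keep the cases for which the condition is satisfied."""
    return [row for row in rows if condition(row)]


def select_random_exact(rows, k, first_n, seed=0):
    """Draw exactly k cases at random from the first first_n cases."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return rng.sample(rows[:first_n], k)


def select_range(rows, first_case, last_case):
    """Keep the cases from first_case to last_case (1-based, inclusive)."""
    return rows[first_case - 1:last_case]


males = select_if(cases, lambda row: row["gender"] == 1)
five_of_ten = select_random_exact(cases, 5, 10)
third_to_tenth = select_range(cases, 3, 10)
```

The fixed seed is a design choice for repeatability; SPSS draws a new random sample each time unless the random number seed is set.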
Inference for Distributions
Hypothesis Test for One Population Mean
One sample t test for population mean
Definition: The One-Sample T Test compares the mean score of a sample to a known value. Usually, the known value is a population mean.
Definition: Null hypothesis and alternative hypothesis
Null hypothesis: a hypothesis to be tested. We use the symbol H0 to represent the null hypothesis.
Alternative hypothesis: a hypothesis to be considered as an alternative to the null hypothesis. We use the symbol Ha to represent the alternative hypothesis.
Hypotheses:
Null: There is no significant difference between the sample mean and the population mean.
Alternative: There is a significant difference between the sample mean and the population mean.
We present two step-by-step procedures for performing a one-sample t test. Procedure (I) covers the critical-value approach, and Procedure (II) covers the p-value approach.
One sample t test for population mean (critical-value approach)
Assumptions
1. Normal population or large sample
2. \(\sigma\) unknown
Step 1: The null hypothesis is \(H_0: \mu = \mu_0\) and the alternative hypothesis is \(H_a: \mu \neq \mu_0\) (two-tailed), or \(H_a: \mu < \mu_0\) (left-tailed), or \(H_a: \mu > \mu_0\) (right-tailed).
Step 2: Decide on the significance level, \(\alpha\).
Step 3: Compute the value of the test statistic
\[ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \]
Step 4: The critical value(s) are \(\pm t_{\alpha/2}\) (two-tailed), or \(-t_{\alpha}\) (left-tailed), or \(t_{\alpha}\) (right-tailed), with degrees of freedom df = n - 1.
Step 5: If the value of the t test statistic falls in the rejection region, reject H0; otherwise, fail to reject H0.
Step 6: Interpret the results of the hypothesis test.
Example: The table below shows the pH levels for 15 lakes. Test whether the mean pH of the lakes is greater than 6 at the 5% significance level (use the critical-value approach).
7.2 7.3 6.1 6.9 6.6 7.3 6.3 5.5
6.3 6.5 5.7 6.9 6.7 7.9 5.8
Solution:
Step 1: State the null and alternative hypotheses.
\(H_0: \mu = 6\) (mean pH level is not greater than 6)
\(H_a: \mu > 6\) (mean pH level is greater than 6)
Step 2: Decide on the significance level, \(\alpha = 0.05\).
Step 3: Compute the value of the test statistic. With \(\bar{x} = 6.6\), \(s = 0.672\), and \(n = 15\),
\[ t = \frac{6.6 - 6}{0.672 / \sqrt{15}} = 3.458 \]
Step 4: The critical value for a right-tailed test is \(t_{0.05} = 1.761\) (from the table) with df = 15 - 1 = 14.
Step 5: The value of the test statistic found in step 3, t = 3.458, falls in the rejection region. Consequently, we reject H0.
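The arithmetic of the lake pH example can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `one_sample_t` is hypothetical, and the critical value 1.761 is the table value for df = 14 quoted in the example, not something the code computes.

```python
import math
import statistics

# pH levels for the 15 lakes, as listed in the example.
ph = [7.2, 7.3, 6.1, 6.9, 6.6, 7.3, 6.3, 5.5,
      6.3, 6.5, 5.7, 6.9, 6.7, 7.9, 5.8]


def one_sample_t(sample, mu0):
    """Return the one-sample t statistic and its degrees of freedom."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)        # sample SD (n - 1 in the denominator)
    standard_error = sd / math.sqrt(n)
    return (mean - mu0) / standard_error, n - 1


t_stat, df = one_sample_t(ph, 6.0)

# Right-tailed test at the 5% level; 1.761 is taken from a t table (df = 14).
critical_value = 1.761
reject_h0 = t_stat > critical_value
```

With these data the statistic comes out at about 3.46, which agrees with the hand calculation up to rounding.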
One sample t test for population mean (p-value approach)
Assumptions
1. Normal population or large sample
2. \(\sigma\) unknown
Step 1: The null hypothesis is \(H_0: \mu = \mu_0\) and the alternative hypothesis is \(H_a: \mu \neq \mu_0\) (two-tailed), or \(H_a: \mu < \mu_0\) (left-tailed), or \(H_a: \mu > \mu_0\) (right-tailed).
Step 2: Decide on the significance level, \(\alpha\).
Step 3: Compute the value of the test statistic
\[ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \]
Step 4: Find the p-value by using the t table with degrees of freedom df = n - 1.
Step 5: If the p-value is less than or equal to \(\alpha\) (\(P \le \alpha\)), reject H0; otherwise, fail to reject H0.
Step 6: Interpret the results of the hypothesis test.
Example: The table below shows the pH levels for 15 lakes. Test whether the mean pH of the lakes is greater than 6 at the 5% significance level (use the p-value approach).
Solution:
Step 1: State the null and alternative hypotheses.
\(H_0: \mu = 6\) (mean pH level is not greater than 6)
\(H_a: \mu > 6\) (mean pH level is greater than 6)
Step 2: Decide on the significance level, \(\alpha = 0.05\).
Step 3: Compute the value of the test statistic
\[ t = \frac{6.6 - 6}{0.672 / \sqrt{15}} = 3.458 \]
Step 4: The p-value = P(t >= 3.458) = 0.00192 (with df = 15 - 1 = 14).
Step 5: The p-value is less than 0.05, so we reject H0.
Interval Estimation
Interval Estimation of a Population Mean: \(\sigma\) Unknown
Interval estimate:
\[ \bar{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}} \]
where
\(1 - \alpha\) = the confidence coefficient
\(t_{\alpha/2}\) = the t value providing an area of \(\alpha/2\) in the upper tail of a t distribution with n - 1 degrees of freedom
s = the sample standard deviation
n = the sample size
Example: Suppose that we have a sample of employee salaries with the following information: n = 10, mean = $550, standard deviation = $60. We want to estimate a 95% confidence interval for the mean, assuming the population is normally distributed.
Solution: At 95% confidence, \(1 - \alpha = .95\), \(\alpha = .05\), and \(\alpha/2 = .025\).
\(t_{.025}\) is based on n - 1 = 10 - 1 = 9 degrees of freedom.
In the t distribution table we see that \(t_{.025} = 2.262\).
Interval estimate of the population mean:
\[ \bar{x} \pm t_{.025} \frac{s}{\sqrt{n}} = 550 \pm 2.262 \cdot \frac{60}{\sqrt{10}} = 550 \pm 42.92 \]
or $507.08 to $592.92.
We are 95% confident that the mean salary of the population is between $507.08 and $592.92.
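The interval in the salary example can be reproduced with the short Python sketch below. The function name `t_interval` is hypothetical, and the value 2.262 is the table value for 9 degrees of freedom quoted in the example; the code does not look it up.

```python
import math


def t_interval(mean, sd, n, t_crit):
    """Confidence interval for a mean: mean +/- t_crit * sd / sqrt(n)."""
    margin = t_crit * sd / math.sqrt(n)
    return mean - margin, mean + margin


# n = 10, mean = $550, standard deviation = $60, t(.025, df = 9) = 2.262
lower, upper = t_interval(550.0, 60.0, 10, 2.262)
```

Rounded to cents, the bounds agree with the $507.08 and $592.92 reported above.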
Use SPSS program
Example 1: Use the SPSS program to perform the hypothesis test in the previous example.
Step 1: Enter the data as shown below.
Step 3: The result is shown below.

One-Sample Statistics
       N     Mean     Std. Deviation   Std. Error Mean
PH     15    6.600    .6719            .1735

One-Sample Test (Test Value = 6)
                                                          95% Confidence Interval
                                                          of the Difference
       t       df    Sig. (2-tailed)   Mean Difference    Lower     Upper
PH     3.459   14    .004              .600               .228      .972
Example 2: Use the SPSS file called training to test whether the mean training time equals 60 days, and find a 95% confidence interval for the population mean.
Solution:
Step 1: State the null and alternative hypotheses.
\(H_0: \mu = 60\) (mean training time equals 60 days)
\(H_a: \mu \neq 60\) (mean training time does not equal 60 days)
Step 2: Decide on the significance level, \(\alpha = 0.05\).
Step 3: Compute the value of the test statistic. From the output, t = -3.482.
Step 4: The p-value = 2 * P(t >= 3.482) = 0.004 (with df = 15 - 1 = 14).
Step 5: The value of the test statistic found in step 3, t = -3.482, falls in the rejection region (t < -2.145 or t > 2.145). Consequently, we reject H0.
Alternatively, the p-value = 0.004 < 0.05, so we reject H0.

SPSS output:
One-Sample Statistics
         N     Mean     Std. Deviation   Std. Error Mean
TIME     15    53.87    6.823            1.762

One-Sample Test (Test Value = 60)
                                                            95% Confidence Interval
                                                            of the Difference
         t        df    Sig. (2-tailed)   Mean Difference   Lower     Upper
TIME     -3.482   14    .004              -6.13             -9.91     -2.35

95% confidence interval for the population mean
SPSS OUTPUT
95% Confidence Interval for Mean   Lower Bound   50.09
                                   Upper Bound   57.65
= [50.09, 57.65]
NOTE that the test value of 60 is not included in the confidence interval, so we reject the null hypothesis.
NONPARAMETRIC TEST
Use Sign Test (Binomial Test)
[Figure: Histogram of Hemoglobin. Horizontal axis: Hemoglobin (6.0 to 14.0); vertical axis: Frequency (0 to 6). Std. Dev = 2.15, Mean = 8.9, N = 15.00.]

Tests of Normality
               Kolmogorov-Smirnov(a)          Shapiro-Wilk
               Statistic   df   Sig.          Statistic   df   Sig.
Hemoglobin     .236        15   .024          .858        15   .023
a. Lilliefors Significance Correction

Binomial Test
                       Category   N     Observed Prop.   Test Prop.   Exact Sig. (2-tailed)
Hemoglobin   Group 1   <= 8       4     .29              .50          .180
             Group 2   > 8        10    .71
             Total                14    1.00
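The exact two-tailed significance in the Binomial Test table can be checked by hand. The sketch below assumes the usual definition for a test proportion of 0.50, namely twice the probability of the smaller tail; the function name is hypothetical.

```python
from math import comb


def binomial_two_sided_p(k, n):
    """Exact two-sided p-value for a sign test with test proportion 0.5.

    Doubles the probability of the smaller tail and caps the result at 1.
    """
    smaller = min(k, n - k)
    tail = sum(comb(n, i) for i in range(smaller + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# 4 of the 14 hemoglobin values fall at or below 8, and 10 fall above it.
p_value = binomial_two_sided_p(4, 14)
```

The result is about .180, matching the Exact Sig. (2-tailed) column, so the null hypothesis is not rejected at the 5% level.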
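Inference for Two Population Means
The pooled t test for two population means (critical-value approach)
Assumptions
1. Independent samples
2. Normal populations or large samples
3. Equal population standard deviations
Step 1: The null hypothesis is \(H_0: \mu_1 = \mu_2\) and the alternative hypothesis is \(H_a: \mu_1 \neq \mu_2\) (two-tailed), or \(H_a: \mu_1 < \mu_2\) (left-tailed), or \(H_a: \mu_1 > \mu_2\) (right-tailed).
Step 2: Decide on the significance level, \(\alpha\).
Step 3: Compute the value of the test statistic
\[ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \]
where
\[ s_p = \sqrt{\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}} \]
Step 4: The critical value(s) are \(\pm t_{\alpha/2}\) (two-tailed), or \(-t_{\alpha}\) (left-tailed), or \(t_{\alpha}\) (right-tailed), with degrees of freedom df = n1 + n2 - 2.
Step 5: If the value of the t test statistic falls in the rejection region, reject H0; otherwise, fail to reject H0.
Step 6: Interpret the results of the hypothesis test.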
The pooled t test for two population means (p-value approach)
Assumptions
1. Independent samples
2. Normal populations or large samples
3. Equal population standard deviations
Step 1: The null hypothesis is \(H_0: \mu_1 = \mu_2\) and the alternative hypothesis is \(H_a: \mu_1 \neq \mu_2\) (two-tailed), or \(H_a: \mu_1 < \mu_2\) (left-tailed), or \(H_a: \mu_1 > \mu_2\) (right-tailed).
Step 2: Decide on the significance level, \(\alpha\).
Step 3: Compute the value of the test statistic
\[ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \]
where
\[ s_p = \sqrt{\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}} \]
Step 4: The t statistic has df = n1 + n2 - 2. Use a table to estimate the p-value, or obtain it exactly by using technology.
Step 5: If the p-value is less than or equal to \(\alpha\) (\(P \le \alpha\)), reject H0; otherwise, fail to reject H0.
Step 6: Interpret the results of the hypothesis test.
Example: We perform a hypothesis test to decide whether there is a difference between the mean salaries of faculty in public and private institutions. Independent random samples of 30 faculty members in public institutions and 35 faculty members in private institutions yield the data in the table below. At the 5% significance level, do the data provide sufficient evidence to conclude that mean salaries for faculty in public and private institutions differ?
Annual salary ($1000s) for 30 faculty members in public institutions and 35 faculty members in private institutions

Sample 1 (public institutions)        Sample 2 (private institutions)
 34.2   56.8   58.2   29.2   60.2      92.9   62.9   45.2   66.3   47.2   71.0
 90.0   41.4   76.8   15.8   88.2      52.0   53.8   76.0   31.1   59.3   97.3
100.4   35.0   84.2   33.8   44.6      63.1  101.0   56.1   71.1   97.5   92.6
 24.6   54.2   79.4   40.2   64.4     118.5   68.6   77.6   73.5   27.2   56.0
107.4   24.4   42.2   51.2   74.0      37.7   51.5   61.6   67.6   81.2   62.3
 63.6   56.0   81.8   41.2   71.0     102.2   46.4   78.3   52.4   24.8
Solution:
Summary statistics for the samples
public institutions: \(\bar{x}_1 = 57.480\), \(s_1 = 23.953\), \(n_1 = 30\)
private institutions: \(\bar{x}_2 = 66.394\), \(s_2 = 22.261\), \(n_2 = 35\)
Step 1: State the null hypothesis and the alternative hypothesis.
\(H_0: \mu_1 = \mu_2\) (mean salaries are the same)
\(H_a: \mu_1 \neq \mu_2\) (mean salaries are different)
Step 2: Decide on the significance level, \(\alpha = 0.05\).
Step 3: Compute the value of the test statistic
\[ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} = \frac{57.480 - 66.394}{23.055 \sqrt{\frac{1}{30} + \frac{1}{35}}} = \frac{-8.914}{5.736} = -1.554 \]
where
\[ s_p = \sqrt{\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{29 (23.953)^2 + 34 (22.261)^2}{63}} = 23.055 \]
Critical-value approach
Step 4: The critical values are \(\pm t_{\alpha/2}\) with degrees of freedom df = n1 + n2 - 2. From a table, the critical values with df = 30 + 35 - 2 = 63 are \(\pm t_{0.025} = \pm 1.998\).
Step 5: If the value of the t test statistic falls in the rejection region, reject H0; otherwise, fail to reject H0. From step 3 the value of the test statistic is t = -1.554, which does not fall in the rejection region, thus we do not reject H0.
Step 6: Interpret the results of the hypothesis test. At the 5% significance level, the data do not provide sufficient evidence to conclude that a difference exists between the mean salaries of faculty in public and private institutions.
p-value approach
Step 4: From a table (with df = 63) the two-tailed p-value is greater than 0.10 and less than 0.20 (0.1 < p < 0.2). By using technology, we obtain the p-value = 2 * P(t >= 1.554) = 0.125 (with df = 63).
Step 5: The p-value is greater than 0.05, so we do not reject H0.
Step 6: At the 5% significance level, the data do not provide sufficient evidence to conclude that a difference exists between the mean salaries of faculty in public and private institutions.
Use SPSS program
Use the SPSS program to perform the hypothesis test in the previous example.
Step 1: Enter the data as shown below.
Step 3: The result is shown below.

Tests of Normality
                   Kolmogorov-Smirnov(a)          Shapiro-Wilk
          TYPE     Statistic   df   Sig.          Statistic   df   Sig.
SALARY    PRIV     .080        35   .200*         .980        35   .755
          PUBL     .105        30   .200*         .975        30   .680
*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction

Group Statistics
          TYPE     N     Mean      Std. Deviation   Std. Error Mean
SALARY    PUBL     30    57.480    23.9528          4.3732
          PRIV     35    66.394    22.2611          3.7628

Independent Samples Test
SALARY                                              Equal variances   Equal variances
                                                    assumed           not assumed
Levene's Test for Equality of Variances
  F                                                 .458
  Sig.                                              .501
t-test for Equality of Means
  t                                                 -1.554            -1.545
  df                                                63                59.853
  Sig. (2-tailed)                                   .125              .128
  Mean Difference                                   -8.914            -8.914
  Std. Error Difference                             5.7363            5.7692
  95% Confidence Interval of the Difference
    Lower                                           -20.3774          -20.4549
    Upper                                           2.5488            2.6264
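The "equal variances assumed" column of the Independent Samples Test can be reproduced from the Group Statistics alone. The following Python sketch is an illustration of the pooled t computation; the function name `pooled_t` is hypothetical.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled two-sample t statistic, its df, and the standard error."""
    df = n1 + n2 - 2
    # Pooled standard deviation: weighted average of the two sample variances.
    sp = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    standard_error = sp * math.sqrt(1 / n1 + 1 / n2)
    return (mean1 - mean2) / standard_error, df, standard_error


# Values taken from the Group Statistics table (PUBL first, then PRIV).
t_stat, df, standard_error = pooled_t(57.480, 23.9528, 30,
                                      66.394, 22.2611, 35)
```

The statistic is about -1.554 on 63 degrees of freedom, and the standard error is about 5.736, matching the SPSS output.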
Mann-Whitney-Wilcoxon Test
This test is another nonparametric method for determining whether there is a difference between two populations.
This test, unlike the Wilcoxon signed-rank test, is not based on a matched sample.
This test does not require interval data or the assumption that both populations are normally distributed.
The only requirement is that the measurement scale for the data is at least ordinal.
Instead of testing for the difference between the means of two populations, this method tests to determine whether the two populations are identical.
The hypotheses are:
H0: The two populations are identical
Ha: The two populations are not identical
Example: Consider the two independent groups as follows:
Rank the data from the lowest to the highest as in the following table.
So U = 7
Making the decision
We reject H0 when the value of U is less than or equal to the critical value U0.
From the Mann-Whitney table, U0 = 3, so we fail to reject the null hypothesis; that is, the data do not provide sufficient evidence that the two populations differ.
Notes:
1. When the alternative hypothesis we reject H0 if U1 < U0
2. When the alternative hypothesis we reject H0 if U2 < U0
3. When the size of one sample or of both samples is large (> 20), we use the standardized normal distribution as follows:
\[ Z = \frac{U - \mu_U}{\sigma_U} \]
where \(\mu_U = \frac{n_1 n_2}{2}\) and \(\sigma_U = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}\)
We reject H0 when the absolute value of Zcal is greater than the critical value Ztab.
Use SPSS program
Use the SPSS program to perform the hypothesis test in the previous example.
Step 1: Enter the data as shown below.
Step 2: The result is shown below.
Mann-Whitney Test

Ranks
          TYPE       N    Mean Rank   Sum of Ranks
SCORES    group E    4    4.25        17.00
          group C    5    5.60        28.00
          Total      9

Test Statistics(b)
                                    SCORES
Mann-Whitney U                      7.000
Wilcoxon W                          17.000
Z                                   -.735
Asymp. Sig. (2-tailed)              .462
Exact Sig. [2*(1-tailed Sig.)]      .556(a)
a. Not corrected for ties.
b. Grouping Variable: TYPE
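The U and Z values in the Test Statistics table can be recovered from the rank sums. The sketch below uses the standard relation U1 = W1 - n1(n1 + 1)/2 together with the normal approximation for U (mean n1 n2 / 2, variance n1 n2 (n1 + n2 + 1) / 12); the function name is hypothetical, and no tie correction is applied.

```python
import math


def mann_whitney_from_rank_sum(rank_sum_1, n1, n2):
    """Return U (the smaller of U1 and U2) and its normal-approximation Z."""
    u1 = rank_sum_1 - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    u = min(u1, u2)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return u, (u - mean_u) / sd_u


# Group E: n = 4 with sum of ranks 17; group C: n = 5.
u, z = mann_whitney_from_rank_sum(17.0, 4, 5)
```

This gives U = 7 and Z of about -0.735, as in the output. With samples this small the table-based decision (or the exact significance) is the one to rely on; the Z value is shown only to make the output traceable.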
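Inference for Two Population Means
The paired t test for two population means (critical-value approach)
Assumptions
1. Paired samples
2. Normal populations or large samples
Step 1: The null hypothesis is \(H_0: \mu_1 = \mu_2\) and the alternative hypothesis is \(H_a: \mu_1 \neq \mu_2\) (two-tailed), or \(H_a: \mu_1 < \mu_2\) (left-tailed), or \(H_a: \mu_1 > \mu_2\) (right-tailed).
Step 2: Decide on the significance level, \(\alpha\).
Step 3: Compute the value of the test statistic
\[ t = \frac{\bar{d}}{s_d / \sqrt{n}} \]
where d = paired difference, \(\bar{d}\) is the mean of the paired differences, and \(s_d\) is their standard deviation.
Step 4: The critical value(s) are \(\pm t_{\alpha/2}\) (two-tailed), or \(-t_{\alpha}\) (left-tailed), or \(t_{\alpha}\) (right-tailed), with degrees of freedom df = n - 1.
Step 5: If the value of the t test statistic falls in the rejection region, reject H0; otherwise, fail to reject H0.
Step 6: Interpret the results of the hypothesis test.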
The paired t test for two population means (p-value approach)
Assumptions
1. Paired samples
2. Normal populations or large samples
Step 1: The null hypothesis is \(H_0: \mu_1 = \mu_2\) and the alternative hypothesis is \(H_a: \mu_1 \neq \mu_2\) (two-tailed), or \(H_a: \mu_1 < \mu_2\) (left-tailed), or \(H_a: \mu_1 > \mu_2\) (right-tailed).
Step 2: Decide on the significance level, \(\alpha\).
Step 3: Compute the value of the test statistic
\[ t = \frac{\bar{d}}{s_d / \sqrt{n}} \]
where d = paired difference.
Step 4: The t statistic has df = n - 1. Use a table to estimate the p-value, or obtain it exactly by using technology.
Step 5: If the p-value is less than or equal to \(\alpha\) (\(P \le \alpha\)), reject H0; otherwise, fail to reject H0.
Step 6: Interpret the results of the hypothesis test.
Example: The gas mileages of 10 randomly selected cars, both with and without a new gasoline additive, are displayed in the second and third columns of the table below. At the 5% significance level, do the data provide sufficient evidence to conclude that, on average, the gasoline additive improves gas mileage?

Car    Gas mileage      Gas mileage         Paired
       with additive    without additive    difference
1      25.7             24.9                 0.8
2      20.0             18.8                 1.2
3      28.4             27.7                 0.7
4      13.7             13.0                 0.7
5      18.8             17.0                 1.8
6      12.5             11.3                 1.2
7      28.4             27.8                 0.6
8       8.1              8.2                -0.1
9      23.1             23.1                 0
10     10.4              9.9                 0.5
Solution:
Step 1: State the null hypothesis and the alternative hypothesis.
Let \(\mu_1\) denote the mean gas mileage when the additive is used, and let \(\mu_2\) denote the mean gas mileage when the additive is not used.
\(H_0: \mu_1 = \mu_2\) (mean gas mileage with additive is not greater)
\(H_a: \mu_1 > \mu_2\) (mean gas mileage with additive is greater)
Note that the hypothesis test is right-tailed because a greater-than sign (>) appears in the alternative hypothesis.
Step 2: Decide on the significance level, \(\alpha = 0.05\).
Step 3: Compute the value of the test statistic
\[ t = \frac{\bar{d}}{s_d / \sqrt{n}} = \frac{0.74}{0.566 / \sqrt{10}} = 4.134 \]
Critical-value approach
Step 4: The critical value is \(t_{\alpha}\) with degrees of freedom df = n - 1. From a table, the critical value with df = 10 - 1 = 9 is \(t_{0.05} = 1.833\).
Step 5: If the value of the t test statistic falls in the rejection region, reject H0; otherwise, fail to reject H0. From step 3 the value of the test statistic is t = 4.134, which falls in the rejection region, thus we reject H0.
Step 6: Interpret the results of the hypothesis test. At the 5% significance level, the data provide sufficient evidence to conclude that the gasoline additive improves gas mileage.
p-value approach
Step 4: From a table (with df = 10 - 1 = 9) the right-tailed p-value is the probability of observing a value of t of 4.134 or greater. We find that p < 0.005 (using technology, we obtain p = .0013).
Step 5: The p-value is less than 0.05, so we reject H0.
Step 6: At the 5% significance level, the data provide sufficient evidence to conclude that the gasoline additive improves gas mileage.
Use SPSS program
Use the SPSS program to perform the hypothesis test in the previous example.
Step 1: Enter the data as shown below.
Step 3: The result is shown below.
T-Test

Paired Samples Statistics
                      Mean       N     Std. Deviation   Std. Error Mean
Pair 1   ADDITI       18.9100    10    7.47209          2.36288
         WITH.ADD     18.1700    10    7.42848          2.34909

Paired Samples Correlations
                               N     Correlation   Sig.
Pair 1   ADDITI & WITH.ADD     10    .997          .000

Paired Samples Test
                               Paired Differences
                               Mean     Std. Deviation    t        df    Sig. (2-tailed)
Pair 1   ADDITI - WITH.ADD     .7400    .56608            4.134    9     .003
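The paired t statistic in the output can be reproduced directly from the ten pairs of mileages. This is a minimal Python sketch with a hypothetical function name; it computes the differences as "with additive" minus "without additive", the same direction used in the example.

```python
import math
import statistics

with_additive = [25.7, 20.0, 28.4, 13.7, 18.8, 12.5, 28.4, 8.1, 23.1, 10.4]
without_additive = [24.9, 18.8, 27.7, 13.0, 17.0, 11.3, 27.8, 8.2, 23.1, 9.9]


def paired_t(first, second):
    """Return the mean difference, SD of differences, t statistic, and df."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)       # sample SD of the paired differences
    t_stat = mean_d / (sd_d / math.sqrt(n))
    return mean_d, sd_d, t_stat, n - 1


mean_d, sd_d, t_stat, df = paired_t(with_additive, without_additive)
```

The mean difference is 0.74, the standard deviation of the differences is about 0.566, and t is about 4.134 on 9 degrees of freedom. Note that SPSS reports a two-tailed significance (.003); for the right-tailed test in the example this is halved, giving roughly the .0013 quoted above.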
Example:
From the Wilcoxon signed ranks table we find that the critical value at N = 14 is 21. Since W = 6 < 21, we reject the null hypothesis.
Note: if n > 15 we can use the normal distribution for the Wilcoxon signed ranks test, where the Z statistic is as follows:
\[ Z = \frac{W - \mu_W}{\sigma_W} \]
where \(\mu_W = \frac{n(n+1)}{4}\) and \(\sigma_W = \sqrt{\frac{n(n+1)(2n+1)}{24}}\)
We reject H0 if the absolute value of Zcal is greater than the critical value Ztab.
Example using SPSS program
Wilcoxon Signed Ranks Test

Ranks
                                     N       Mean Rank   Sum of Ranks
DRUG_B - DRUG_A   Negative Ranks     3(a)    2.33        7.00
                  Positive Ranks     9(b)    7.89        71.00
                  Ties               0(c)
                  Total              12
a. DRUG_B < DRUG_A
b. DRUG_B > DRUG_A
c. DRUG_A = DRUG_B

Test Statistics(b)
                           DRUG_B - DRUG_A
Z                          -2.511(a)
Asymp. Sig. (2-tailed)     .012
a. Based on negative ranks.
b. Wilcoxon Signed Ranks Test
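The Z value in this output can be approximated from the rank sums with the usual normal approximation for the signed-rank statistic (mean n(n + 1)/4, variance n(n + 1)(2n + 1)/24). The sketch below is an illustration only: the function name is hypothetical, n counts the non-tied pairs, and no correction for tied ranks is applied, which is presumably why the result differs from the SPSS value in the third decimal place.

```python
import math


def wilcoxon_z(rank_sum, n):
    """Normal-approximation Z for a Wilcoxon signed-rank sum with n non-tied pairs."""
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (rank_sum - mean_w) / sd_w


# DRUG_B - DRUG_A: the negative ranks sum to 7 across 12 non-tied pairs.
z = wilcoxon_z(7.0, 12)
```

The result is about -2.51, in line with the Z = -2.511 reported by SPSS.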
SPSS output
Wilcoxon Signed Ranks Test

Ranks
                                                          N        Mean Rank   Sum of Ranks
Cases per 100,000 population, 1993 -    Negative Ranks    56(a)    79.35       4443.50
Cases per 100,000 population, 1992      Positive Ranks    89(b)    69.01       6141.50
                                        Ties              62(c)
                                        Total             207
a. Cases per 100,000 population, 1993 < Cases per 100,000 population, 1992
b. Cases per 100,000 population, 1993 > Cases per 100,000 population, 1992
c. Cases per 100,000 population, 1992 = Cases per 100,000 population, 1993

Test Statistics(b)
                           Cases per 100,000 population, 1993 -
                           Cases per 100,000 population, 1992
Z                          -1.678(a)
Asymp. Sig. (2-tailed)     .093
a. Based on negative ranks.
b. Wilcoxon Signed Ranks Test
Wilcoxon Signed Ranks Test

Ranks
                                          N        Mean Rank   Sum of Ranks
Pronethalol - Placebo   Negative Ranks    11(a)    6.09        67.00
                        Positive Ranks    1(b)     11.00       11.00
                        Ties              0(c)
                        Total             12
a. Pronethalol < Placebo
b. Pronethalol > Placebo
c. Placebo = Pronethalol

Test Statistics(b)
                           Pronethalol - Placebo
Z                          -2.201(a)
Asymp. Sig. (2-tailed)     .028
a. Based on positive ranks.
b. Wilcoxon Signed Ranks Test
Analysis of Variance (ANOVA)
Analysis of Variance (ANOVA) can be used to test for the equality of three or more population means using data obtained from observational or experimental studies.
We want to use the sample results to test the following hypotheses.
H0: \(\mu_1 = \mu_2 = \mu_3 = \dots = \mu_k\)
Ha: Not all population means are equal
If H0 is rejected, we cannot conclude that all population means are different.
Rejecting H0 means that at least two population means have different values.
Assumptions for Analysis of Variance
1. For each population, the response variable is normally distributed.
2. The variance of the response variable, denoted \(\sigma^2\), is the same for all of the populations.
3. The observations must be independent.
Between-Samples Estimate of Population Variance
The estimate of \(\sigma^2\) based on the variation of the sample means around the overall mean is called the mean square between (MSB).
\[ MSB = \frac{\sum_{j=1}^{k} n_j (\bar{x}_j - \bar{\bar{x}})^2}{k - 1} \]
The numerator of MSB is called the sum of squares between (SSB).
The denominator of MSB represents the degrees of freedom associated with SSB.
Within-Samples Estimate of Population Variance
The estimate of \(\sigma^2\) based on the variation of the sample observations within each sample is called the mean square within (MSW).
\[ MSW = \frac{\sum_{j=1}^{k} (n_j - 1) s_j^2}{n_T - k} \]
The numerator of MSW is called the sum of squares within (SSW). The denominator of MSW represents the degrees of freedom associated with SSW.
Comparing the Variance Estimates: The F Test
If the null hypothesis is true and the ANOVA assumptions are valid, the sampling distribution of MSB/MSW is an F distribution with MSB d.f. equal to k - 1 and MSW d.f. equal to nT - k.
If the means of the k populations are not equal, the value of MSB/MSW will be inflated because MSB overestimates \(\sigma^2\).
Hence, we will reject H0 if the resulting value of MSB/MSW appears to be too large to have been selected at random from the appropriate F distribution.
Test for the Equality of k Population Means
Hypotheses
H0: \(\mu_1 = \mu_2 = \mu_3 = \dots = \mu_k\)
Ha: Not all population means are equal
Test Statistic
F = MSB/MSW
Rejection Rule
Reject H0 if \(F > F_{\alpha}\), where the value of \(F_{\alpha}\) is based on an F distribution with k - 1 numerator degrees of freedom and nT - k denominator degrees of freedom.
Sampling Distribution of MSB/MSW
The figure below shows the rejection region associated with a level of significance equal to \(\alpha\), where \(F_{\alpha}\) denotes the critical value.
The ANOVA Table

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F
Between               SSB              k - 1                MSB           MSB/MSW
Within                SSW              nT - k               MSW
Total                 SST              nT - 1

SST divided by its degrees of freedom nT - 1 is simply the overall sample variance that would be obtained if we treated the entire nT observations as one data set.
Example: Reed Manufacturing
We would like to know if the mean number of hours worked per week is the same for the department managers at the three manufacturing plants (Buffalo, Pittsburgh, and Detroit).
A simple random sample of 5 managers from each of the three plants was taken, and the number of hours worked by each manager for the previous week is shown in the ONE WAY ANOVA file.
Hypotheses
H0: \(\mu_1 = \mu_2 = \mu_3\)
Ha: Not all the means are equal
where:
\(\mu_1\) = mean number of hours worked per week by the managers at Plant 1
\(\mu_2\) = mean number of hours worked per week by the managers at Plant 2
\(\mu_3\) = mean number of hours worked per week by the managers at Plant 3
• Mean Square Between
Since the sample sizes are all equal, \(\bar{\bar{x}} = (55 + 68 + 57)/3 = 60\)
SSB = 5(55 - 60)^2 + 5(68 - 60)^2 + 5(57 - 60)^2 = 490
MSB = 490/(3 - 1) = 245
• Mean Square Within
SSW = 4(26.0) + 4(26.5) + 4(24.5) = 308
MSW = 308/(15 - 3) = 25.667
• F – Test
If H0 is true, the ratio MSB/MSW should be near 1 since both MSB and MSW are estimating \(\sigma^2\). If Ha is true, the ratio should be significantly larger than 1 since MSB tends to overestimate \(\sigma^2\).
• Rejection Rule
Assuming \(\alpha = .05\), \(F_{.05} = 3.89\) (2 d.f. numerator, 12 d.f. denominator). Reject H0 if F > 3.89.
• Test Statistic
F = MSB/MSW = 245/25.667 = 9.55
• Conclusion
F = 9.55 > F.05 = 3.89, and the p-value = P(F > 9.55) = 0.0033 < 0.05, so we reject H0. The mean number of hours worked per week by department managers is not the same at each plant.
• ANOVA Table

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F
Between               490              2                    245           9.55
Within                308              12                   25.667
Total                 798              14
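The Reed Manufacturing arithmetic uses only the sample means, sample variances, and sample sizes of the three plants, so it can be checked with a short Python sketch. The function name below is hypothetical; the critical value 3.89 is the table value quoted in the example.

```python
def one_way_anova_from_summary(means, variances, sizes):
    """Return SSB, SSW, MSB, MSW and F from per-group summary statistics."""
    k = len(means)
    n_total = sum(sizes)
    # Overall mean, weighted by the group sizes.
    grand_mean = sum(n * m for n, m in zip(sizes, means)) / n_total
    ssb = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
    ssw = sum((n - 1) * v for n, v in zip(sizes, variances))
    msb = ssb / (k - 1)
    msw = ssw / (n_total - k)
    return ssb, ssw, msb, msw, msb / msw


# Plant means 55, 68, 57; sample variances 26.0, 26.5, 24.5; 5 managers each.
ssb, ssw, msb, msw, f_stat = one_way_anova_from_summary(
    [55.0, 68.0, 57.0], [26.0, 26.5, 24.5], [5, 5, 5])

reject_h0 = f_stat > 3.89  # F(.05) with 2 and 12 d.f., from the F table
```

This reproduces SSB = 490, SSW = 308, and F of about 9.55.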
Multiple Comparison Procedures
Suppose that analysis of variance has provided statistical evidence to reject the null hypothesis of equal population means. Fisher's least significant difference (LSD) procedure can be used to determine where the differences occur.
Hypotheses
H0: \(\mu_i = \mu_j\)
Ha: \(\mu_i \neq \mu_j\)
Test Statistic
\[ t = \frac{\bar{x}_i - \bar{x}_j}{\sqrt{MSW \left( \frac{1}{n_i} + \frac{1}{n_j} \right)}} \]
Rejection Rule
Reject H0 if \(t < -t_{\alpha/2}\) or \(t > t_{\alpha/2}\), where the value of \(t_{\alpha/2}\) is based on a t distribution with nT - k degrees of freedom.
Fisher's LSD Procedure Based on the Test Statistic \(\bar{x}_i - \bar{x}_j\)
• Hypotheses
H0: \(\mu_i = \mu_j\)
Ha: \(\mu_i \neq \mu_j\)
• Test Statistic
\(\bar{x}_i - \bar{x}_j\)
• Rejection Rule
Reject H0 if \(|\bar{x}_i - \bar{x}_j| > LSD\), where
\[ LSD = t_{\alpha/2} \sqrt{MSW \left( \frac{1}{n_i} + \frac{1}{n_j} \right)} \]
Example: Reed Manufacturing
Fisher's LSD
Assuming \(\alpha = .05\), \(t_{.025} = 2.179\) (12 d.f.), so
\[ LSD = 2.179 \sqrt{25.667 \left( \frac{1}{5} + \frac{1}{5} \right)} = 6.98 \]
• Hypotheses (A)
H0: \(\mu_1 = \mu_2\)
Ha: \(\mu_1 \neq \mu_2\)
• Test Statistic
\(|\bar{x}_1 - \bar{x}_2|\) = |55 - 68| = 13 > 6.98
• Conclusion
The mean number of hours worked at Plant 1 is not equal to the mean number worked at Plant 2.
• Hypotheses (B)
H0: \(\mu_1 = \mu_3\)
Ha: \(\mu_1 \neq \mu_3\)
• Test Statistic
\(|\bar{x}_1 - \bar{x}_3|\) = |55 - 57| = 2 < 6.98
• Conclusion
There is no significant difference between the mean number of hours worked at Plant 1 and the mean number of hours worked at Plant 3.
• Hypotheses (C)
H0: \(\mu_2 = \mu_3\)
Ha: \(\mu_2 \neq \mu_3\)
• Test Statistic
\(|\bar{x}_2 - \bar{x}_3|\) = |68 - 57| = 11 > 6.98
• Conclusion
The mean number of hours worked at Plant 2 is not equal to the mean number worked at Plant 3.
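The three pairwise comparisons above all use the same least significant difference, so they can be checked together. The Python sketch below is an illustration under stated assumptions: the function name is hypothetical, MSW = 308/12 comes from the Reed Manufacturing ANOVA, and 2.179 is the t table value for an upper-tail area of .025 with 12 degrees of freedom.

```python
import math
from itertools import combinations


def fisher_lsd(msw, n_i, n_j, t_crit):
    """Least significant difference for comparing two group means."""
    return t_crit * math.sqrt(msw * (1 / n_i + 1 / n_j))


plant_means = {"Plant 1": 55.0, "Plant 2": 68.0, "Plant 3": 57.0}
msw = 308 / 12                         # mean square within
lsd = fisher_lsd(msw, 5, 5, 2.179)     # every plant has 5 managers

# A pair differs significantly when the absolute mean difference exceeds LSD.
significant = {
    (a, b): abs(plant_means[a] - plant_means[b]) > lsd
    for a, b in combinations(plant_means, 2)
}
```

The LSD is about 6.98, so Plant 1 vs Plant 2 (difference 13) and Plant 2 vs Plant 3 (difference 11) are significant, while Plant 1 vs Plant 3 (difference 2) is not.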
SPSS PROCEDURES
Open the file named One WAY ANOVA.
SPSS output

Descriptives
SCORES
                                                         95% Confidence Interval for Mean
              N     Mean     Std. Deviation   Std. Error   Lower Bound   Upper Bound   Minimum   Maximum
buffalo       5     55.00    5.099            2.280        48.67         61.33         48        62
PITTSBURGH    5     68.00    5.148            2.302        61.61         74.39         63        74
detroit       5     57.00    4.950            2.214        50.85         63.15         51        63
Total         15    60.00    7.550            1.949        55.82         64.18         48        74

ANOVA
SCORES
                  Sum of Squares   df    Mean Square   F        Sig.
Between Groups    490.000          2     245.000       9.545    .003
Within Groups     308.000          12    25.667
Total             798.000          14

Post Hoc Tests
Multiple Comparisons
Dependent Variable: SCORES
LSD
                             Mean Difference                          95% Confidence Interval
(I) GROUP     (J) GROUP      (I-J)             Std. Error   Sig.      Lower Bound   Upper Bound
buffalo       PITTSBURGH     -13.00*           3.204        .002      -19.98        -6.02
              detroit        -2.00             3.204        .544      -8.98         4.98
PITTSBURGH    buffalo        13.00*            3.204        .002      6.02          19.98
              detroit        11.00*            3.204        .005      4.02          17.98
detroit       buffalo        2.00              3.204        .544      -4.98         8.98
              PITTSBURGH     -11.00*           3.204        .005      -17.98        -4.02
*. The mean difference is significant at the .05 level.
Example: energy consumption
Independent random samples of households in four regions yielded the data on last year's energy consumption shown in the table below. At the 5% significance level, do the data provide sufficient evidence to conclude that a difference exists in last year's mean energy consumption by households among the four regions?

Northeast   Midwest   South   West
15          17        11      10
10          12        7       12
13          18        9       8
14          13        13      7
13          15                9
            12

Solution:
1. State the null and alternative hypotheses.
H0: \(\mu_1 = \mu_2 = \mu_3 = \mu_4\) (mean energy consumptions are equal)
Ha: Not all the means are equal.
2. Decide on the significance level.
We perform the hypothesis test at the 5% significance level; consequently, \(\alpha = 0.05\).
3. Compute the F statistic and the critical value.
From the results, the F statistic = 6.32, and from the table the critical value for F is 3.24 with df = (k - 1, n - k) = (3, 16), where k = number of populations and n = total number of observations.
4. Conclusion: if the value of the F statistic falls in the rejection region, or the p-value is less than \(\alpha\), reject H0; otherwise do not reject H0.
The F statistic = 6.32 > F tabulated = 3.24 (it falls in the rejection region), and the p-value = 0.00495 < 0.05, so we reject H0.
5. Interpret the results.
At the 5% significance level, the data provide sufficient evidence to conclude that a difference exists in last year's mean energy consumption by households among the four regions.
Use SPSS program
Use the SPSS program to perform the hypothesis test in the previous example.
Step 1: Enter the data as shown below.
Step 3: The result is shown below.

Descriptives
ENERGY
                                                        95% Confidence Interval for Mean
             N     Mean     Std. Deviation   Std. Error   Lower Bound   Upper Bound
northeast    5     13.00    1.871            .837         10.68         15.32
midwest      6     14.50    2.588            1.057        11.78         17.22
south        4     10.00    2.582            1.291        5.89          14.11
west         5     9.20     1.924            .860         6.81          11.59
Total        20    11.90    3.076            .688         10.46         13.34

Test of Homogeneity of Variances
ENERGY
Levene Statistic   df1   df2   Sig.
.844               3     16    .490

ANOVA
ENERGY
                  Sum of Squares   df    Mean Square   F        Sig.
Between Groups    97.500           3     32.500        6.318    .005
Within Groups     82.300           16    5.144
Total             179.800          19
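Because the energy consumption example lists the raw observations and the groups have unequal sizes, the ANOVA table can also be rebuilt directly from the data. The Python sketch below is a minimal illustration with a hypothetical function name; it does not compute the p-value.

```python
import statistics

# Last year's energy consumption by region, as listed in the example.
energy = {
    "northeast": [15, 10, 13, 14, 13],
    "midwest": [17, 12, 18, 13, 15, 12],
    "south": [11, 7, 9, 13],
    "west": [10, 12, 8, 7, 9],
}


def one_way_anova(groups):
    """Return SSB, SSW and F for a dict of raw samples (sizes may differ)."""
    all_values = [x for sample in groups.values() for x in sample]
    grand_mean = statistics.mean(all_values)
    k = len(groups)
    n_total = len(all_values)
    ssb = sum(len(s) * (statistics.mean(s) - grand_mean) ** 2
              for s in groups.values())
    ssw = sum((x - statistics.mean(s)) ** 2
              for s in groups.values() for x in s)
    f_stat = (ssb / (k - 1)) / (ssw / (n_total - k))
    return ssb, ssw, f_stat


ssb, ssw, f_stat = one_way_anova(energy)
```

The sums of squares come out at 97.5 (between) and 82.3 (within), and F is about 6.318, matching the ANOVA table.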
Post Hoc Tests
Multiple Comparisons
Dependent Variable: ENERGY
                                         Mean Difference
            (I) REGION    (J) REGION     (I-J)             Std. Error   Sig.
Scheffe     northeast     midwest        -1.50             1.373        .757
                          south          3.00              1.521        .310
                          west           3.80              1.434        .112
            midwest       northeast      1.50              1.373        .757
                          south          4.50              1.464        .054
                          west           5.30*             1.373        .013
            south         northeast      -3.00             1.521        .310
                          midwest        -4.50             1.464        .054
                          west           .80               1.521        .963
            west          northeast      -3.80             1.434        .112
                          midwest        -5.30*            1.373        .013
                          south          -.80              1.521        .963
Tamhane     northeast     midwest        -1.50             1.348        .877
                          south          3.00              1.538        .486
                          west           3.80              1.200        .077
            midwest       northeast      1.50              1.348        .877
                          south          4.50              1.668        .180
                          west           5.30*             1.363        .022
            south         northeast      -3.00             1.538        .486
                          midwest        -4.50             1.668        .180
                          west           .80               1.551        .997
            west          northeast      -3.80             1.200        .077
                          midwest        -5.30*            1.363        .022
                          south          -.80              1.551        .997
*. The mean difference is significant at the .05 level.
SPSS PROCEGERS :1. ENTER THE DATA
111
Doing Data Analysis With SPSS Dr. Nafez M. Barakat
SPSS OUTPUTS
Kruskal-Wallis Test

Ranks
         GROUP    N    Mean Rank
SCORES   A        5       8.00
         B        5      11.10
         C        6      16.17
         D        6      10.08
         Total   22

Test Statistics(a,b)
               SCORES
Chi-Square      4.861
df                  3
Asymp. Sig.      .182
a. Kruskal Wallis Test
b. Grouping Variable: GROUP
P-value = $P(\chi^2 > 4.861)$ = 0.1823 > 0.05, so we fail to reject H0
SPSS OUTPUTS
Kruskal-Wallis Test

Ranks
                                   WHO Region               N    Mean Rank
Cases per 100,000 population, 1993 Africa                  46     138.82
                                   Americas                45     142.31
                                   Eastern Mediterranean   22      71.50
                                   Europe                  48      89.67
                                   South-East Asia         11      54.09
                                   Western Pacific         35      64.76
                                   Total                  207

Test Statistics(a,b)
               Cases per 100,000 population, 1993
Chi-Square      67.252
df                   5
Asymp. Sig.       .000
a. Kruskal Wallis Test
b. Grouping Variable: WHO Region
Critical value = 5.99
p-value = $P(\chi^2 > 5.22)$ = 0.0735 > 0.05
So we fail to reject H0
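The two numbers above can be checked without a table. The sketch below assumes 2 degrees of freedom, which is what a critical value of 5.99 at the 5% level corresponds to; for that special case the chi-square upper-tail probability has the closed form exp(-x/2). The function names are illustrative.

```python
import math


def chi2_upper_tail_df2(x):
    """P(chi-square > x) for 2 degrees of freedom (closed form)."""
    return math.exp(-x / 2)


def chi2_critical_df2(alpha):
    """Critical value with upper-tail area alpha, 2 degrees of freedom."""
    return -2 * math.log(alpha)


p_value = chi2_upper_tail_df2(5.22)
critical = chi2_critical_df2(0.05)
print("critical value =", round(critical, 2))
print("p-value        =", round(p_value, 4))
print("reject H0" if p_value <= 0.05 else "fail to reject H0")
```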
Contingency tables association (chi-square test)
The chi-square independence test is used to decide, based on sample data, whether two variables of a population are statistically related.
Assumptions:
1. All expected frequencies are 1 or greater
2. At most 20% of all expected frequencies are less than 5
3. The variables must be categorical data
Step 1: the null and alternative hypotheses are
H0 : The two variables under consideration are not associated
Ha : The two variables under consideration are associated
Step 2: calculate the expected frequencies by using the formula

$$E = \frac{R \times C}{n}$$

where R = row total, C = column total, and n = sample size. Place each expected frequency below its corresponding observed frequency in the contingency table.
Step 3: determine whether the expected frequencies satisfy assumptions 1, 2, and 3. If they do not, this procedure should not be used.
Step 4: decide on the significance level $\alpha$
Step 5: compute the value of the test statistic

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where E and O represent expected and observed frequencies, respectively.
Step 6: the critical value is $\chi^2_\alpha$ with df = (r-1)(c-1), where r and c are the numbers of categories of the two variables; use the chi-square table to find the critical value.
Step 7: if the value of the test statistic falls in the rejection region, or the P-value is less than or equal to $\alpha$, reject H0; otherwise, do not reject H0.
Step 8: interpret the results of the hypothesis test.
EXAMPLE: A random sample of 1772 U.S. adults yielded the data on marital status and alcohol consumption displayed in the table below. At the 5% significance level, do the data provide sufficient evidence to conclude that an association exists between marital status and alcohol consumption?
Cross tabulation
                         Drinks per month
Marital status   Abstain    1-60    Over 60    Total
Single              67       213       74        354
Married            411       633      129       1173
Widowed             85        51        7        143
Divorced            27        60       15        102
Total              590       957      225       1772
Solution:
Step 1: the null and alternative hypotheses are
H0: marital status and alcohol consumption are not associated
Ha: marital status and alcohol consumption are associated
Step 2: calculate the expected frequencies by using the formula

$$E = \frac{R \times C}{n}$$

where R = row total, C = column total, and n = sample size. Place each expected frequency below its corresponding observed frequency in the contingency table.
Cross tabulation (O = observed frequency, E = expected frequency)
                              Drinks per month
Marital status        Abstain     1-60     Over 60    Total
Single         O         67        213        74        354
               E       117.9     191.2      44.9
Married        O        411        633       129       1173
               E       390.6     633.5     148.9
Widowed        O         85         51         7        143
               E        47.6      77.2      18.2
Divorced       O         27         60        15        102
               E        34.0      55.1      13.0
Total                   590        957       225       1772
Step 3: determine whether the expected frequencies satisfy assumptions 1, 2, and 3. If they do not, this procedure should not be used.
1. All expected frequencies are greater than 1
2. None of the expected frequencies is less than 5 (0% of cells have an expected frequency less than 5)
3. The two variables are categorical
Step 4: decide on the significance level: $\alpha = 0.05$
Step 5: compute the value of the test statistic

$$\chi^2 = \sum \frac{(O - E)^2}{E} = 94.269$$

where E and O represent expected and observed frequencies, respectively.
Step 6: the critical value is $\chi^2_\alpha$ with df = (r-1)(c-1), where r and c are the numbers of categories of the two variables; use the chi-square table to find the critical value.
r = 4, c = 3, so df = (4-1)(3-1) = 6, and from the table $\chi^2_{0.05} = 12.592$
Step 7: if the value of the test statistic falls in the rejection region, or the P-value is less than or equal to $\alpha$, reject H0; otherwise, do not reject H0.
The value of the test statistic falls in the rejection region, thus we reject H0. The p-value is also < 0.05, so again we reject H0.
Step 8: interpret the results of the hypothesis test.
At the 5% significance level, the data provide sufficient evidence to conclude that there is an association between marital status and alcohol consumption.
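The hand calculation in Steps 2 to 7 can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the SPSS procedure; the function name is hypothetical, and the critical value 12.592 is the table value for 6 degrees of freedom at the 5% level.

```python
# Observed frequencies: rows are marital status, columns are
# drinks per month (Abstain, 1-60, Over 60).
observed = {
    "Single":   [67, 213, 74],
    "Married":  [411, 633, 129],
    "Widowed":  [85, 51, 7],
    "Divorced": [27, 60, 15],
}


def chi_square_independence(table):
    """Return (chi-square statistic, df, expected frequencies)."""
    rows = list(table.values())
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    n = sum(row_totals)
    # E = (row total * column total) / sample size
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum((o - e) ** 2 / e
                    for obs_row, exp_row in zip(rows, expected)
                    for o, e in zip(obs_row, exp_row))
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return statistic, df, expected


statistic, df, expected = chi_square_independence(observed)
smallest_expected = min(min(row) for row in expected)
critical_value = 12.592  # chi-square table, alpha = 0.05, df = 6

print("chi-square =", round(statistic, 3), "with df =", df)
print("smallest expected frequency =", round(smallest_expected, 2))
print("reject H0" if statistic > critical_value else "do not reject H0")
```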
Use SPSS program
Use the SPSS program to perform the hypothesis test in the previous example.
STEP 1: Enter the data as shown below
SPSS output
Step 3: the result is shown below
MARTIAL * DRINK Crosstabulation
                                        DRINK
MARTIAL                      abstain     1-60    over 60     Total
single     Count                67        213       74         354
           Expected Count     117.9     191.2      44.9      354.0
married    Count               411        633      129        1173
           Expected Count     390.6     633.5     148.9     1173.0
widowed    Count                85         51        7         143
           Expected Count      47.6      77.2      18.2      143.0
divorced   Count                27         60       15         102
           Expected Count      34.0      55.1      13.0      102.0
Total      Count               590        957      225        1772
           Expected Count     590.0     957.0     225.0     1772.0
Chi-Square Tests
                                 Value      df    Asymp. Sig. (2-sided)
Pearson Chi-Square              94.269(a)    6           .000
Likelihood Ratio                93.096       6           .000
Linear-by-Linear Association    32.265       1           .000
N of Valid Cases                 1772
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 12.95.
[Figure: clustered bar chart of Count (0 to 700) by MARITAL (Single, Married, Widowed, Divorced), with one bar per DRINKS category (Abstain, 1-60, Over 60)]
Goodness of Fit Test(A Multinomial Population)
1. Set up the null and alternative hypotheses.
2. Select a random sample and record the observed frequency, fi, for each of the k categories.
3. Assuming H0 is true, compute the expected frequency, ei, in each category by multiplying the category probability by the sample size.
4. Compute the value of the test statistic:

$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$$

5. Reject H0 if $\chi^2 \ge \chi^2_\alpha$ (where $\alpha$ is the significance level and there are k - 1 degrees of freedom).

Example: Finger Lakes Homes (A)
Finger Lakes Homes manufactures four models of prefabricated homes: a two-story colonial, a ranch, a split-level, and an A-frame. To help in production planning, management would like to determine if previous customer purchases indicate that there is a preference in the style selected. The number of homes sold of each model for 100 sales over the past two years is shown below.

Model     Colonial   Ranch   Split-Level   A-Frame
# Sold       30        20         35          15

• Notation
pC = population proportion that purchase a colonial
pR = population proportion that purchase a ranch
pS = population proportion that purchase a split-level
pA = population proportion that purchase an A-frame

• Hypotheses
H0: pC = pR = pS = pA = .25
Ha: The population proportions are not pC = .25, pR = .25, pS = .25, and pA = .25

• Expected Frequencies
e1 = .25(100) = 25   e2 = .25(100) = 25   e3 = .25(100) = 25   e4 = .25(100) = 25
• Test Statistic

$$\chi^2 = \frac{(30-25)^2}{25} + \frac{(20-25)^2}{25} + \frac{(35-25)^2}{25} + \frac{(15-25)^2}{25} = 1 + 1 + 4 + 4 = 10$$
• Conclusion
$\chi^2 = 10 > \chi^2_{.05} = 7.81$ (3 degrees of freedom), so we reject the assumption that there is no home style preference, at the .05 level of significance.
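The goodness-of-fit statistic and its P-value can be sketched as follows. The upper-tail formula used here is the closed form that holds for exactly 3 degrees of freedom (it is not a general chi-square routine), and all names are illustrative assumptions.

```python
import math

# Observed sales per model and the proportion hypothesized under H0.
observed = {"Colonial": 30, "Ranch": 20, "Split-Level": 35, "A-Frame": 15}
hypothesized_proportion = 0.25


def goodness_of_fit(counts, proportion):
    """Chi-square statistic when every category has the same proportion."""
    n = sum(counts.values())
    expected = n * proportion
    return sum((f - expected) ** 2 / expected for f in counts.values())


def chi2_upper_tail_df3(x):
    """P(chi-square > x) for exactly 3 degrees of freedom."""
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))


statistic = goodness_of_fit(observed, hypothesized_proportion)
p_value = chi2_upper_tail_df3(statistic)
print("chi-square =", statistic)
print("p-value    =", round(p_value, 4))
print("reject H0" if p_value <= 0.05 else "do not reject H0")
```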
Use SPSS program
1. Enter the data as shown below
Click OK

SPSS OUTPUT
Chi-Square Test
MODEL
               Observed N   Expected N   Residual
Colonial           30          25.0         5.0
Ranch              20          25.0        -5.0
Split-Level        35          25.0        10.0
A-Frame            15          25.0       -10.0
Total             100

Test Statistics
                 MODEL
Chi-Square(a)   10.000
df                   3
Asymp. Sig.       .019
a. 0 cells (.0%) have expected frequencies less than 5. The minimum expected cell frequency is 25.0.
P-value = $P(\chi^2 > 10)$ = 0.0186 < 0.05, so we reject H0
Inferential methods in regression and correlation
* Correlations
Correlation Coefficient, r
The quantity r, called the linear correlation coefficient, measures the strength and the direction of a linear relationship between two variables. The linear correlation coefficient is sometimes referred to as the Pearson product moment correlation coefficient in honor of its developer Karl Pearson. The mathematical formula for computing r is:
$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$

where n is the number of pairs of data.

The value of r is such that -1 ≤ r ≤ +1. The + and - signs are used for positive linear correlations and negative linear correlations, respectively.
Positive correlation: If x and y have a strong positive linear correlation, r is close to +1. An r value of exactly +1 indicates a perfect positive fit. Positive values indicate a relationship between the x and y variables such that as values of x increase, values of y also increase.
Negative correlation: If x and y have a strong negative linear correlation, r is close to -1. An r value of exactly -1 indicates a perfect negative fit. Negative values indicate a relationship between x and y such that as values of x increase, values of y decrease.
No correlation: If there is no linear correlation or a weak linear correlation, r is close to 0. A value near zero means that there is little or no linear relationship between the two variables; the points may be scattered randomly or may follow a nonlinear pattern.
Note that r is a dimensionless quantity; that is, it does not depend on the units employed. A perfect correlation of ±1 occurs only when the data points all lie exactly on a straight line. If r = +1, the slope of this line is positive. If r = -1, the slope of this line is negative.
Pearson correlation test

The t-distribution is used for a correlation test. The test statistic is

$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$

with df = n - 2. The null hypothesis $H_0: \rho = 0$ is tested versus
$H_a: \rho \neq 0$ or $H_a: \rho < 0$ or $H_a: \rho > 0$.

EXAMPLE: The table below shows the age and price data for a sample of 11 Orions. At the 5% significance level, do the data provide sufficient evidence to conclude that the age and price of Orions are negatively linearly correlated?

Car  Age  Price     Car  Age  Price
 1    5     85       7    6     66
 2    4    103       8    6     95
 3    6     70       9    2    169
 4    5     82      10    7     70
 5    5     89      11    7     48
 6    5     98
Solution
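First: by hand
1. The null and alternative hypotheses
$H_0: \rho = 0$ (age and price are not linearly correlated)
$H_a: \rho < 0$ (age and price are negatively linearly correlated)
2. Calculate the data as in the table below

age (x)   price (y)   x-square   y-square    x*y
   5         85          25        7225      425
   4        103          16       10609      412
   6         70          36        4900      420
   5         82          25        6724      410
   5         89          25        7921      445
   5         98          25        9604      490
   6         66          36        4356      396
   6         95          36        9025      570
   2        169           4       28561      338
   7         70          49        4900      490
   7         48          49        2304      336
Tot=58    Tot=975     Tot=326   Tot=96129  Tot=4732

3. Substitute in the formula

$$r = \frac{11(4732) - (58)(975)}{\sqrt{\left[11(326) - 58^2\right]\left[11(96129) - 975^2\right]}} = \frac{-4498}{\sqrt{(222)(106794)}} = -0.924$$

4. Calculate the test statistic

$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{-0.924}{\sqrt{\dfrac{1 - (0.924)^2}{11 - 2}}} = -7.249$$

5. Find the critical value from the t distribution table at $\alpha = 0.05$ and degrees of freedom = 11 - 2 = 9 in one tail (left tail)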
t critical = -1.83
6. Decision: the test statistic lies in the rejection region (t = -7.249 < t critical = -1.83), and the p-value = P(t < -7.249) = 0.00002 < 0.05
7. Interpret results
So we reject the null hypothesis: at the 5% significance level, the data provide sufficient evidence to conclude that the age and price of Orions are negatively linearly correlated.
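The by-hand steps can be checked with a short Python sketch that builds the same sums and applies the same formulas for r and t. The function names are illustrative. Because the sketch keeps r unrounded, its t value (about -7.24) differs slightly from the hand result, which used r rounded to -0.924.

```python
import math

# Age and price of the 11 cars in the example table.
age = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]
price = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]


def pearson_r(x, y):
    """Linear correlation coefficient from the computing formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xx = sum(v * v for v in x)
    sum_yy = sum(v * v for v in y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_xx - sum_x ** 2)
                            * (n * sum_yy - sum_y ** 2))
    return numerator / denominator


def correlation_t(r, n):
    """t statistic for testing rho = 0, with n - 2 degrees of freedom."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))


r = pearson_r(age, price)
t = correlation_t(r, len(age))
t_critical = -1.833  # t table, left tail, alpha = 0.05, df = 9

print("r =", round(r, 3))
print("t =", round(t, 3))
print("reject H0" if t < t_critical else "do not reject H0")
```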
Second: by using the SPSS procedure
1. Enter the data and plot a scatter plot
[Figure: scatter plot of price (50 to 150) against age (2 to 7) with the fitted linear regression line; price = 195.47 - 20.26 * age, R-Square = 0.85]
2. We find Pearson correlation by using SPSS as follows
SPSS Outputs
Correlations
                                   AGE       PRICE
AGE     Pearson Correlation          1      -.924**
        Sig. (2-tailed)              .        .000
        N                           11          11
PRICE   Pearson Correlation     -.924**          1
        Sig. (2-tailed)           .000           .
        N                           11          11
**. Correlation is significant at the 0.01 level (2-tailed).
Critical value of t = -1.833
The value of the test statistic falls in the rejection region, and the p-value = 0.0000244 < 0.05, so we reject H0.
Interpret results
At the 5% significance level, the data provide sufficient evidence to conclude that the age and price of Orions are negatively linearly correlated; prices for Orions tend to decrease linearly with increasing age.
Spearman correlation
Spearman's Rank Order Correlation using SPSS
Objectives
The Spearman Rank Order Correlation coefficient, rs, is a non-parametric measure of the strength and direction of association that exists between two variables measured on at least an ordinal scale. It is denoted by the symbol rs (or the Greek letter ρ, pronounced rho). The test is used for either ordinal variables or for interval data that have failed the assumptions necessary for conducting the Pearson's product-moment correlation.
Assumptions
Variables are measured on an ordinal, interval or ratio scale.
Variables need NOT be normally distributed.
This type of correlation is NOT very sensitive to outliers.
Example
A teacher is interested in whether those who do the best at English also do the best in Maths (both assessed by exam). She records the scores of her 10 students as they performed in end-of-year examinations for both English and Maths.
English   56   75   45   71   62   64   58   80   76   61
Maths     66   70   40   60   65   56   59   77   67   63
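Hypothesis:
H0: there is no association between the ranks in English and Maths ($\rho_s = 0$)
Ha: there is an association between the ranks in English and Maths ($\rho_s \neq 0$)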
First, create a table with the columns labelled as below:

English (mark)   Maths (mark)   Rank (English)   Rank (maths)    d    d²
      56              66               9               4         5    25
      75              70               3               2         1     1
      45              40              10              10         0     0
      71              60               4               7         3     9
      62              65               6               5         1     1
      64              56               5               9         4    16
      58              59               8               8         0     0
      80              77               1               1         0     0
      76              67               2               3         1     1
      61              63               7               6         1     1

Where d = difference between ranks (in absolute value) and d² = difference squared.
We then calculate the following:

$$\sum d_i^2 = 25 + 1 + 0 + 9 + 1 + 16 + 0 + 0 + 1 + 1 = 54$$
We then substitute this into the main equation with the other information as follows:

$$r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 54}{10(10^2 - 1)} = 1 - \frac{324}{990} = 0.67$$

as n = 10. Hence, we have rs of 0.67. This indicates a strong positive relationship between the ranks individuals obtained in the maths and English exams. That is, the higher you ranked in maths, the higher you ranked in English also, and vice versa.
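The ranking and the rs formula can be sketched in Python as below. The marks are taken from the worked table (fifth English mark 62), and the helper names are illustrative. The ranking helper averages the ranks of tied marks; note that the shortcut formula with the sum of d² is exact only when there are no ties, as in these marks.

```python
# Marks as listed in the worked table above.
english = [56, 75, 45, 71, 62, 64, 58, 80, 76, 61]
maths = [66, 70, 40, 60, 65, 56, 59, 77, 67, 63]


def average_ranks(values):
    """Rank from highest (rank 1) to lowest; tied values share a mean rank."""
    ranks = []
    for v in values:
        higher = sum(1 for other in values if other > v)
        tied = sum(1 for other in values if other == v)  # includes v itself
        ranks.append(higher + (tied + 1) / 2)
    return ranks


def spearman_rs(x, y):
    """Return (sum of squared rank differences, rs) via the shortcut formula."""
    rank_x = average_ranks(x)
    rank_y = average_ranks(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return sum_d2, 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


sum_d2, rs = spearman_rs(english, maths)
print("ranks (English):", average_ranks(english))
print("ranks (Maths):  ", average_ranks(maths))
print("sum of d squared =", sum_d2)
print("rs =", round(rs, 3))
```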
How do you report a Spearman's correlation?
How you report a Spearman's correlation coefficient depends on whether or not you have determined the statistical significance of the coefficient. If you have simply run the Spearman correlation without any statistical significance tests then you are able to simply state the value of the coefficient as shown below:
rs = 0.67
However, if you have also run statistical significance tests then you need to include some more information as shown below:
The critical value is read from the Spearman rank table at the chosen significance level and N, where N = number of pairwise cases.
Decision
rs calculated (= 0.67) > critical value
So we reject H0
Conclusion:
There is a positive relationship between the ranks individuals obtained in the maths and English exams. That is, the higher you ranked in maths, the higher you ranked in English also, and vice versa.

Note: when the number of pairwise cases is large (> 30), we can use the z distribution, and the test statistic is:

$$z = r_s \sqrt{n - 1}$$

z = 0.67 * sqrt(10 - 1) = 2.01
Zcritical = 1.96 from the z-table
Zcal = 2.01 > Ztab = 1.96, so we reject H0
There is a positive relationship between the ranks individuals obtained in the maths and English exams. That is, the higher you ranked in maths, the higher you ranked in English also, and vice versa.
Test Procedure in SPSS
1. Click Analyze > Correlate > Bivariate... on the menu system as shown below:
Published with written permission from SPSS Inc, an IBM Company.
2. Transfer the variables "English_Mark" and "Maths_Mark" into the "Variables" box by dragging-and-dropping or by clicking the button. You will end up with a screen similar to the one below:
3. Make sure that you uncheck the Pearson tick box (it is selected by default in SPSS) and check the Spearman tick box under the "Correlation Coefficients" group.
4. Click the OK button.
Output SPSS
You will be presented with 3 tables in output viewer under the title "Correlations" as below:
Correlations
                                                      ENGLISH    MATH
Spearman's rho   ENGLISH   Correlation Coefficient     1.000     .673*
                           Sig. (2-tailed)                 .     .033
                           N                              10       10
                 MATH      Correlation Coefficient      .673*   1.000
                           Sig. (2-tailed)              .033        .
                           N                              10       10
*. Correlation is significant at the .05 level (2-tailed).
The results are presented in a matrix such that, as can be seen, the correlations are replicated. Nevertheless, the table presents Spearman's Rank Order Correlation, its significance value and the sample size that the calculation was based on. In this example, we can see that Spearman's correlation coefficient, rs, is 0.673 and that this is statistically significant (P = 0.033).
Reporting the Output
In our example you might present the results as follows: A Spearman's Rank Order correlation was run to determine the relationship between 10 students' English and maths exam marks. There was a strong, positive correlation between English and maths marks, which was statistically significant (rs = 0.673, P = 0.033).
* Regression inference:
Assumptions for regression inferences
1- Population regression line: there are constants $\beta_0$ and $\beta_1$ such that, for each value x of the predictor variable, the conditional mean of the response variable is $\beta_0 + \beta_1 x$
2- Equal standard deviations (homoscedasticity): the conditional standard deviations of the response variable are the same for all values of the predictor variable
3- Normal distributions: for each value of the predictor variable, the conditional distribution of the response variable is a normal distribution
4- Independent observations: the observations of the response variable are independent of one another
Hypothesis test for the slope of the population regression line
Example: For the data in the table below, at the 5% significance level, do the data provide sufficient evidence to conclude that age is useful as a linear predictor of price for Orions?

Car  Age  Price     Car  Age  Price
 1    5     85       7    6     66
 2    4    103       8    6     95
 3    6     70       9    2    169
 4    5     82      10    7     70
 5    5     89      11    7     48
 6    5     98
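Solution:
$H_0: \beta_1 = 0$ (age is not useful as a linear predictor of price for Orions)
$H_a: \beta_1 \neq 0$ (age is useful as a linear predictor of price for Orions)
Age: independent (explanatory) variable
Price: dependent (response) variable

Test statistic

$$t = \frac{b_1}{s_e / \sqrt{S_{xx}}}, \qquad df = n - 2$$

(where $s_e$ is the Std. Error of the Estimate), where

$$s_e = \sqrt{\frac{SSE}{n - 2}}, \qquad S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$$

(SSE = sum of squares of errors). With the slope and its standard error as reported in the SPSS Coefficients table, t = -20.261 / 2.800 = -7.237.

The critical values are $\pm t_{\alpha/2} = \pm t_{0.025} = \pm 2.262$ with df = 11 - 2 = 9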
The value of the test statistic falls in the rejection region, and the p-value = 0.0000488 < 0.05, so we reject H0.
Interpret the result of the hypothesis test: At the 5% significance level, the data provide sufficient evidence to conclude that the slope of the population regression line is not 0 and hence that age is useful as a linear predictor of price for Orions.
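The slope test can be followed step by step in a short Python sketch: fit the least-squares line, compute the standard error of the estimate, and form the t statistic for the slope. The function name is hypothetical; the critical value 2.262 is the t table value for df = 9 and a two-tailed test at the 5% level.

```python
import math

# Age (predictor) and price (response) of the 11 cars in the example.
age = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]
price = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]


def slope_t_test(x, y):
    """Return (b0, b1, s_e, t) for the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((v - mean_x) ** 2 for v in x)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    b1 = s_xy / s_xx                      # slope
    b0 = mean_y - b1 * mean_x             # intercept
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))        # standard error of the estimate
    t = b1 / (s_e / math.sqrt(s_xx))      # test statistic for H0: beta1 = 0
    return b0, b1, s_e, t


b0, b1, s_e, t = slope_t_test(age, price)
t_critical = 2.262  # t table, alpha/2 = 0.025, df = 9

print("b0 =", round(b0, 3), " b1 =", round(b1, 3))
print("s_e =", round(s_e, 3), " t =", round(t, 3))
print("reject H0" if abs(t) > t_critical else "do not reject H0")
```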
SPSS procedure:
Regression
Variables Entered/Removed(b)
Model   Variables Entered   Variables Removed   Method
1       AGE(a)              .                   Enter
a. All requested variables entered.
b. Dependent Variable: PRICE
Model Summary
Model     R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .924(a)     .853           .837                   12.577
a. Predictors: (Constant), AGE
ANOVA(b)
Model            Sum of Squares   df   Mean Square      F       Sig.
1   Regression      8285.014       1    8285.014      52.380   .000(a)
    Residual        1423.532       9     158.170
    Total           9708.545      10
a. Predictors: (Constant), AGE
b. Dependent Variable: PRICE
Coefficients(a)
                  Unstandardized Coefficients   Standardized Coefficients
Model                  B         Std. Error              Beta                  t       Sig.
1   (Constant)      195.468        15.240                                   12.826     .000
    AGE             -20.261         2.800                -.924              -7.237     .000
a. Dependent Variable: PRICE
Determine coefficients
REGRESSION LINE: PRICE = 195.468 - 20.261 * AGE
COEFFICIENT OF DETERMINATION = 0.853
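As a small illustration of how these two results are used, the sketch below recomputes the coefficient of determination from the ANOVA table (regression sum of squares divided by total sum of squares) and evaluates the regression line for a chosen age. The function name and the example age of 5 are assumptions made only for this illustration.

```python
# Values copied from the SPSS ANOVA and Coefficients tables above.
ss_regression = 8285.014
ss_total = 9708.545
intercept = 195.468
slope = -20.261


def predict_price(age):
    """Predicted price from the fitted line PRICE = 195.468 - 20.261 * AGE."""
    return intercept + slope * age


r_squared = ss_regression / ss_total
print("R Square =", round(r_squared, 3))
print("predicted price at age 5 =", round(predict_price(5), 3))
```

Predictions are best limited to ages within the range of the sample data (2 to 7 here).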
EXAMPLE
SOLUTION
[Figure: scatter plot of PRICE (100 to 280) against AGE (0 to 7)]
[Figure: scatter plot of PRICE (100 to 280) against AGE (0 to 7)]
Correlations
                                   AGE       PRICE
AGE     Pearson Correlation          1      -.968**
        Sig. (2-tailed)              .        .000
        N                           10          10
PRICE   Pearson Correlation     -.968**          1
        Sig. (2-tailed)           .000           .
        N                           10          10
**. Correlation is significant at the 0.01 level (2-tailed).
Coefficients(a)
                  Unstandardized Coefficients   Standardized Coefficients
Model                  B         Std. Error              Beta                  t       Sig.
1   (Constant)      291.602        11.433                                   25.506     .000
    AGE             -27.903         2.563                -.968             -10.887     .000
a. Dependent Variable: PRICE
[Figure: scatter plot of PRICE (100 to 280) against AGE (0 to 7)]
Model Summary
Model     R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .968(a)     .937           .929                   14.247
a. Predictors: (Constant), AGE
[Figure: plot of Standardized Residual (-1.5 to 2.0) against AGE (0 to 7)]
[Figure: plot of Unstandardized Residual (-30 to 30) against AGE (0 to 7)]
[Figure: Normal P-P Plot of Regression Standardized Residual; Dependent Variable: PRICE; Expected Cum Prob against Observed Cum Prob, both from 0.00 to 1.00]
Coefficients(a)
                  Unstandardized Coefficients   Standardized Coefficients
Model                  B         Std. Error              Beta                  t       Sig.
1   (Constant)      291.602        11.433                                   25.506     .000
    AGE             -27.903         2.563                -.968             -10.887     .000
a. Dependent Variable: PRICE