www.regenesys.co.za
Brad Bell, March / April 2017
MBA 9
BUSINESS RESEARCH
Session 4a
FREE STATS PACKAGES
• Excel “Analysis Toolpak” (free for Windows)
• Plus “Real Statistics” add-on for Excel (http://www.real-statistics.com)
• Dozens of “how to” tutorials
• Comprehensiveness = “R” (free)
• https://cran.r-project.org/bin/windows/base/
• https://cran.r-project.org/bin/macosx/
• Convenience = “PSPP” (free)
• pspp-0101+20160401-snapshot-64bits-setup
• https://sourceforge.net/projects/pspp4windows/files/2016-04-01/
REAL STATISTICS RESOURCE PACK
• Come to www.real-statistics.com and download their free resource pack
INSTALLATION
• Go to: www.real-statistics.com/free-download/real-statistics-resource-pack/
• Choose the version suitable for BOTH your operating system (e.g. Windows or Mac) AND your version of Excel (e.g. 2016, 2013, 2010, 2007, 2003, etc.)
• Make your “hidden” folders visible
• Save the .xlam file to a suitable folder, e.g. for Windows use C:\Users\user-name\AppData\Roaming\Microsoft\AddIns
• Install “Real Statistics” as an add-in (each time?)
MBA 9
BUSINESS RESEARCH
Class 4a – Section 7.10(a), Page 111
(Quantitative) Data Analysis
MODULE OUTLINE (2 of 2)
7.8 Sampling design
7.9 Planning your data collection
Practical: Data collection instrument
7.10 Data analysis: Quantitative
Practical: Statistics
7.11 Data analysis: Qualitative
Practical: Research proposal alignment table
Research proposal review day (voluntary)
Class 3a
Class 4a
Class 4b
Class 5
Class 3b
7.10.1 INTRODUCTION
You need a “clean” data set:-
• Ensure that there is no missing or incorrect data;
• Organise the data to ensure that it is logical, correctly captured, and that cross-tabulation is possible;
• Ensure that the data is linked to the correct questions;
• Ensure that the data is collected as nominal, ordinal, interval or ratio data; and
• Validate the accuracy and reliability of the captured data against the questionnaires.
7.10.2 DESCRIPTIVE STATISTICS
• For data analysis, you will begin with descriptive statistics
• This may take the form of graphs and descriptive statistical analysis techniques such as:-
• Measures of central tendency, e.g. the mean (average), mode (most common) and the median (midpoint of the data)
• Measures of dispersion, e.g. range (highest to lowest values), and standard deviation (how widely the data is scattered around the mean)
• These measures of central tendency and dispersion describe the “normality” of the data and thus form the basis for further (inferential) quantitative analysis.
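These measures can be cross-checked outside Excel with Python's standard `statistics` module — a minimal sketch, with an invented set of sample scores:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # invented sample scores

mean = statistics.mean(data)          # central tendency: average
median = statistics.median(data)      # central tendency: midpoint of the data
mode = statistics.mode(data)          # central tendency: most common value
value_range = max(data) - min(data)   # dispersion: highest minus lowest
sd = statistics.stdev(data)           # dispersion: sample standard deviation

print(mean, median, mode, value_range, round(sd, 2))
```

The same figures appear in Excel via AVERAGE, MEDIAN, MODE, MAX-MIN and STDEV.S.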
GENERAL INTRODUCTION
• “The whole point of statistics is to quantify uncertainty”
• The basis of most statistical analysis is the “normal” / “Gaussian” / “bell” curve
THE “NORMAL” DISTRIBUTION
7.10.2 DESCRIPTIVE STATISTICAL ANALYSIS
SEM AND SD
• Standard Error (of the Mean)
• Combines the SD with the sample size
• In the units of the data
• Gets smaller as the sample gets bigger
• Standard Deviation
• Measures the amount of “scatter”
• In the units of the data
• Changes unpredictably as more data is included
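The link between the two is SEM = SD / √n, which is why the SEM shrinks as the sample grows while the SD need not. A short sketch with invented data:

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]       # invented sample

sd = statistics.stdev(data)           # amount of scatter, in the data's units
sem = sd / math.sqrt(len(data))       # standard error of the mean

# With the same scatter but 4x the sample size, the SEM halves
sem_bigger_sample = sd / math.sqrt(4 * len(data))
```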
SKEWNESS AND KURTOSIS
Skewness
(mean shifts left or right)
Kurtosis
(peak is higher or flatter)
“NORMALITY”
There are two common but different calculations for kurtosis:-
1. μ₄/σ⁴, which gives a result of “3” for a normal distribution; and
2. κ₄/κ₂², which gives “0” for a normal distribution.
The software used to compute this statistic, MS Excel, uses the latter. Thus values for skewness and kurtosis between -1.96 and +1.96 are considered acceptable as evidence of normal univariate distribution (George & Mallery, 2010; Gravetter & Wallnau, 2014).
• George, D., & Mallery, M. (2010). SPSS for Windows Step by Step: A Simple Guide and Reference (10th ed.). Boston, MA: Pearson.
• Gravetter, F., & Wallnau, L. (2014). Essentials of Statistics for the Behavioral Sciences (8th ed.). Belmont, CA: Wadsworth.
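The two conventions differ by exactly 3. A rough sketch using population moments (note: Excel's KURT additionally applies small-sample bias corrections, so its value will differ slightly on small data sets; the data here is invented):

```python
data = [1, 2, 3, 4, 5]   # invented sample
n = len(data)
mean = sum(data) / n

m2 = sum((x - mean) ** 2 for x in data) / n   # 2nd central moment (variance)
m4 = sum((x - mean) ** 4 for x in data) / n   # 4th central moment

kurt_mu = m4 / m2 ** 2     # convention 1: gives "3" for a normal distribution
kurt_excess = kurt_mu - 3  # convention 2 (excess kurtosis): "0" for normal
```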
CONFIDENCE LEVEL (OF THE MEAN)
• We can be (95%) confident that:-
• the true mean of the whole population
• is within plus or minus (X, in the same units as the data) of
• the calculated mean of this sample
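A minimal large-sample sketch of this interval, using the normal z value of 1.96 rather than the t value Excel's descriptive tools would use for small samples (data invented):

```python
import math
import statistics

data = [4, 5, 6, 7, 8]   # invented sample
mean = statistics.mean(data)
sem = statistics.stdev(data) / math.sqrt(len(data))

margin = 1.96 * sem      # the "plus or minus X", in the units of the data
ci_low, ci_high = mean - margin, mean + margin
```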
7.10.3 INFERENTIAL STATISTICS
• We are looking for differences
• We should usually assume that there are no differences
• This is the “null hypothesis” (H₀)
• Then we can guess that there might possibly be a difference
• This is the (first) “alternative hypothesis” (Hₐ)
CONFIDENCE LEVEL (“p”)
• How confident do we need to be before we can reject the null hypothesis?
• Assumption: Up to 95% of the time, the appearance of any difference can be explained by chance
• We must reach the final 5% on the confidence scale to be able to reject the null hypothesis
• This final 5% (percent) = 0.05 (decimal fraction)
• So the confidence level (often “p”) must be equal to or less than 0.05
• Especially using the “2-tailed” test
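The 5% two-tailed cut-off is also where the familiar ±1.96 comes from: 2.5% is left in each tail of the normal curve. A quick check with Python's standard library:

```python
from statistics import NormalDist

# Two-tailed 5%: 2.5% in each tail, so look up the 97.5th percentile
z_crit = NormalDist().inv_cdf(0.975)

print(round(z_crit, 2))  # → 1.96
```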
STATISTICAL TESTS
• We will look at 3 basic statistical families:-
(Categorical) data in a table
• E.g. What is your family situation? How much stress do you experience?
• Chi-squared test of independence
Differences between group averages
• E.g. Productivity scores of workers in traditional v open plan offices?
• T-test and ANOVA
Similarities between characteristics
• E.g. Emotional intelligence and rank seniority
• Cronbach Alpha, Correlation and Multiple Regression
1ST “FAMILY”
• Categorical data
“CHI SQUARED TEST OF INDEPENDENCE”
• Used to measure the distribution of “categorical data” and test the null hypothesis that it is equally distributed, i.e. there are no patterns of relationship
• You need your data in a table, with one category down the side and the other category along the top
• Note: The Greek letter χ is often represented by “chi” or occasionally “x”
HOW TO FORMAT YOUR DATA (TABLE)
• Your “actual” data must be in a table, with XYZ down the side and ABC categories along the top, etc.
HOW TO DO IT
• Add-ins → Real Statistics → Data Analysis Tools → Misc → Chi-square test of independence
HOW TO INTERPRET IT (a)
• It will give you a table of “expected values” based on the assumption that everything is equal (ignore it)
HOW TO INTERPRET IT (b)
• Focus on Pearson’s
1. Is your “chi-sq” bigger than “x-crit”? (remember “chi” = “x”)
2. Or: Is your “p-value” smaller than 0.05?
3. Or: Does “sig” say “yes”?
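The test behind that output can be sketched in plain Python: build the expected counts from the row and column totals, sum (observed − expected)²/expected, and compare with the critical value (3.841 for a 2×2 table at p = 0.05). The counts below are invented:

```python
# Invented 2x2 table: rows = family situation, columns = stress level
observed = [[20, 30],
            [30, 20]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_sq += (o - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)  # (rows-1) * (cols-1)
CHI_CRIT = 3.841                                   # critical value, df=1, p=0.05

significant = chi_sq > CHI_CRIT
```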
SO WHAT?
• So now you have proven that the differences cannot be explained in terms of chance alone
• In other words, there is a statistically significant relationship between the categories of X and Y
• Note: You have NOT proven that change in one category CAUSES change in the other category
• You now need to hypothesise possible explanatory factors contributing to this relationship
2ND FAMILY (a)
• Interval / ratio data
• Differences between group averages
• T-Test and Anova
• If only 2 groups …
T-TEST
• T-Test is for testing differences between group averages
• It can be the same group measured twice (paired t-test) or two different groups
HOW TO FORMAT YOUR DATA (COLUMNS)
HOW TO DO IT
• Add-ins → Real Statistics → Data Analysis Tools → Misc → T-Test
HOW TO INTERPRET IT
Focus on Two Tail
1. Is your “t” (ignore + or -) bigger than “t-crit”?
2. Or: Is your “p-value” smaller than 0.05?
3. Or: Does “sig” say “yes”?
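For two independent groups, the t statistic can be reproduced with the pooled-variance formula (a sketch with invented scores; the paired version is calculated differently):

```python
import math
import statistics

group_a = [5, 6, 7, 8, 9]   # invented scores, e.g. open plan office
group_b = [1, 2, 3, 4, 5]   # invented scores, e.g. traditional office

na, nb = len(group_a), len(group_b)
var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)

# Pooled variance, then the standard error of the difference in means
pooled = ((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)
se = math.sqrt(pooled * (1 / na + 1 / nb))

t = (statistics.mean(group_a) - statistics.mean(group_b)) / se
df = na + nb - 2
T_CRIT = 2.306              # two-tailed critical value for df = 8, p = 0.05
significant = abs(t) > T_CRIT
```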
SO WHAT?
• So now you have proven that the difference in average between the two groups cannot be explained in terms of chance alone
• In other words, there is a statistically significant difference between the averages of groups A and B
• Notes: If this was an EXPERIMENT, in which you controlled all other relevant variables, then you have proven that the intervention applied to the experimental group has caused the change in average
• If this was NOT an experiment, then you have not proven causation, only the existence of a significant difference
• You now need to hypothesise possible explanatory factors contributing to this difference
2ND FAMILY (b)
• Interval / ratio data
• Differences between group averages
• T-Test and Anova
• If 3 or more groups …
ANOVA
• This is like multiple t-tests at one time
• Comparing the means (averages) and standard deviations between several groups
• Used to test the null hypothesis that there is no real difference between the groups’ averages
HOW TO FORMAT YOUR DATA (COLUMNS)
• Factor along the top; respondents’ scores coming down
HOW TO DO IT
• Add-ins → Real Statistics → Data Analysis Tools → Anova → One factor Anova (one factor for now)
HOW TO INTERPRET IT
• Is your “F” bigger than “F crit”?
• Or is your “P-value” smaller than 0.05? (remember “E” = exponent, used for very tiny numbers when negative)
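The F ratio itself is just between-group variance over within-group variance; a sketch in plain Python with three invented groups:

```python
groups = [[1, 2, 3], [2, 3, 4], [7, 8, 9]]   # invented scores for 3 groups

all_scores = [x for g in groups for x in g]
grand_mean = sum(all_scores) / len(all_scores)

# Between-group sum of squares: how far each group mean sits from the grand mean
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: scatter inside each group
ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)

df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)

f = (ss_between / df_between) / (ss_within / df_within)
F_CRIT = 5.143   # critical value for df = (2, 6), p = 0.05
significant = f > F_CRIT
```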
SO WHAT?
• So now you have proven that the difference in averages between the groups cannot be explained in terms of chance alone
• In other words, there are statistically significant differences between the averages of the groups
• Notes: If this was NOT an experiment, then you have not proven causation, only the existence of a significant difference
• You now need to hypothesise possible explanatory factors contributing to these differences
3RD FAMILY (a)
• Interval / ratio data
• Similarities between characteristics
• Cronbach Alpha, Correlation and Multiple Regression
• If testing internal consistency of a questionnaire …
CRONBACH’S ALPHA
• This tests the internal consistency between respondents’ answers to a number of questions purporting to be focusing on the same issue (a “multi-item summated measure”)
HOW TO FORMAT YOUR DATA (COLUMNS)
• Questions along the top; respondents’ answers coming down
HOW TO DO IT
• Add-ins → Real Statistics → Data Analysis Tools → Misc → Reliability Testing
HOW TO INTERPRET IT
• Is your “Alpha” between 0.60 – 0.90?
• If not, does your Alpha fall between 0.60 – 0.90 if you exclude one question?
REFERENCE
• The value for Cronbach’s alpha that indicates an acceptable level of internal consistency between multiple items in a summated measure should be at least 0.60 (Sekaran, 2006: 311), although the slightly higher alpha of 0.65 is commonly used. The value should ideally not exceed 0.90 (Streiner, 2003); if it does, it suggests redundancies are included (i.e. different questions measuring exactly the same thing twice).
• Sekaran, U. (2006). Research methods for business: A skill building approach. New York: John Wiley & Sons.
• Streiner, D. (2003). Starting at the beginning: An introduction to coefficient alpha and internal consistency. Journal of Personality Assessment, 80, 99–103.
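Cronbach's alpha is simple enough to compute directly: α = k/(k−1) × (1 − Σ item variances / variance of the totals). A sketch with invented answers (3 questions, 5 respondents):

```python
import statistics

# Invented answers: one row per respondent, one column per question
answers = [
    [3, 3, 2],
    [4, 5, 4],
    [4, 4, 5],
    [5, 4, 5],
    [2, 2, 3],
]

k = len(answers[0])                                    # number of items
item_vars = [statistics.variance(col) for col in zip(*answers)]
total_var = statistics.variance([sum(row) for row in answers])

alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
```

Here the three questions move together across respondents, so alpha lands inside the acceptable 0.60 – 0.90 band.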
SO WHAT?
• So now you have proven that the various questions in that section of your questionnaire are consistently measuring the same thing
• You have to recalculate Cronbach’s Alpha for each different section of your questionnaire that focuses on different issues (see “content validity”)
• Notes:
• Reliability = how consistently you could repeat the data collection and obtain similar results
• Cronbach’s Alpha establishes prima facie “reliability”, but has no bearing on validity*
VALIDITY
Validity is established by confirming:-
• “Construct” validity = are you actually measuring what you claim to be measuring? (Cronbach & Meehl, 1955)
• – Panel of experts agrees / or your instrument is adapted from an existing, validated measure
• Cronbach, L. J. and Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281–302.
• “Content” validity = does the range of sections within the questionnaire systematically cover all areas / dimensions of [the domain] being measured? (Anastasi & Urbina, 1997: 114)
• – Established by comparing theory (e.g. “EQ has four dimensions”) to questionnaire structure
• Anastasi, A. and Urbina, S. (1997). Psychological testing (7th ed.). Upper Saddle River, NJ: Prentice Hall.
3RD FAMILY (b)
• Interval / ratio data
• Similarities between characteristics
• Cronbach Alpha, Correlation and Multiple Regression
• If testing relationship between 2 variables …
CORRELATION
• Correlation is used to measure whether one characteristic (variable) of a group of people is found in high or low proportions in relation to another characteristic (variable)
HOW TO FORMAT YOUR DATA (COLUMNS)
• Characteristics (variables) along the top; respondents’ scores coming down
HOW TO DO IT
• Add-ins → Real Statistics → Data Analysis Tools → Misc → Correlation Tests
HOW TO INTERPRET IT
• Is your “Pearson Corr” close to -1 or +1?
• Is your “P-value” smaller than 0.05?
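Pearson's r can be computed by hand as the covariance over the product of the spreads — a sketch with invented paired scores:

```python
import math

x = [1, 2, 3, 4, 5]   # invented: e.g. rank seniority
y = [2, 4, 5, 4, 5]   # invented: e.g. emotional intelligence score

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
ss_x = sum((a - mean_x) ** 2 for a in x)
ss_y = sum((b - mean_y) ** 2 for b in y)

r = cov / math.sqrt(ss_x * ss_y)   # always between -1 and +1
```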
VISUAL RELATIONSHIP (“scatter plot with trend line” in Excel)
SO WHAT?
• So now you have proven that the apparent association between the degree of presence or absence of one characteristic (variable) and the degree of presence or absence of another characteristic (variable) cannot be explained in terms of chance alone
• In other words, there is a statistically significant relationship between the two characteristics (variables)
• Notes: You have not proven causation, only the existence of a significant relationship
• You now need to hypothesise possible explanatory factors contributing to this relationship
3RD FAMILY (c)
• Interval / ratio data
• Similarities between characteristics
• Cronbach Alpha, Correlation and Multiple Regression
• If testing relationship between 3 or more variables …
MULTIPLE REGRESSION
• This is performing multiple correlations at one time
• Comparing the strength and direction of relationships between more than 2 variables at one time
HOW TO FORMAT YOUR DATA (COLUMNS)
• Characteristics (variables) along the top; respondents’ scores coming down
HOW TO DO IT
• Add-ins → Real Statistics → Data Analysis Tools → Reg → Multiple Linear Regression
• Use “Input Range X” for all your independent variables (that act on the dependent variable)
• Use “Input Range Y” for your one and only dependent variable (that is acted upon by a number of independent variables)
HOW TO INTERPRET IT (a)
• “R Square” = the “percent” (e.g. 0.92 = 92%) of the changes in the dependent variable that can be explained by changes in the combination of all independent variables
HOW TO INTERPRET IT (b)
1. Is your “p-value” smaller than 0.05? (remember “E” = exponent used for very tiny numbers when negative)
2. Or: Does “sig” say “yes”?
HOW TO INTERPRET IT (c)
• “P-values” if you want to narrow down many variables (and keep the most significant)
• “Coeff’s” if you want to derive a weighted formula (e.g. if you change one unit of this independent variable, then how much will it change the dependent variable?)
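What the coefficients mean can be seen in a tiny two-predictor sketch. Here Y is built exactly as 2·X1 + 3·X2 (no intercept term, data invented), and solving the normal equations recovers those weights — i.e. one unit of X1 moves Y by 2 units, holding X2 fixed:

```python
x1 = [1, 2, 3, 4]
x2 = [2, 1, 4, 3]
y = [2 * a + 3 * b for a, b in zip(x1, x2)]   # invented: Y = 2*X1 + 3*X2

# Normal equations for two predictors (no intercept), solved by Cramer's rule
s11 = sum(a * a for a in x1)
s22 = sum(b * b for b in x2)
s12 = sum(a * b for a, b in zip(x1, x2))
s1y = sum(a * c for a, c in zip(x1, y))
s2y = sum(b * c for b, c in zip(x2, y))

det = s11 * s22 - s12 * s12
coef1 = (s1y * s22 - s12 * s2y) / det    # weight on X1
coef2 = (s11 * s2y - s12 * s1y) / det    # weight on X2
```

Real data would not be an exact fit, and Excel's regression tool also estimates an intercept, but the interpretation of each coefficient is the same.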
SO WHAT?
• So now you have proven that the apparent association between the degree of presence or absence of a combination of characteristics (variables) and the degree of presence or absence of another characteristic (variable) cannot be explained in terms of chance alone
• In other words, there are statistically significant relationships between the characteristics (variables)
• Notes: You have not proven causation, only the existence of a significant relationship with implied causation
• You now need to hypothesise possible explanatory factors contributing to this relationship
EXAMPLE (JANET GENIS, 2016)
• A multiple regression analysis was performed applying overall pass rate as the criterion, and fifteen factors as predictors to determine if overall pass rate scores could be predicted as a function of general scores of these factors. These predictor factors are: school management and leadership (SML), curriculum management (CM), community and parental involvement (CPI), teaching skills (TS), safety and security (SS), discipline (D), welfare (W), extracurricular activities (EC), building structure (BS), classroom resources (CR), electricity (E), water supply (WS), ablutions (A), outdoor facilities (OF) and fencing (F).
• The analysis was found to be statistically significant, F(15,181) = 460.0500239, significance F = 8.7225E-135, indicating that these factors are good predictors of overall pass rate. This multiple regression accounted for 96.69% of the variability, as indexed by the adjusted R² statistic.
• Individual 𝑝-values for each of these factors indicate five factors that have an individual 𝒑-value < 0.05. This corresponds to the 95% confidence level – the lower the 𝑝-value, the stronger the significance. These factors are school management and leadership (SML), curriculum management (CM), community and parental involvement (CPI), classroom resources (CR) and outdoor facilities (OF).
FULL MULTIPLE REGRESSION
• A stepped multiple regression analysis was performed with only these five factors.
• Again the analysis was found to be statistically significant, F(5,191) = 1350.262733, significance F = 2.3889E-146, indicating that these factors are very good predictors of overall pass rate.
• This multiple regression accounted for 96.67% of the variability, as indexed by the adjusted R² statistic. The significance increased greatly while the variability changed very little. This time each of these five factors had an individual 𝑝-value < 0.01.
• Overall pass rate could thus be predicted as a function of the general score of these five factors.
STEPPED REGRESSION / REGRESSION RE-RUN
REGRESSION EQUATION
• The regression equation was found to be: Y = (5.99*SML) + (4.04*CPI) + (3.19*OF) + (2.67*CM) + (1.36*CR). Thus, for a unit change in each of these factors we can predict a 17.25 unit change in Y (pass rate).
• This means that there is a strong multiplier effect amongst these five factors; for example, for every unit change within school management and leadership, an improvement of 5.99 units can be expected in the pass rate. Therefore, when combining every unit change in each of these five factors, a change of 17.25 units in the pass rate can be predicted.
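The 17.25 figure is simply the sum of the five coefficients (one unit added to every factor at once), which can be checked quickly:

```python
# Coefficients from the regression equation above
coefficients = {"SML": 5.99, "CPI": 4.04, "OF": 3.19, "CM": 2.67, "CR": 1.36}

# A one-unit increase in every factor predicts the summed change in pass rate
combined_effect = sum(coefficients.values())

print(round(combined_effect, 2))  # → 17.25
```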
https://www.youtube.com/watch?v=ruliuan0u3w
after this video …
End of the unit
*PRACTICAL 1 of 1: RESEARCH PROPOSAL ALIGNMENT TABLE
Columns: Title / Ch 1 / Ch 2 / Ch 3
Rows:
Research Questions
Key Literature
Sample
Data Collection Instrument
Method of Data Analysis
END OF THE UNIT
Congratulations! Nice going … !!