
www.regenesys.co.za

Brad Bell, March / April 2017

MBA 9

BUSINESS RESEARCH

Session 4a

FREE STATS PACKAGES

• Excel “Analysis ToolPak” (free for Windows)
• Plus “Real Statistics” add-on for Excel (http://www.real-statistics.com)
• Dozens of “how to” tutorials

• Comprehensiveness = “R” (free)
• https://cran.r-project.org/bin/windows/base/
• https://cran.r-project.org/bin/macosx/

• Convenience = “PSPP” (free)
• pspp-0101+20160401-snapshot-64bits-setup
• https://sourceforge.net/projects/pspp4windows/files/2016-04-01/

REAL STATISTICS RESOURCE PACK

• Go to www.real-statistics.com and download their free resource pack

INSTALLATION

• Go to: www.real-statistics.com/free-download/real-statistics-resource-pack/

• Choose the version suitable for BOTH your operating system (e.g. Windows or Mac) AND your version of Excel (e.g. 2016, 2013, 2010, 2007, 2003, etc.)

• Make your “hidden” folders visible
• Save the .xlam file to a suitable folder, e.g. for Windows use C:\Users\user-name\AppData\Roaming\Microsoft\AddIns
• Install “Real Statistics” as an Excel add-in (you may need to re-enable it each time you start Excel)

MBA 9

BUSINESS RESEARCH

Class 4a – Section 7.10(a), Page 111

(Quantitative) Data Analysis

MODULE OUTLINE (2 of 2)

7.8 Sampling design

7.9 Planning your data collection

Practical: Data collection instrument

7.10 Data analysis: Quantitative

Practical: Statistics

7.10 Data analysis: Qualitative

Practical: Research proposal alignment table

Research proposal review day (voluntary)

Class 3a

Class 4a

Class 4bClass 5

Class 3b

7.10.1 INTRODUCTION

You need a “clean” data set:
• Ensure that there is no missing or incorrect data;
• Organise the data so that it is logical, correctly captured and cross-tabulation is possible;
• Ensure that the data is linked to the correct questions;
• Ensure that the data is collected as nominal, ordinal, interval or ratio data; and
• Validate the accuracy and reliability of the captured data against the questionnaires.

7.10.2 DESCRIPTIVE STATISTICS

• For data analysis, you will begin with descriptive statistics
• This may take the form of graphs and descriptive statistical analysis techniques such as:
• Measures of central tendency, e.g. the mean (average), mode (most common value) and median (midpoint of the data)
• Measures of dispersion, e.g. range (highest to lowest value) and standard deviation (how widely the data are scattered around the mean)
• These measures of central tendency and dispersion describe the “normality” of the data and thus form the basis for further (inferential) quantitative analysis.
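The measures above can be sketched in a few lines using Python's standard `statistics` module; the survey scores here are invented purely for illustration:

```python
import statistics

# hypothetical survey scores (invented for illustration)
scores = [12, 15, 15, 18, 20, 22, 25]

mean = statistics.mean(scores)            # average: sum / count
median = statistics.median(scores)        # midpoint of the sorted data
mode = statistics.mode(scores)            # most common value
data_range = max(scores) - min(scores)    # highest value minus lowest value
sd = statistics.stdev(scores)             # sample standard deviation (scatter)

print(mean, median, mode, data_range)
```

Excel’s AVERAGE, MEDIAN, MODE and STDEV.S functions compute the same quantities.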

GENERAL INTRODUCTION

• “The whole point of statistics is to quantify uncertainty”
• The basis of most statistical analysis is the “normal” / “Gaussian” / “bell” curve

THE “NORMAL” DISTRIBUTION

7.10.1 DESCRIPTIVE STATISTICAL ANALYSIS

SEM AND SD

• Standard Error (of the Mean)
• Combines the SD with the sample size
• In the units of the data
• Gets smaller as the sample size gets bigger

• Standard Deviation
• Measures the amount of “scatter” around the mean
• In the units of the data
• Changes unpredictably as more data are included
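The relationship between the two is SEM = SD / √n, so the SEM shrinks as the sample grows while the SD does not; a minimal sketch with invented measurements:

```python
import math
import statistics

# hypothetical measurements (invented for illustration)
data = [4.1, 4.8, 5.0, 5.2, 5.9, 6.0, 6.3, 6.7]

sd = statistics.stdev(data)        # scatter, in the units of the data
sem = sd / math.sqrt(len(data))    # standard error of the mean

# with more (similar) data the SD stays about the same, but the SEM shrinks
sem_bigger_sample = statistics.stdev(data * 2) / math.sqrt(len(data * 2))
```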

SKEWNESS AND KURTOSIS

Skewness

(mean shifts left or right)

Kurtosis

(peak is higher or flatter)

“NORMALITY”

There are two common but different calculations for kurtosis:
1. μ₄/σ⁴, which gives a result of “3” for a normal distribution; and
2. κ₄/κ₂², which gives “0” for a normal distribution.

The software used to compute this statistic, MS Excel, uses the latter. Thus values for skewness and kurtosis between -1.96 and +1.96 are considered acceptable in order to prove normal univariate distribution (George & Mallery, 2010; Gravetter & Wallnau, 2014).

• George, D., & Mallery, M. (2010). SPSS for Windows Step by Step: A Simple Guide and Reference (10th ed.). Boston, MA: Pearson.

• Gravetter, F., & Wallnau, L. (2014). Essentials of Statistics for the Behavioral Sciences (8th ed.). Belmont, CA: Wadsworth.
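These statistics can be sketched directly from the raw moments (skewness = μ₃/σ³, excess kurtosis = μ₄/σ⁴ − 3); note that Excel’s SKEW and KURT functions apply small-sample corrections, so results will differ slightly on small data sets:

```python
def skewness(xs):
    """Moment-based skewness: 0 for a perfectly symmetric data set."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def excess_kurtosis(xs):
    """Moment-based kurtosis minus 3: roughly 0 for normally distributed data."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    return m4 / m2 ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
print(skewness(symmetric))         # symmetric data -> skewness 0
print(excess_kurtosis(symmetric))  # flatter than a normal curve -> negative
```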

CONFIDENCE LEVEL (OF THE MEAN)

• We can be (95%) confident that the true mean of the whole population is within plus or minus X (in the same units as the data) of the calculated mean of this sample
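As a sketch of this idea (using the large-sample z value of 1.96 for 95% confidence; small samples would use a t value instead), with invented data:

```python
import math
import statistics

# hypothetical sample (invented for illustration)
data = [52, 55, 48, 60, 51, 57, 49, 54, 53, 56]

mean = statistics.mean(data)
sem = statistics.stdev(data) / math.sqrt(len(data))

# 95% confidence interval for the population mean (large-sample approximation)
lower = mean - 1.96 * sem
upper = mean + 1.96 * sem
print(f"95% confident the true mean lies between {lower:.2f} and {upper:.2f}")
```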

7.10.3 INFERENTIAL STATISTICS

• We are looking for differences

• We should usually assume that there are no differences• This is the “null hypothesis” (Ho)

• Then we can guess that there might possibly be a difference

• This is the (first) “alternative hypothesis” (Ha)

CONFIDENCE LEVEL (“p”)

• How confident do we need to be before we can reject the null hypothesis?

• Assumption: up to 95% of the time, the appearance of any difference can be explained by chance

• We must reach the final 5% on the confidence scale to be able to reject the null hypothesis

• This final 5% (percent) = 0.05 (decimal fraction)
• So the probability value (often “p”) must be equal to or less than 0.05
• Especially using the “2-tailed” test

STATISTICAL TESTS

• We will look at 3 basic statistical families:
1. (Categorical) data in a table
• E.g. What is your family situation? How much stress do you experience?
• Chi-squared test of independence
2. Differences between group averages
• E.g. Productivity scores of workers in traditional v open plan offices?
• T-test and ANOVA
3. Similarities between characteristics
• E.g. Emotional intelligence and rank seniority
• Cronbach’s Alpha, Correlation and Multiple Regression

1ST “FAMILY”

• Categorical data
“CHI-SQUARED TEST OF INDEPENDENCE”

• Used to measure the distribution of “categorical data” and test the null hypothesis that it is equally distributed, i.e. there are no patterns of relationship

• You need your data in a table, with one category down the side and the other category along the top

• Note: The Greek letter χ is often represented by “chi” or occasionally “x”

HOW TO FORMAT YOUR DATA (TABLE)

• Your “actual” data must be in a table, with XYZ down the side and ABC categories along the top, etc.

HOW TO DO IT

• Add-ins → Real Statistics → Data Analysis Tools → Misc → Chi-square test of independence

HOW TO INTERPRET IT (a)

• It will give you a table of “expected values” based on the assumption that everything is equal (ignore it)

HOW TO INTERPRET IT (b)

• Focus on Pearson’s:
1. Is your “chi-sq” bigger than “x-crit”? (remember “chi” = “x”)
2. Or: Is your “p-value” smaller than 0.05?
3. Or: Does “sig” say “yes”?
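The same test can be sketched outside Excel; a minimal chi-squared test of independence on an invented 2×2 table of counts, using the standard critical value 3.841 for 1 degree of freedom at p = 0.05:

```python
# hypothetical counts: rows = family situation, columns = stress level
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# each cell's expected value assumes the two categories are independent
chi_sq = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_sq += (obs - expected) ** 2 / expected

CHI_CRIT = 3.841  # df = (rows - 1) * (cols - 1) = 1, at p = 0.05
print(chi_sq, chi_sq > CHI_CRIT)  # significant if chi_sq > CHI_CRIT
```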

SO WHAT?

• So now you have proven that the differences cannot be explained in terms of chance alone

• In other words, there is a statistically significant relationship between the categories of X and Y

• Note: You have NOT proven that change in one category CAUSES change in the other category

• You now need to hypothesise possible explanatory factors contributing to this relationship

2ND FAMILY (a)

• Interval / ratio data• Differences between group averages• T-Test and Anova• If only 2 groups …

T-TEST

• T-Test is for testing differences between group averages
• It can be the same group measured twice (paired t-test) or two different groups

HOW TO FORMAT YOUR DATA (COLUMNS)

HOW TO DO IT

• Add-ins → Real Statistics → Data Analysis Tools → Misc → T-Test

HOW TO INTERPRET IT

Focus on Two Tail:
1. Is your “t” (ignore + or -) bigger than “t-crit”?
2. Or: Is your “p-value” smaller than 0.05?
3. Or: Does “sig” say “yes”?
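A pooled two-sample t-test can be sketched from first principles; the productivity scores are invented, and 2.101 is the standard two-tailed critical value for 18 degrees of freedom at p = 0.05:

```python
import math
import statistics

# hypothetical productivity scores (invented for illustration)
traditional = [62, 65, 58, 70, 63, 66, 60, 64, 61, 67]
open_plan = [55, 59, 52, 61, 54, 57, 50, 58, 53, 56]

n1, n2 = len(traditional), len(open_plan)
mean1, mean2 = statistics.mean(traditional), statistics.mean(open_plan)
var1, var2 = statistics.variance(traditional), statistics.variance(open_plan)

# pooled variance (assumes roughly equal variances), df = n1 + n2 - 2 = 18
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
t = (mean1 - mean2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

T_CRIT = 2.101  # two-tailed critical value for df = 18 at p = 0.05
print(abs(t) > T_CRIT)  # significant difference between the group averages?
```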

SO WHAT?

• So now you have proven that the difference in average between the two groups cannot be explained in terms of chance alone

• In other words, there is a statistically significant difference between the averages of groups A and B

• Notes: If this was an EXPERIMENT, in which you controlled all other relevant variables, then you have proven that the intervention applied to the experimental group has caused the change in average

• If this was NOT an experiment, then you have not proven causation, only the existence of a significant difference

• You now need to hypothesise possible explanatory factors contributing to this difference

2ND FAMILY (b)

• Interval / ratio data• Differences between group averages• T-Test and Anova• If 3 or more groups …

ANOVA

• This is like running multiple t-tests at one time
• Comparing the means (averages) and variances between several groups
• Used to test the null hypothesis that there is no real difference between the groups’ averages

HOW TO FORMAT YOUR DATA (COLUMNS)

• Factor along the top; respondent’s scores coming down

HOW TO DO IT

• Add-ins → Real Statistics → Data Analysis Tools → Anova→ One factor Anova (one factor for now)

HOW TO INTERPRET IT

• Is your “F” bigger than “F crit”?
• Or is your “P-value” smaller than 0.05? (remember “E” = exponent, used for very tiny numbers when negative)
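The F statistic compares the variation between group averages to the variation inside the groups; a sketch with three invented groups, using the standard critical value 3.885 for df = (2, 12) at p = 0.05:

```python
import statistics

# hypothetical scores for three groups (invented for illustration)
groups = [
    [60, 62, 61, 64, 63],
    [55, 57, 54, 58, 56],
    [50, 52, 49, 53, 51],
]

k = len(groups)                      # number of groups
n = sum(len(g) for g in groups)      # total observations
grand_mean = sum(sum(g) for g in groups) / n

# variation of group means around the grand mean vs variation inside groups
ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
ss_within = sum(sum((x - statistics.mean(g)) ** 2 for x in g) for g in groups)

f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))

F_CRIT = 3.885  # critical value for df = (2, 12) at p = 0.05
print(f_stat > F_CRIT)  # does at least one group average differ significantly?
```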

SO WHAT?

• So now you have proven that the difference in averages between the groups cannot be explained in terms of chance alone

• In other words, there are statistically significant differences between the averages of the groups

• Notes: If this was NOT an experiment, then you have not proven causation, only the existence of a significant difference

• You now need to hypothesise possible explanatory factors contributing to these differences

3RD FAMILY (a)

• Interval / ratio data• Similarities between characteristics• Cronbach Alpha, Correlation and Multiple Regression• If testing internal consistency of a questionnaire …

CRONBACH’S ALPHA

• This tests the internal consistency between respondents’ answers to a number of questions purporting to be focusing on the same issue (a “multi-item summated measure”)

HOW TO FORMAT YOUR DATA (COLUMNS)

• Questions along the top; respondent’s answers coming down

HOW TO DO IT

• Add-ins → Real Statistics → Data Analysis Tools → Misc → Reliability Testing

HOW TO INTERPRET IT

• Is your “Alpha” between 0.60 and 0.90?
• If not, does your Alpha move into the 0.60 – 0.90 range if you exclude one question?
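Cronbach’s alpha is computed as α = k/(k−1) × (1 − Σ item variances / variance of total scores); a minimal sketch on an invented table of questionnaire answers:

```python
import statistics

# hypothetical answers: rows = respondents, columns = 4 questions on one issue
answers = [
    [4, 4, 5, 4],
    [3, 3, 3, 4],
    [5, 4, 5, 5],
    [2, 3, 2, 2],
    [4, 5, 4, 4],
    [3, 2, 3, 3],
]

k = len(answers[0])  # number of items (questions)
item_variances = [statistics.variance(col) for col in zip(*answers)]
total_variance = statistics.variance([sum(row) for row in answers])

# alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
alpha = (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
print(round(alpha, 2))
```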

REFERENCE

• The value for Cronbach’s alpha that indicates an acceptable level of internal consistency between multiple items in a summated measure should be at least 0.60 (Sekaran, 2006: 311), although the slightly higher alpha of 0.65 is commonly used. The value should ideally not exceed 0.90 (Streiner, 2003); if it does, it suggests redundancies are included (i.e. different questions measuring exactly the same thing twice).

• Sekaran, U. (2006). Research Methods for Business: A Skill Building Approach. New York: John Wiley & Sons.

• Streiner, D. (2003). Starting at the beginning: An introduction to coefficient alpha and internal consistency. Journal of Personality Assessment, 80, 99–103.

SO WHAT?

• So now you have proven that the various questions in that section of your questionnaire are consistently measuring the same thing

• You have to recalculate Cronbach’s Alpha for each different section of your questionnaire that focuses on different issues (see “content validity”)

• Notes:
• Reliability = how consistently you could repeat the data collection and obtain similar results
• Cronbach’s Alpha establishes prima facie “reliability”, but has no bearing on validity

VALIDITY

Validity is established by confirming:
• “Construct” validity = are you actually measuring what you claim to be measuring? (Cronbach & Meehl, 1955)
– Panel of experts agrees / or your instrument is adapted from an existing, validated measure

• Cronbach, L. J. and Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281–302.

• “Content” validity = does the range of sections within the questionnaire systematically cover all areas / dimensions of [the domain] being measured? (Anastasi & Urbina, 1997: 114)

• – Established by comparing theory (e.g. “EQ has four dimensions”) to questionnaire structure

• Anastasi, A. and Urbina, S. (1997). Psychological testing (7th ed). Upper Saddle River, NJ: Prentice Hall.

3RD FAMILY (b)

• Interval / ratio data• Similarities between characteristics• Cronbach Alpha, Correlation and Multiple Regression• If testing relationship between 2 variables …

CORRELATION

• Correlation is used to measure whether one characteristic (variable) of a group of people is found in high or low proportions in relation to another characteristic (variable)

HOW TO FORMAT YOUR DATA (COLUMNS)

• Characteristics (variables) along the top; respondent’s scores coming down

HOW TO DO IT

• Add-ins → Real Statistics → Data Analysis Tools → Misc → Correlation Tests

HOW TO INTERPRET IT

• Is your “Pearson Corr” close to -1 or +1?

• Is your “P-value” smaller than 0.05?
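Pearson’s r itself is covariance scaled by the spread of each variable, so it always falls between -1 and +1; a sketch with invented emotional-intelligence and seniority scores:

```python
import math

# hypothetical scores: emotional intelligence vs years of seniority (invented)
eq_scores = [70, 75, 80, 85, 90, 95]
seniority = [2, 3, 4, 5, 7, 8]

n = len(eq_scores)
mean_x = sum(eq_scores) / n
mean_y = sum(seniority) / n

# Pearson's r = covariance / (spread of x * spread of y)
numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(eq_scores, seniority))
denom_x = math.sqrt(sum((x - mean_x) ** 2 for x in eq_scores))
denom_y = math.sqrt(sum((y - mean_y) ** 2 for y in seniority))
r = numerator / (denom_x * denom_y)
print(round(r, 3))  # close to +1 -> strong positive relationship
```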

VISUAL RELATIONSHIP (“scatter plot with trend line” in Excel)

SO WHAT?

• So now you have shown that the association between the degree of presence or absence of one characteristic (variable) and the degree of presence or absence of another characteristic (variable) cannot be explained in terms of chance alone

• In other words, there is a statistically significant relationship between the two characteristics (variables)

• Notes: You have not proven causation, only the existence of a significant relationship

• You now need to hypothesise possible explanatory factors contributing to this relationship

3RD FAMILY (c)

• Interval / ratio data• Similarities between characteristics• Cronbach Alpha, Correlation and Multiple Regression• If testing relationship between 3 or more variables …

MULTIPLE REGRESSION

• This is performing multiple correlations at one time
• Comparing the strength and direction of relationships between more than 2 variables at one time

HOW TO FORMAT YOUR DATA (COLUMNS)

• Characteristics (variables) along the top; respondent’s scores coming down

HOW TO DO IT

• Add-ins → Real Statistics → Data Analysis Tools → Reg → Multiple Linear Regression

• Use “Input Range X” for all your independent variables (that act on the dependent variable)

• Use “Input Range Y” for your one and only dependent variable (that is acted upon by a number of independent variables)

HOW TO INTERPRET IT (a)

• “R Square” = the “percent” (e.g. 0.92 = 92%) of the changes in the dependent variable that can be explained by changes in the combination of all independent variables

HOW TO INTERPRET IT (b)

1. Is your “p-value” smaller than 0.05? (remember “E” = exponent used for very tiny numbers when negative)

2. Or: Does “sig” say “yes”?

HOW TO INTERPRET IT (c)

• “P-values” if you want to narrow down many variables (and keep the most significant)

• “Coeff’s” if you want to derive a weighted formula (e.g. if you change one unit of this independent variable, then how much will it change the dependent variable?)
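Under the hood, multiple linear regression is a least-squares fit; a self-contained sketch that solves the normal equations directly. The data are invented and generated from y = 1 + 2·x1 + 1·x2, so the fit recovers those coefficients exactly:

```python
def fit_multiple_regression(X, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 + ... (normal equations)."""
    A = [[1.0] + [float(v) for v in row] for row in X]  # add intercept column
    p = len(A[0])
    # augmented matrix for (A^T A) b = (A^T y)
    M = [
        [sum(A[i][r] * A[i][c] for i in range(len(A))) for c in range(p)]
        + [sum(A[i][r] * y[i] for i in range(len(A)))]
        for r in range(p)
    ]
    # Gaussian elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, p):
            factor = M[r][col] / M[col][col]
            for c in range(col, p + 1):
                M[r][c] -= factor * M[col][c]
    # back-substitution
    b = [0.0] * p
    for r in range(p - 1, -1, -1):
        b[r] = (M[r][p] - sum(M[r][c] * b[c] for c in range(r + 1, p))) / M[r][r]
    return b  # [intercept, coefficient of x1, coefficient of x2, ...]

# invented data built from y = 1 + 2*x1 + 1*x2, so the fit is exact
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]]
y = [5, 6, 11, 12, 17, 18]
coeffs = fit_multiple_regression(X, y)
print([round(c, 6) for c in coeffs])
```

Real Statistics and Excel’s LINEST do this (plus p-values and R²) for you; the sketch only shows where the coefficients come from.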

SO WHAT?

• So now you have shown that the association between the degree of presence or absence of a combination of characteristics (variables) and the degree of presence or absence of another characteristic (variable) cannot be explained in terms of chance alone

• In other words, there is a statistically significant relationship between the characteristics (variables)

• Notes: You have not proven causation, only the existence of a significant relationship (the regression model may imply a causal direction, but it does not prove one)

• You now need to hypothesise possible explanatory factors contributing to this relationship

EXAMPLE (JANET GENIS, 2016)

• A multiple regression analysis was performed applying overall pass rate as the criterion, and fifteen factors as predictors to determine if overall pass rate scores could be predicted as a function of general scores of these factors. These predictor factors are: school management and leadership (SML), curriculum management (CM), community and parental involvement (CPI), teaching skills (TS), safety and security (SS), discipline (D), welfare (W), extracurricular activities (EC), building structure (BS), classroom resources (CR), electricity (E), water supply (WS), ablutions (A), outdoor facilities (OF) and fencing (F).

• The analysis was found to be statistically significant, F(15,181) = 460.0500239, significance F = 8.7225E-135, indicating that these factors are good predictors of overall pass rate. This multiple regression accounted for 96.69% of the variability, as indexed by the adjusted R² statistic.

• Individual p-values for each of these factors indicate five factors that have an individual p-value < 0.05. This implies a 95% confidence interval – the lower the p-value, the stronger the significance. These factors are school management and leadership (SML), curriculum management (CM), community and parental involvement (CPI), classroom resources (CR) and outdoor facilities (OF).

FULL MULTIPLE REGRESSION

• A stepped multiple regression analysis was performed with only these five factors.

• Again the analysis was found to be statistically significant, F(5,191) = 1350.262733, significance F = 2.3889E-146, indicating that these factors are very good predictors of overall pass rate.

• This multiple regression accounted for 96.67% of the variability, as indexed by the adjusted R² statistic. The significance increased greatly while the variability changed very little. This time each of these five factors had an individual p-value < 0.01.

• Overall pass rate could thus be predicted as a function of the general score of these five factors.

STEPPED REGRESSION / REGRESSION RE-RUN

REGRESSION EQUATION

• The regression equation was found to be: Y = (5.99*SML) + (4.04*CPI) + (3.19*OF) + (2.67*CM) + (1.36*CR). Thus, for a unit change in each of these factors we can predict a 17.25 unit change in Y (pass rate).

• This means that there is a strong multiplier effect amongst these five factors; for example, for every unit change within school management and leadership, an improvement of 5.99 units can be expected in the pass rate. Therefore, when combining every unit change in each of these five factors, a change of 17.25 units in the pass rate can be predicted.
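The arithmetic behind the 17.25 figure can be checked directly; the coefficients below are taken from the regression equation in the example:

```python
# coefficients from the regression equation in the example
coefficients = {"SML": 5.99, "CPI": 4.04, "OF": 3.19, "CM": 2.67, "CR": 1.36}

# predicted change in pass rate if every factor improves by one unit
total_change = sum(coefficients.values())
print(round(total_change, 2))  # 17.25
```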


VIDEO: https://www.youtube.com/watch?v=ruliuan0u3w

(after this video … end of the unit)

*PRACTICAL 1 of 1: RESEARCH PROPOSAL ALIGNMENT TABLE

Title: (columns: Ch 1 | Ch 2 | Ch 3)
• Research Questions
• Key Literature
• Sample
• Data Collection Instrument
• Method of Data Analysis


END OF THE UNIT

Congratulations! Nice going!
