basic statistical tools for research
-
8/4/2019 Basic Statistical Tools for Research
by
Benjamin L. Marciano, Jr.
-
Objectives
• Understand the statistical nature of research data
• Identify approaches in quantitative research planning (data collection, organization and analysis)
• Identify appropriate statistical techniques for a given study design
-
Considerations in Choosing Statistical Tools
1. Level of Measurement
2. Nature of Statistical Relationship
3. Parametric versus Nonparametric Test
-
Levels of Measurement
• Nominal: numbers are just categories
• Ordinal: ranks, hierarchy, order
• Interval: equally spaced scores; no mathematical concept of multiplicity (ratios of values are not meaningful); no true zero
• Ratio: highest level of measurement; has a true zero, so ratios are meaningful
-
Nature of Statistical Relationship (depends on objective of the study)
• Association/Correlation
• Comparing groups or treatment effects
• Predicting a value of an attribute of interest
• Testing the effect of several factors on a response
-
Parametric vs. Nonparametric
Choice relies on
• the level of measurement
• assumption of normality
• sample size
Note: Parametric tests are generally more powerful than nonparametric tests.
-
Probability and Non-probability Sampling
• Probability Sampling: procedure wherein every element of the population is given a (known) nonzero chance of being selected in the sample
• Nonprobability Sampling: procedure wherein not all the elements in the population are given a chance of being included in the sample
-
Issues
• Choice relies on
  - Nature of measurement
  - Variation in the population
  - Tolerable margin of error
• Treatment of Heterogeneity
  - Stratification
  - Clustering
  - Multi-staging
• Formula
-
Testing Statistical Hypotheses
The Hypotheses
• Null hypothesis (Ho): the hypothesis of no difference or no effect
• Alternative hypothesis (Ha): the operational statement that is accepted in case the null hypothesis is rejected
-
Testing Statistical Hypotheses
Level of Significance (alpha)
• The size of the risk (0 < alpha < 1) of erroneously rejecting Ho that the researcher is willing to take
• The choice of alpha usually depends on the consequences associated with erroneously rejecting Ho.
• alpha = 0.01 or less => very serious error
• alpha = 0.05 => moderate
• alpha = 0.10 => not too serious error
-
A Summary of Possible Decisions in Hypothesis Testing

                        State of Nature (True Situation)
Decision (Data says)    Ho is true                        Ho is false
Reject Ho               TYPE I error                      CORRECT decision
                        chance of occurrence = alpha      chance of occurrence = 1 - beta
                        (level of significance)           (power of the test)
Do not reject Ho        CORRECT decision                  TYPE II error
                        chance of occurrence = 1 - alpha  chance of occurrence = beta
-
Testing Statistical Hypotheses
The p-value
• The smallest level of significance at which Ho will be rejected based on the information contained in the sample
• Alternative form of decision rule based on the p-value: Reject Ho if the p-value is less than or equal to the level of significance (alpha).
• Remember: If p is low, Ho must go!
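The decision rule above can be written as a one-line check. This is only a minimal sketch; the function name `decide` and the default alpha of 0.05 are assumptions made for illustration, not part of the slides.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the p-value decision rule: reject Ho when p <= alpha."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return "reject Ho" if p_value <= alpha else "do not reject Ho"


# The same p-value can lead to different decisions under different alphas.
print(decide(0.03))              # compared against alpha = 0.05
print(decide(0.03, alpha=0.01))  # compared against a stricter alpha
```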
-
DESCRIPTIVE METHODS
Describing and Summarizing A Set of Measurements
• Presentation of Tables
• Construction of Graphs
• Computation of Summary Measures
-
How to Describe Data
• Averages describe the central value
  Issue: Which average to use?
• Variation describes extent of dispersion
  Issue: Absolute or comparative dispersion?
• Skewness describes degree of asymmetry
  Where in the range of values do data cluster?
• Percentiles identify markers or thresholds
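As a rough illustration of these summary measures, the sketch below uses Python's standard `statistics` module. The five values are the Technique A times that appear in the one-way ANOVA example of this deck; the moment-based skewness formula is one common choice among several and is an assumption here.

```python
import statistics

# Technique A times (seconds) from the one-way ANOVA example in this deck.
times = [58.7, 61.4, 60.9, 59.1, 58.2]

mean = statistics.mean(times)      # average: central value
median = statistics.median(times)  # alternative average, robust to skewness
sd = statistics.stdev(times)       # absolute dispersion (sample sd, n - 1)
cv = 100 * sd / mean               # comparative dispersion, in percent

# Moment-based skewness: positive values indicate a longer right tail.
n = len(times)
m2 = sum((x - mean) ** 2 for x in times) / n
m3 = sum((x - mean) ** 3 for x in times) / n
skewness = m3 / m2 ** 1.5

# Quartiles as percentile markers (25th, 50th, 75th).
quartiles = statistics.quantiles(times, n=4)

print(f"mean={mean:.2f} median={median:.2f} sd={sd:.2f} cv={cv:.1f}%")
print(f"skewness={skewness:.2f} quartiles={quartiles}")
```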
-
Chi-Square Test
• The chi-square test determines the association between two (categorical) variables set in a contingency table.
• Generally regarded as a nonparametric test, with no parametric counterpart in common use.
• The Fisher Exact Test is an alternative to this test for 2x2 contingency tables.
-
Chi-Square Test

               Low Income   Middle Income   High Income
(-) attitude       31            29              27
(+) attitude       48            93             165
Total              79           122             192

The null and alternative hypotheses are:
• Ho: Socioeconomic status and attitude are independent.
• Ha: The 2 variables are associated.
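A minimal sketch of how the chi-square statistic for the table above could be computed by hand. The slides do not report the test result, so nothing here should be read as the author's output. The p-value line relies on the fact that, for 2 degrees of freedom, the chi-square upper-tail probability equals exp(-x/2); that shortcut does not carry over to other table sizes.

```python
import math

# Observed counts: rows are (-) and (+) attitude; columns are low, middle, high income.
observed = [[31, 29, 27],
            [48, 93, 165]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count under independence: (row total x column total) / n
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Closed form valid only because df == 2 for a 2x3 table.
p_value = math.exp(-chi2 / 2)

print(f"chi-square = {chi2:.2f}, df = {df}, p-value = {p_value:.6f}")
```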
-
Correlation Analysis
• Correlation means the degree of linear association between two measurements.
• The most common correlation measure is the Pearson coefficient, r. An alternative to this is the Spearman coefficient for rank data.
• Pearson's r ranges from -1 to +1. Values close to either -1 or +1 indicate strong correlation, while near-zero values mean minimal or no correlation.
-
Correlation Analysis
• Positive correlation means that as one variable increases, there is a tendency for the other to increase as well. Also, there is a tendency for both variables to decrease together.
• Negative correlation means that as one variable increases, there is a tendency for the other to decrease; and vice-versa.
-
Correlation Analysis
• Example: Refer to the data showing 20 nations ranked with respect to births attended by trained health care personnel and maternal mortality rate. The Spearman correlation (rs) is -0.88 (p < 0.001, reported as 0.000). A significant negative correlation exists; there is a general tendency for maternal mortality to decrease when more births are attended by medical personnel.
-
Nation          Rank by Attended Percentage   Rank by Maternal Mortality Rate per 100,000 Live Births
Bangladesh       1                            18
Nepal            2                            20
Morocco          3                            16
Pakistan         4                            17
Nigeria          5                            19
Kenya            6                            14.5
Philippines      7                            11
Iran             8                            12.5
Ecuador          9                            14.5
Portugal        10                             6.5
Vietnam         11                            12.5
Spain           12.5                           2.5
Panama          12.5                           9
Chile           14                            10
Switzerland     16                             2.5
USA             16                             5
Hungary         16                             8
Netherlands     19                             6.5
Hong Kong       19                             4
Belgium         19                             1
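One way to check the reported coefficient is to compute the Pearson correlation of the two rank columns, which is how the Spearman coefficient is defined when ties are present. The sketch below does that with the ranks from the table; the helper name `pearson` is illustrative. The result should land close to the reported -0.88, with small differences possible from rounding or from the tie-handling formula the original software used.

```python
import math

# Ranks from the table above, in the same nation order (Bangladesh ... Belgium).
attended_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
                 12.5, 12.5, 14, 16, 16, 16, 19, 19, 19]
mortality_rank = [18, 20, 16, 17, 19, 14.5, 11, 12.5, 14.5, 6.5, 12.5,
                  2.5, 9, 10, 2.5, 5, 8, 6.5, 4, 1]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Spearman's coefficient is Pearson's r applied to the ranks.
rs = pearson(attended_rank, mortality_rank)
print(f"Spearman rs = {rs:.3f}")
```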
-
Paired-Sample Tests
• Paired-sample tests are used to test significant differences in scores between related observations or matched pairs.
• The two common types of paired-sample tests are:
  - Paired t-test (parametric)
  - Wilcoxon Signed Ranks Test (nonparametric)
-
Paired-Sample Tests
• The paired t-test is used when scores are assumed to be normally distributed or following a bell-shaped histogram.
• The Wilcoxon signed-ranks test is used when there is marked skewness in the data or when data is measured in an ordinal scale (ranks).
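To make the paired t-test concrete, here is a small sketch of the test statistic computed from the pairwise differences. The before/after scores are made up purely for illustration (they do not come from the slides), and 2.571 is the tabulated two-sided 5% critical value of the t distribution with 5 degrees of freedom.

```python
import math
import statistics


def paired_t(before, after):
    """Return the paired t statistic and its degrees of freedom."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    return mean_diff / se, n - 1


# Hypothetical scores for six matched pairs (illustration only).
before = [12, 15, 11, 14, 13, 16]
after = [10, 14, 10, 11, 12, 13]

t_stat, df = paired_t(before, after)
T_CRITICAL = 2.571  # two-sided, alpha = 0.05, df = 5 (from a t table)

print(f"t = {t_stat:.2f}, df = {df}")
print("reject Ho" if abs(t_stat) > T_CRITICAL else "do not reject Ho")
```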
-
Independent-Sample Tests
• Independent-sample tests are used to determine if scores significantly differ between two disjoint or exclusive groups.
• The two most common types of independent-sample tests are:
  - Independent-sample t-test (parametric)
  - Mann-Whitney Test (nonparametric)
-
Independent-Sample Tests
• Like the paired t-test, the independent-sample t-test is used when scores are assumed to be normally distributed or following a bell-shaped histogram.
• The Mann-Whitney test is used when marked skewness in the observed measurements is present or when data is ordinal (ranks).
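The sketch below shows the pooled-variance form of the independent-sample t statistic. For data it borrows the Technique A and Technique C times from the one-way ANOVA example in this deck, purely as an illustration; the slides themselves do not run this comparison, and the equal-variance (pooled) form is an assumption. The value 2.306 is the tabulated two-sided 5% critical value for 8 degrees of freedom.

```python
import math
import statistics


def pooled_t(group1, group2):
    """Return the pooled-variance t statistic and its degrees of freedom."""
    n1, n2 = len(group1), len(group2)
    var1, var2 = statistics.variance(group1), statistics.variance(group2)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (statistics.mean(group1) - statistics.mean(group2)) / se, df


technique_a = [58.7, 61.4, 60.9, 59.1, 58.2]
technique_c = [55.9, 56.1, 57.3, 55.2, 58.1]

t_stat, df = pooled_t(technique_a, technique_c)
T_CRITICAL = 2.306  # two-sided, alpha = 0.05, df = 8 (from a t table)

print(f"t = {t_stat:.2f}, df = {df}")
print("reject Ho" if abs(t_stat) > T_CRITICAL else "do not reject Ho")
```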
-
One-way Analysis of Variance
• The One-way ANOVA is the extension of the independent-sample t-test to the case of three or more disjoint or exclusive groups.
• When data is ordinal or when there is skewness, the counterpart procedure is the Kruskal-Wallis test.
• When the null hypothesis of equality of means is rejected, pairwise comparisons are necessary (e.g. Duncan, Tukey, Scheffe, etc.)
-
One-way Analysis of Variance
• Example: Four techniques are being used to perform a task. Five subjects each were included in the experimental design to determine whether or not they yield, on the average, the same results (time, in seconds). The analytical results for the 4 techniques are as follows:
-
A    58.7   61.4   60.9   59.1   58.2
B    62.7   64.5   63.1   59.2   60.3
C    55.9   56.1   57.3   55.2   58.1
D    60.7   60.3   60.9   61.4   62.3

             Lab A   Lab B   Lab C   Lab D
Mean         59.7    62.0    56.2    61.1
Std. Dev.     1.4     2.2     1.2     0.8
-
One-way Analysis of Variance
• Ho: The means across four techniques are equal.
• Ha: At least one mean is different.
• The F-test statistic has a p-value reported as 0.000 (i.e., below 0.001).
• At 5% level of significance, we reject Ho. At least one mean is different.
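The F statistic behind this conclusion can be sketched from the raw times listed for techniques A to D. The slides report only the p-value, so the computed F is not being compared against a published figure. The constant 3.24 is the tabulated 5% critical value of F with (3, 16) degrees of freedom.

```python
groups = {
    "A": [58.7, 61.4, 60.9, 59.1, 58.2],
    "B": [62.7, 64.5, 63.1, 59.2, 60.3],
    "C": [55.9, 56.1, 57.3, 55.2, 58.1],
    "D": [60.7, 60.3, 60.9, 61.4, 62.3],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = sum(all_values) / len(all_values)

# Between-groups and within-groups sums of squares.
ss_between = 0.0
ss_within = 0.0
for values in groups.values():
    group_mean = sum(values) / len(values)
    ss_between += len(values) * (group_mean - grand_mean) ** 2
    ss_within += sum((x - group_mean) ** 2 for x in values)

df_between = len(groups) - 1               # 4 groups - 1
df_within = len(all_values) - len(groups)  # 20 observations - 4 groups

f_stat = (ss_between / df_between) / (ss_within / df_within)
F_CRITICAL = 3.24  # alpha = 0.05, df = (3, 16), from an F table

print(f"F = {f_stat:.2f} with df = ({df_between}, {df_within})")
print("reject Ho" if f_stat > F_CRITICAL else "do not reject Ho")
```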
-
N-way Analysis of Variance
• Allows analysis of main effects and interactions
• Most popular is the two-way ANOVA
• Presents difficulty for higher-order ANOVA
• Useful if there are blocking variables
-
Regression Analysis
• Regression analysis is a method relevant to analyzing a variable by using information on other variables. The variable that is being explained or analyzed is called the response or dependent variable.
• The variables whose effects act on the response are called predictor, regressor or independent variables.
-
Regression Analysis
• When there is only one predictor, we have a simple linear regression model.
• Response = function (one predictor)
• Ex. O2 Consumption = function of Running Time
• The formal model is $Y_i = b_0 + b_1 X_i + e_i$, where $e_i$ is a random disturbance.
• O2 = intercept value + slope value times RunTime + random error
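For a single predictor, the least-squares intercept and slope have simple closed forms. The sketch below applies them to made-up running-time and O2 values that lie exactly on a line, so the fit is easy to verify by eye; both the numbers and the function name `fit_simple_regression` are hypothetical and not taken from the slides.

```python
def fit_simple_regression(x, y):
    """Least-squares estimates (b0, b1) for the model y = b0 + b1 * x + error."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx               # slope
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1


# Hypothetical data (illustration only): O2 falls by 4 units per extra minute.
run_time = [8.0, 9.0, 10.0, 11.0, 12.0]
o2 = [60.0, 56.0, 52.0, 48.0, 44.0]

b0, b1 = fit_simple_regression(run_time, o2)
print(f"O2 = {b0:.1f} + ({b1:.1f}) x RunTime")
```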
-
Regression Analysis
• When there are many predictors, we have a multiple linear regression model.
• Response = function (several predictors)
• Ex. O2 = function of RunTime and Age
• The MLRM is written as
  $Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \dots + b_k X_{ki} + e_i$,
  where $Y_i$ is the value of the response variable in the ith observation; $b_0, b_1, b_2, \dots, b_k$ are parameters of the model; $X_{1i}, X_{2i}, \dots, X_{ki}$ are the values of the predictors in the ith observation; and $e_i$ is the error term.
-
So, I want to use regression. What is the first thing I should do?
IDENTIFY YOUR RESPONSE VARIABLE!
• This should be quantifiable.
• Yes/No, High/Low, and similar categorical responses are not valid here.
-
How about my predictors?
• You may choose quantitative and dummy variables as your predictors. Quantitative predictors should be correlated with the response.
• Make sure there is no redundancy among predictors. Check this by computing their correlations. If there are correlated predictors, choose only the one that has practical significance to your study. There are advanced statistical methods that treat correlated predictors.
-
What's next?
• You are now ready to fit the regression equation. To illustrate, consider an example.
Renar Interiors operates in medium size business areas. In considering an expansion into other areas of similar size, it wishes to investigate how sales (Y) can be predicted from the size of the target market, i.e., the 20-39 age group (X1) and the average monthly income of households in the area (X2). Data on these variables in the most recent year for 21 business areas where the company operates is given below.
-
Renar Interiors Data
• See the provided copies.
-
How to use Excel
In Excel, click Tools, Data Analysis, Regression.
1. Supply the Input Y-Range box with the appropriate cell addresses.
2. Supply the Input X-Range box with the appropriate cell addresses of the X1 and X2 values, contiguously placed in the data matrix.
3. Supply the Output Range with any convenient location.
4. Excel shall return an output of analysis.
-
Results
• The Coefficients column gives the estimated values of the regression parameters.
• Here, the fitted model is:
  Y = -3.887 + 0.146 X1 + 0.929 X2
• SALES = -3.887 + 0.146 x Market Size + 0.929 x Income
-
How do I interpret the fitted model?
-3.887
• The value of the intercept, -3.887, is not interpreted since the two predictors do not have values equal to zero.
0.146 x Market Size
• There is an estimated increase of 0.146 million pesos (i.e., P146,000) in mean sales when the size of the target market increases by one percent, holding the average monthly family income constant.
0.929 x Income
• There is an estimated increase of 0.929 million pesos (i.e., P929,000) in mean sales when the average monthly family income increases by one thousand pesos, holding the size of the target market constant.
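The "holding the other predictor constant" reading of each coefficient can be checked numerically. In the sketch below the coefficients are the fitted values from the slides, while the function name `predict_sales` and the input values are hypothetical, chosen only to illustrate how the coefficients are read.

```python
def predict_sales(market_size: float, income: float) -> float:
    """Estimated mean sales (million pesos) from the fitted Renar model.

    market_size: size of the target market, in the units used in the slides
    income: average monthly family income, in thousand pesos
    """
    return -3.887 + 0.146 * market_size + 0.929 * income


# Hypothetical inputs (illustration only).
base = predict_sales(market_size=40.0, income=15.0)

# Raising one predictor by one unit while holding the other constant
# changes the prediction by exactly that predictor's coefficient.
market_effect = predict_sales(41.0, 15.0) - base
income_effect = predict_sales(40.0, 16.0) - base

print(f"effect of +1 market size: {market_effect:.3f} million pesos")
print(f"effect of +1 thousand pesos income: {income_effect:.3f} million pesos")
```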
-
Can I use the model already for prediction purposes?
NOT YET!
• You still need to investigate the model's goodness-of-fit.
• You need to test whether your predictors are significant.
• You must also verify if the assumptions of regression hold.
-
How do I assess goodness-of-fit?
Three things:
• ANOVA
• F-test
• R squared
They lurk somewhere in the Excel output!
-
Analysis of Variance (ANOVA)
• The ANOVA is a decomposition of the total variation in the response into explained (pattern) and unexplained (error) parts.
• The explained variability is the amount of variation in the response variable that may be attributed to the predictors explicitly stated in the model.
• The unexplained variability is the amount of variation attributed to random error.
-
Results from the ANOVA table for the Renar Interiors data
• The first column in the table labels the sources of variation (Regression and Residual).
• The df column refers to the degrees of freedom.
  - The df for Regression is always the number of regression parameters minus one.
  - The df for Residual is the sample size minus the number of regression parameters.
  - The total df is the sum of these two degrees of freedom.
-
Results from the ANOVA table for the Renar Interiors data
• SS refers to Sum of Squares. The value 240.3407 represents the amount of variation in sales explained by the two predictors in the model. The value 21.9658 represents the unexplained variation. These two values sum to 262.3065. There is good fit if the Regression Sum of Squares is much larger than the Residual Sum of Squares.
• MS refers to Mean Squares. The values in this column are the ratio of each sum of squares to its respective degrees of freedom. Mean squares have no physical meaning but are instrumental in computing the F-statistic.
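The arithmetic of the ANOVA table can be retraced from the figures quoted in the slides: 21 business areas, three regression parameters (intercept plus two slopes) and the two sums of squares. This is only a sketch of the bookkeeping, not a reproduction of the Excel output; the variable names are illustrative.

```python
n_obs = 21     # business areas in the Renar data
n_params = 3   # intercept + two slopes

ss_regression = 240.3407  # explained variation
ss_residual = 21.9658     # unexplained variation
ss_total = ss_regression + ss_residual

# Degrees of freedom, following the rules stated in the slides.
df_regression = n_params - 1
df_residual = n_obs - n_params
df_total = df_regression + df_residual

# Mean squares: each sum of squares divided by its degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# The F statistic is the ratio of the two mean squares.
f_stat = ms_regression / ms_residual

print(f"df: regression={df_regression}, residual={df_residual}, total={df_total}")
print(f"MS: regression={ms_regression:.4f}, residual={ms_residual:.4f}")
print(f"F = {f_stat:.2f}")
```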
-
The F-test
• The F-test determines if regression is meaningful for the data at hand. When the p-value is small (see Significance F in Excel output), it means that there is at least one significant predictor in the analysis.
-
What is the role of the p-value?
• The p-value is our evidence against the hypothesis that we do not have any significant predictor in the data. When it is small, we reject that hypothesis.
• Technically, we call the above hypothesis our null hypothesis or Ho.
• Remember: WHEN p IS LOW, Ho MUST GO!
• Rule of Thumb: The p-value is low if it is less than 0.05.
-
Results from the Renar Data
• In the Renar data, the F-statistic is 98.47 with an associated p-value of 2.03 x 10^-10 (almost zero!).
• Since the p-value is lower than 0.05, we reject Ho. We can therefore conclude that at least one of our two predictors can significantly explain sales.
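The tiny p-value can be checked without statistical software in this particular case. When the numerator degrees of freedom equal 2, the upper-tail probability of the F distribution has the closed form (1 + 2F/df2)^(-df2/2). The sketch below applies it with F = 98.47 and 18 residual degrees of freedom; the shortcut does not apply to models with a different number of predictors.

```python
f_stat = 98.47   # F statistic reported for the Renar data
df_residual = 18 # 21 observations minus 3 regression parameters

# Upper-tail probability of F(2, df_residual); valid only when the
# regression (numerator) degrees of freedom equal 2.
p_value = (1 + 2 * f_stat / df_residual) ** (-df_residual / 2)

alpha = 0.05
print(f"p-value = {p_value:.3e}")
print("reject Ho" if p_value <= alpha else "do not reject Ho")
```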
-
The Coefficient of Multiple Determination (R squared)
• The coefficient of multiple determination, R squared, is a goodness-of-fit measure.
• R squared is a figure of merit; the higher the R squared, the better the success of the model in explaining the variation in the response using the set of predictors.
-
Results from the Renar Data
• The R squared is normally expressed as a percentage and is interpreted as the amount of variability in the response explained by the independent variables.
• The value of the R squared = 0.9163 means that 91.63% of the variation in sales can be explained by size of target market and average monthly family income.
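R squared is the explained share of the total sum of squares, so the quoted 0.9163 can be recovered from the ANOVA figures given earlier in the slides. A minimal sketch:

```python
ss_regression = 240.3407  # variation in sales explained by the two predictors
ss_residual = 21.9658     # unexplained variation
ss_total = ss_regression + ss_residual

# R squared: fraction of the total variation explained by the model.
r_squared = ss_regression / ss_total

print(f"R squared = {r_squared:.4f} ({100 * r_squared:.2f}% of variation explained)")
```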
-
CAVEAT on the Coefficient of Multiple Determination (R squared)
• A drawback of the R squared is that it naturally increases as the number of predictors increases. This is true even if the added predictor(s) are not significant.
• As an alternative, we use the adjusted R squared (Ra squared).
• Ra squared penalizes the R squared for the addition of regressors that do not contribute to the explanatory power of the model.
• The Ra squared is never larger than the R squared, can decrease as regressors are added and, for poorly fitting models, may even be negative.
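The slides do not report an adjusted R squared for the Renar data, so the sketch below only shows the usual formula, 1 - (1 - R squared)(n - 1)/(n - k - 1), applied to the quoted sums of squares. The second call uses a made-up, poorly fitting R squared of 0.05 to show that the adjusted value can turn negative.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R squared: penalizes R squared for the number of predictors."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Renar data: 21 observations, 2 predictors, R squared from the ANOVA figures.
r_squared = 240.3407 / (240.3407 + 21.9658)
renar_adjusted = adjusted_r_squared(r_squared, n_obs=21, n_predictors=2)

# Hypothetical poorly fitting model (illustration only).
poor_adjusted = adjusted_r_squared(0.05, n_obs=21, n_predictors=2)

print(f"R squared = {r_squared:.4f}, adjusted = {renar_adjusted:.4f}")
print(f"poor fit: R squared = 0.05, adjusted = {poor_adjusted:.4f}")
```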
-
The t-tests
• The t-test helps in assessing if an individual predictor is significant.
• Let us interpret the t-tests for the Renar data.
X Variable 1 (Target Market Size): Since p = 2.05 x 10^-6