Basic Statistical Tools for Research

8/4/2019 Basic Statistical Tools for Research


by

Benjamin L. Marciano, Jr.



Objectives

- Understand the statistical nature of research data
- Identify approaches in quantitative research planning (data collection, organization and analysis)
- Identify appropriate statistical techniques for a given study design



Considerations in Choosing Statistical Tools

1. Level of Measurement
2. Nature of Statistical Relationship
3. Parametric versus Nonparametric Test



Levels of Measurement

- Nominal: numbers are just labels for categories
- Ordinal: ranks, hierarchy, order
- Interval: equally spaced scores; no true zero, so ratios of scores ("twice as much") are not meaningful
- Ratio: highest level of measurement; equally spaced scores with a true zero, so ratios are meaningful



Nature of Statistical Relationship (depends on the objective of the study)

- Association/Correlation
- Comparing groups or treatment effects
- Predicting a value of an attribute of interest
- Testing the effect of several factors on a response



Parametric vs. Nonparametric

Choice relies on:
- the level of measurement
- the assumption of normality
- the sample size

Note: Parametric tests are generally more powerful than nonparametric tests when their assumptions hold.



Probability and Non-probability Sampling

- Probability sampling: a procedure wherein every element of the population is given a (known) nonzero chance of being selected in the sample
- Nonprobability sampling: a procedure wherein not all the elements in the population are given a chance of being included in the sample



Issues

- Choice relies on:
    Nature of measurement
    Variation in the population
    Tolerable margin of error
- Treatment of heterogeneity:
    Stratification
    Clustering
    Multi-staging
- Formula



Testing Statistical Hypotheses: The Hypotheses

- Null hypothesis (Ho): the hypothesis of no difference or no effect
- Alternative hypothesis (Ha): the operational statement that is accepted in case the null hypothesis is rejected



Testing Statistical Hypotheses: Level of Significance (alpha)

- The size of the risk (0 < alpha < 1) of erroneously rejecting Ho that the researcher is willing to take.
- The choice of alpha usually depends on the consequences associated with erroneously rejecting Ho:
    alpha = 0.01 or less => very serious error
    alpha = 0.05 => moderately serious error
    alpha = 0.10 => not too serious error



A Summary of Possible Decisions in Hypothesis Testing

                        State of Nature (True Situation)
Decision (Data says)    Ho is true                          Ho is false
Reject Ho               TYPE I error                        CORRECT decision
                        chance of occurrence = alpha        chance of occurrence = 1 - beta
                        (level of significance)             (power of the test)
Do not reject Ho        CORRECT decision                    TYPE II error
                        chance of occurrence = 1 - alpha    chance of occurrence = beta



Testing Statistical Hypotheses: The p-value

- The smallest level of significance at which Ho will be rejected based on the information contained in the sample.
- Alternative form of decision rule based on the p-value: Reject Ho if the p-value is less than or equal to the level of significance (alpha).
- Remember: If p is low, Ho must go!
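
The decision rule can be sketched in a few lines of Python. This is only an illustration: the function name decide and the default alpha of 0.05 are assumptions made for the example, not something taken from the slides.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the p-value decision rule: reject Ho when p <= alpha."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie between 0 and 1")
    return "Reject Ho" if p_value <= alpha else "Do not reject Ho"


# Hypothetical p-values, for illustration only.
print(decide(0.03))        # compared with the default alpha = 0.05
print(decide(0.03, 0.01))  # same p-value, stricter alpha
```

Note that the same p-value can lead to different decisions under different levels of significance, which is why alpha should be fixed before looking at the data.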



DESCRIPTIVE METHODS

Describing and Summarizing a Set of Measurements

- Presentation of Tables
- Construction of Graphs
- Computation of Summary Measures



How to Describe Data

- Averages describe the central value.
    Issue: Which average to use?
- Variation describes the extent of dispersion.
    Issue: Absolute or comparative dispersion?
- Skewness describes the degree of asymmetry.
    Where in the range of values do the data cluster?
- Percentiles identify markers or thresholds.
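
As a rough sketch of these summary measures, the snippet below uses Python's standard statistics module. The data values and the helper name describe are hypothetical. The standard deviation stands in for absolute dispersion and the coefficient of variation for comparative dispersion. The skewness shown is the simple moment-based version, one of several definitions in use.

```python
import statistics


def describe(values):
    """Return basic summary measures for a list of numbers (mean must be nonzero)."""
    n = len(values)
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)      # absolute dispersion (sample SD, n - 1)
    cv_percent = 100.0 * sd / mean     # comparative dispersion, in percent
    # Moment-based skewness: positive when the right tail is longer.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    skewness = m3 / m2 ** 1.5
    quartiles = statistics.quantiles(values, n=4)  # 25th, 50th, 75th percentiles
    return {
        "mean": mean,
        "median": median,
        "sd": sd,
        "cv_percent": cv_percent,
        "skewness": skewness,
        "quartiles": quartiles,
    }


# Hypothetical right-skewed data (e.g. waiting times in minutes).
data = [2, 3, 3, 4, 4, 5, 5, 6, 9, 19]
summary = describe(data)
for name, value in summary.items():
    print(name, value)
```

With skewed data like these, the mean is pulled toward the long tail while the median is not, which is the usual reason for preferring the median as the average in such cases.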



Chi-Square Test

- The chi-square test determines the association between two (categorical) variables set in a contingency table.
- It is generally regarded as a nonparametric test, although no parametric counterpart is in common use.
- The Fisher Exact Test is an alternative to this test for 2x2 contingency tables, especially when expected cell counts are small.



Chi-Square Test

                Low Income    Middle Income    High Income
(-) attitude        31             29               27
(+) attitude        48             93              165
Total               79            122              192

The null and alternative hypotheses are:

- Ho: Socioeconomic status and attitude are independent.
- Ha: The two variables are associated.
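
A minimal sketch of how the test statistic for this table could be computed by hand in Python follows. The slides do not report the statistic, so the numbers produced here are a worked illustration rather than the author's output. The p-value line relies on the fact that, for 2 degrees of freedom, the chi-square upper-tail probability is exp(-x/2). For other degrees of freedom a statistical library would be needed.

```python
import math

# Observed counts from the table: rows are attitude, columns are income level.
observed = [
    [31, 29, 27],    # (-) attitude: low, middle, high income
    [48, 93, 165],   # (+) attitude: low, middle, high income
]


def chi_square_independence(table):
    """Return the chi-square statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (obs - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df


chi2, df = chi_square_independence(observed)
# Closed form valid only when df == 2.
p_value = math.exp(-chi2 / 2) if df == 2 else None
print(round(chi2, 2), df, p_value)
```

The statistic comes out at about 20.9 on 2 degrees of freedom, with a p-value far below 0.05, so Ho (independence) would be rejected at the usual levels of significance.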



Correlation Analysis

- Correlation means the degree of linear association between two measurements.
- The most common correlation measure is the Pearson coefficient, r. An alternative is the Spearman coefficient for rank data.
- Pearson's r ranges from -1 to +1. Values close to either -1 or +1 indicate strong correlation, while near-zero values mean minimal or no correlation.



Correlation Analysis

- Positive correlation means that as one variable increases, there is a tendency for the other to increase as well. Also, there is a tendency for both variables to decrease together.
- Negative correlation means that as one variable increases, there is a tendency for the other to decrease, and vice versa.



Correlation Analysis

- Example: Refer to the data showing 20 nations ranked with respect to births attended by trained health care personnel and maternal mortality rate. The Spearman correlation (rs) is -0.88 (p = 0.000 as reported, i.e., p < 0.001). A significant negative correlation exists; there is a general tendency for maternal mortality to decrease when more births are attended by medical personnel.



Nation         Rank by Attended    Rank by Maternal Mortality Rate
               Percentage          per 100,000 Live Births
Bangladesh      1                  18
Nepal           2                  20
Morocco         3                  16
Pakistan        4                  17
Nigeria         5                  19
Kenya           6                  14.5
Philippines     7                  11
Iran            8                  12.5
Ecuador         9                  14.5
Portugal       10                   6.5
Vietnam        11                  12.5
Spain          12.5                 2.5
Panama         12.5                 9
Chile          14                  10
Switzerland    16                   2.5
USA            16                   5
Hungary        16                   8
Netherlands    19                   6.5
Hong Kong      19                   4
Belgium        19                   1
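
As a sketch, the Spearman coefficient for these ranks can be reproduced by applying the ordinary Pearson formula to the two rank columns (tied nations already carry average ranks in the table). The helper name pearson is an assumption for this example. The result should come out close to the reported -0.88, with small differences attributable to rounding.

```python
import math

# (nation, rank by attended percentage, rank by maternal mortality rate)
ranks = [
    ("Bangladesh", 1, 18), ("Nepal", 2, 20), ("Morocco", 3, 16),
    ("Pakistan", 4, 17), ("Nigeria", 5, 19), ("Kenya", 6, 14.5),
    ("Philippines", 7, 11), ("Iran", 8, 12.5), ("Ecuador", 9, 14.5),
    ("Portugal", 10, 6.5), ("Vietnam", 11, 12.5), ("Spain", 12.5, 2.5),
    ("Panama", 12.5, 9), ("Chile", 14, 10), ("Switzerland", 16, 2.5),
    ("USA", 16, 5), ("Hungary", 16, 8), ("Netherlands", 19, 6.5),
    ("Hong Kong", 19, 4), ("Belgium", 19, 1),
]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


attended = [row[1] for row in ranks]
mortality = [row[2] for row in ranks]
# Spearman's coefficient is Pearson's r computed on the ranks.
rs = pearson(attended, mortality)
print(round(rs, 2))
```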



Paired-Sample Tests

- Paired-sample tests are used to test significant differences in scores between related observations or matched pairs.
- The two common types of paired-sample tests are:
    Paired t-test (parametric)
    Wilcoxon Signed Ranks Test (nonparametric)



Paired-Sample Tests

- The paired t-test is used when the scores (more precisely, the paired differences) are assumed to be normally distributed or to follow a bell-shaped histogram.
- The Wilcoxon signed-ranks test is used when there is marked skewness in the data or when the data are measured on an ordinal scale (ranks).
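
To make the paired t-test concrete, here is a small sketch that computes the t statistic from the paired differences. The before/after scores are hypothetical, and the critical value 2.365 is the tabulated two-sided 5% point of the t distribution with 7 degrees of freedom.

```python
import math
import statistics


def paired_t(before, after):
    """Return the paired t statistic and its degrees of freedom."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)   # sample SD of the differences
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return t_stat, n - 1


# Hypothetical scores of 8 subjects before and after a treatment.
before = [72, 65, 80, 58, 90, 77, 69, 84]
after = [75, 70, 82, 63, 91, 80, 75, 85]

t_stat, df = paired_t(before, after)
T_CRIT_05_DF7 = 2.365  # tabulated two-sided 5% critical value, df = 7
print(round(t_stat, 2), df, abs(t_stat) > T_CRIT_05_DF7)
```

Because the test works on the differences only, it is the distribution of those differences, not of the raw scores, that matters for the normality assumption.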



Independent-Sample Tests

- Independent-sample tests are used to determine if scores significantly differ between two disjoint or exclusive groups.
- The two most common types of independent-sample tests are:
    Independent-sample t-test (parametric)
    Mann-Whitney Test (nonparametric)



Independent-Sample Tests

- Like the paired t-test, the independent-sample t-test is used when scores are assumed to be normally distributed or to follow a bell-shaped histogram.
- The Mann-Whitney test is used when marked skewness in the observed measurements is present or when the data are ordinal (ranks).
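
A comparable sketch for the independent-sample t-test is given below. It uses the pooled-variance form, which additionally assumes that the two groups have equal variances. This assumption and the data are introduced for illustration and are not stated in the slides. The critical value 2.228 is the tabulated two-sided 5% point for 10 degrees of freedom.

```python
import math
import statistics


def pooled_t(group1, group2):
    """Return the pooled-variance t statistic and its degrees of freedom."""
    n1, n2 = len(group1), len(group2)
    mean1, mean2 = statistics.mean(group1), statistics.mean(group2)
    var1, var2 = statistics.variance(group1), statistics.variance(group2)
    # Pooled estimate of the common variance.
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_error, n1 + n2 - 2


# Hypothetical scores from two disjoint groups of 6 subjects each.
group_a = [23, 25, 28, 30, 24, 26]
group_b = [19, 22, 20, 24, 21, 20]

t_stat, df = pooled_t(group_a, group_b)
T_CRIT_05_DF10 = 2.228  # tabulated two-sided 5% critical value, df = 10
print(round(t_stat, 2), df, abs(t_stat) > T_CRIT_05_DF10)
```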



One-way Analysis of Variance

- The one-way ANOVA is the extension of the independent-sample t-test to the case of three or more disjoint or exclusive groups.
- When the data are ordinal or when there is skewness, the counterpart procedure is the Kruskal-Wallis test.
- When the null hypothesis of equality of means is rejected, pairwise comparisons are necessary (e.g., Duncan, Tukey, Scheffe, etc.).



One-way Analysis of Variance

- Example: Four techniques are being used to perform a task. Five subjects each were included in the experimental design to determine whether or not the techniques yield, on the average, the same results (time, in seconds). The analytical results for the 4 techniques are as follows:



Technique    Results (time, in seconds)
A            58.7   61.4   60.9   59.1   58.2
B            62.7   64.5   63.1   59.2   60.3
C            55.9   56.1   57.3   55.2   58.1
D            60.7   60.3   60.9   61.4   62.3

Technique    A       B       C       D
Mean         59.7    62.0    56.5    61.1
Std. Dev.    1.4     2.2     1.2     0.8



One-way Analysis of Variance

- Ho: The means across the four techniques are equal.
- Ha: At least one mean is different.
- The F-test statistic has p-value 0.000.
- At the 5% level of significance, we reject Ho. At least one mean is different.
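
The F statistic behind this conclusion is not shown on the slides. A sketch of how it could be computed from the five results per technique is given below. The function name one_way_anova is an assumption, and instead of an exact p-value the code compares F with the tabulated 5% critical value of the F distribution with 3 and 16 degrees of freedom (approximately 3.24).

```python
# Results (time, in seconds) for the four techniques, five subjects each.
techniques = {
    "A": [58.7, 61.4, 60.9, 59.1, 58.2],
    "B": [62.7, 64.5, 63.1, 59.2, 60.3],
    "C": [55.9, 56.1, 57.3, 55.2, 58.1],
    "D": [60.7, 60.3, 60.9, 61.4, 62.3],
}


def one_way_anova(groups):
    """Return the F statistic with its numerator and denominator df."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group (explained) and within-group (error) sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


f_stat, df1, df2 = one_way_anova(list(techniques.values()))
F_CRIT_05 = 3.24  # approximate tabulated 5% critical value of F(3, 16)
print(round(f_stat, 2), df1, df2, f_stat > F_CRIT_05)
```

The computed F of roughly 13.3 is well above the critical value, which agrees with the rejection of Ho stated above.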



N-way Analysis of Variance

- Allows analysis of main effects and interactions
- Most popular is the two-way ANOVA
- Presents difficulty for higher-order ANOVA
- Useful if there are blocking variables



Regression Analysis

- Regression analysis is a method relevant to analyzing a variable by using information on other variables. The variable that is being explained or analyzed is called the response or dependent variable.
- The variables whose effects act on the response are called predictor, regressor or independent variables.



Regression Analysis

- When there is only one predictor, we have a simple linear regression model.
- Response = function (one predictor)
- Ex. O2 Consumption = function of Running Time
- The formal model is $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, where $\varepsilon_i$ is a random disturbance.
- O2 = intercept value + slope value times RunTime + random error
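
A minimal least-squares sketch of the simple linear regression model is shown below. The running-time and oxygen-consumption values are hypothetical, chosen only to make the arithmetic easy to follow, and the function name fit_simple_linear is an assumption.

```python
def fit_simple_linear(x, y):
    """Return the least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: running time (minutes) and O2 consumption.
run_time = [8.0, 9.0, 10.0, 11.0, 12.0]
o2 = [60.0, 56.0, 50.0, 46.0, 43.0]

intercept, slope = fit_simple_linear(run_time, o2)
print(intercept, slope)
```

The slope estimates the change in mean O2 consumption per additional minute of running time. The random error term is whatever the fitted line leaves unexplained in each observation.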



Regression Analysis

- When there are many predictors, we have a multiple linear regression model (MLRM).
- Response = function (several predictors)
- Ex. O2 = function of RunTime and Age
- The MLRM is written as

    $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i$

  where
    $Y_i$ is the value of the response variable in the ith observation,
    $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ are parameters of the model,
    $X_{1i}, X_{2i}, \ldots, X_{ki}$ are the values of the predictors in the ith observation, and
    $\varepsilon_i$ is the error term.



So, I want to use regression. What is the first thing I should do?

IDENTIFY YOUR RESPONSE VARIABLE!

- This should be quantifiable.
- Yes/No, High/Low, and similar categorical responses are not valid here.



How about my predictors?

- You may choose quantitative and dummy variables as your predictors. Quantitative predictors must have correlation with the response.
- Make sure there is no redundancy among predictors. Check this by computing their correlations. If there are correlated predictors, choose only the one that has practical significance to your study. There are advanced statistical methods that treat correlated predictors.



What's next?

- You are now ready to fit the regression equation. To illustrate, consider an example.

Renar Interiors operates in medium-size business areas. In considering an expansion into other areas of similar size, it wishes to investigate how sales (Y) can be predicted from the size of the target market, i.e., the 20-39 age group (X1), and the average monthly income of households in the area (X2). Data on these variables in the most recent year for 21 business areas where the company operates are given below.



Renar Interiors Data

- See the provided copies.



How to use Excel?

In Excel, click Tools, Data Analysis, Regression.

1. Supply the Input Y Range box with the appropriate cell addresses.
2. Supply the Input X Range box with the appropriate cell addresses of the X1 and X2 values contiguously placed in the data matrix.
3. Supply the Output Range with any convenient location.
4. Excel shall return an output of analysis.



Results

- The Coefficients column gives the estimated values of the regression parameters.
- Here, the fitted model is:
    Y = -3.887 + 0.146 X1 + 0.929 X2
- SALES = -3.887 + 0.146 x Market Size + 0.929 x Income



How do I interpret the fitted model?

Intercept, -3.887
- The value of the intercept, -3.887, is not interpreted since the two predictors do not have values equal to zero.

0.146 x Market Size
- There is an estimated increase of 0.146 million pesos (i.e., P146,000) in mean sales when the size of the target market increases by one percent, holding the average monthly family income constant.

0.929 x Income
- There is an estimated increase of 0.929 million pesos (i.e., P929,000) in mean sales when the average monthly family income increases by one thousand pesos, holding the size of the target market constant.
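
The "holding the other predictor constant" reading of each coefficient can be checked with a short sketch. The coefficients are those of the fitted model. The input values (market size 50, income 17) are hypothetical and only serve to show how the predicted mean sales move when one predictor changes by one unit. This illustrates the interpretation only and says nothing about whether the model is adequate for prediction.

```python
# Estimated coefficients of the fitted Renar Interiors model.
INTERCEPT = -3.887
B_MARKET_SIZE = 0.146   # per unit of target market size
B_INCOME = 0.929        # per thousand pesos of average monthly income


def predict_sales(market_size, income):
    """Estimated mean sales (million pesos) from the fitted equation."""
    return INTERCEPT + B_MARKET_SIZE * market_size + B_INCOME * income


# Hypothetical business area, for illustration only.
base = predict_sales(50.0, 17.0)
more_market = predict_sales(51.0, 17.0)   # market size up by one unit
more_income = predict_sales(50.0, 18.0)   # income up by one thousand pesos

print(round(base, 3))
print(round(more_market - base, 3))   # equals the market size coefficient
print(round(more_income - base, 3))   # equals the income coefficient
```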



Can I use the model already for prediction purposes?

NOT YET!

- You still need to investigate the model's goodness-of-fit.
- You need to test whether your predictors are significant.
- You must also verify that the assumptions of regression hold.



How do I assess goodness-of-fit?

Three things:
- ANOVA
- F-test
- R squared

They lurk somewhere in the Excel output!



Analysis of Variance (ANOVA)

- The ANOVA is a decomposition of the total variation in the response into explained (pattern) and unexplained (error) parts.
- The explained variability is the amount of variation in the response variable that may be attributed to the predictors explicitly stated in the model.
- The unexplained variability is the amount of variation attributed to random error.



Results from the ANOVA table for the Renar Interiors data

- The first column in the table labels the sources of variation (Regression and Residual).
- The df column refers to the degrees of freedom.
    The df for Regression is always the number of regression parameters minus one.
    The df for Residual is the sample size minus the number of regression parameters.
    The total df is the sum of these two degrees of freedom.



Results from the ANOVA table for the Renar Interiors data

- SS refers to Sum of Squares. The value 240.3407 represents the amount of variation in sales explained by the two predictors in the model. The value 21.9658 represents the unexplained variation. These two values sum to 262.3065. There is good fit if the Regression Sum of Squares is much larger than the Residual Sum of Squares.
- MS refers to Mean Squares. The values in this column are the ratios of each sum of squares to its respective degrees of freedom. Mean squares have no physical meaning but are instrumental in computing the F-statistic.
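
The df, MS and F columns can be reconstructed from the two sums of squares, as the sketch below shows. It assumes three regression parameters (the intercept and two slopes) and the 21 business areas mentioned in the example. The variable names are chosen for this illustration.

```python
# Figures taken from the Renar Interiors ANOVA discussion.
n_obs = 21              # business areas in the data
n_params = 3            # intercept plus two slopes (assumed count)
ss_regression = 240.3407
ss_residual = 21.9658

df_regression = n_params - 1        # parameters minus one
df_residual = n_obs - n_params      # sample size minus parameters
df_total = df_regression + df_residual

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_stat = ms_regression / ms_residual   # ratio of the two mean squares

print(df_regression, df_residual, df_total)
print(round(ms_regression, 4), round(ms_residual, 4), round(f_stat, 2))
```

The resulting F of about 98.47 matches the F-statistic quoted for the Renar data.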



The F-test

- The F-test determines if regression is meaningful for the data at hand. When the p-value is small (see Significance F in the Excel output), it means that there is at least one significant predictor in the analysis.



What is the role of the p-value?

- The p-value is our evidence against the hypothesis that we do not have any significant predictor in the data. When it is small, we reject that hypothesis.
- Technically, we call the above hypothesis our null hypothesis or Ho.
- Remember: WHEN p IS LOW, Ho MUST GO!
- Rule of Thumb: The p-value is low if it is less than 0.05.



Results from the Renar Data

- In the Renar data, the F-statistic is 98.47 with an associated p-value of 2.03 x 10^-10 (almost zero!).
- Since the p-value is lower than 0.05, we reject Ho. We can therefore conclude that at least one of our two predictors can significantly explain sales.
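
As a cross-check of the reported p-value, the sketch below uses a closed form that holds only when the numerator degrees of freedom equal 2: the upper-tail probability of F(2, d2) at f is (1 + 2f/d2)^(-d2/2). The denominator df of 18 assumes 21 observations and 3 regression parameters. For other degrees of freedom a statistical library would be needed.

```python
def f_upper_tail_df1_2(f_stat, df_residual):
    """Upper-tail probability of F(2, df_residual); valid only for numerator df = 2."""
    return (1.0 + 2.0 * f_stat / df_residual) ** (-df_residual / 2.0)


f_reported = 98.47      # F-statistic reported for the Renar data
df_res = 18             # 21 observations minus 3 parameters (assumed)

p_value = f_upper_tail_df1_2(f_reported, df_res)
alpha = 0.05
print(p_value, p_value < alpha)
```

The value obtained is of the order of 2 x 10^-10, in line with the figure quoted from the Excel output.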



The Coefficient of Multiple Determination (R squared)

- The coefficient of multiple determination, R squared, is a goodness-of-fit measure.
- R squared is a figure of merit; the higher the R squared, the better the success of the model in explaining the variation in the response using the set of predictors.



Results from the Renar Data

- The R squared is normally expressed as a percentage and is interpreted as the amount of variability in the response explained by the independent variables.
- The value of the R squared = 0.9163 means that 91.63% of the variation in sales can be explained by size of target market and average monthly family income.
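
The R squared value follows directly from the sums of squares in the ANOVA table, as this short sketch shows (the variable names are chosen for the illustration):

```python
# Sums of squares from the Renar Interiors ANOVA table.
ss_regression = 240.3407   # explained variation
ss_residual = 21.9658      # unexplained variation
ss_total = ss_regression + ss_residual

# R squared is the explained share of the total variation.
r_squared = ss_regression / ss_total
print(round(ss_total, 4), round(r_squared, 4), f"{100 * r_squared:.2f}%")
```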



CAVEAT on the Coefficient of Multiple Determination (R squared)

- A drawback of the R squared is that it naturally increases as the number of predictors increases. This is true even if the added predictor(s) are not significant.
- As an alternative, we use the adjusted R squared (Ra squared).
- Ra squared penalizes the R squared for the addition of regressors that do not contribute to the explanatory power of the model.
- The Ra squared is never larger than the R squared, can decrease as regressors are added and, for poorly fitting models, may even be negative.
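
The slides do not give the formula for the adjusted R squared. A common definition, assumed here, is Ra squared = 1 - (1 - R squared)(n - 1)/(n - p), with n the sample size and p the number of regression parameters including the intercept. The sketch applies it to the Renar figures.

```python
def adjusted_r_squared(r_squared, n_obs, n_params):
    """Adjusted R squared; n_params counts the intercept as a parameter."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_params)


# Renar Interiors figures: R squared from the sums of squares, 21 areas,
# and 3 parameters (intercept plus two slopes, an assumed count).
r2 = 240.3407 / 262.3065
adj_r2 = adjusted_r_squared(r2, n_obs=21, n_params=3)
print(round(r2, 4), round(adj_r2, 4))

# A weak, hypothetical fit shows that the adjusted value can turn negative.
print(adjusted_r_squared(0.05, n_obs=21, n_params=3))
```

For the Renar data the adjusted value (about 0.907) is only slightly below the R squared, which suggests that neither predictor is merely inflating the fit.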



The t-tests

- The t-test helps in assessing if an individual predictor is significant.
- Let us interpret the t-tests for the Renar data.
    X Variable 1 (Target Market Size): Since p = 2.05 x 10^-6

