probability and statistics in microsoft excel™/file/excel... · a guide to probability and...

11
a guide to Probability and Statistics in Microsoft Excel™ Resources to support the learning of mathematics, statistics and OR in higher education. www.mathstore.ac.uk The Statistical Education through Problem Solving (STEPS) glossary www.stats.gla.ac/steps/glossary

Upload: dangdien

Post on 11-Dec-2018

233 views

Category:

Documents


0 download

TRANSCRIPT

aguideto

ProbabilityandStatisticsinMicrosoftExcel™

Resourcestosupportthelearningofmathematics,statisticsandORinhighereducation.

www.mathstore.ac.uk

TheStatisticalEducationthroughProblemSolving(STEPS)glossary

www.stats.gla.ac/steps/glossary

ProbabilityandStatisticsinMicrosoftExcel™

Excelprovidesmorethan100functionsrelatingtoprobabilityandstatistics.Italsohasafacilityforconstructingawiderangeofchartsandgraphsfordisplayingdata.ThisleafletprovidesaquickreferenceguidetoassistyouinharnessingExcel’sstatisticalcapability.Exceptwhereindicated,thefeaturesincludedhereareavailableinExcelVersions4.0andabove.AlmostalltheinstructionsherealsoapplytothespreadsheetfacilityinOpenOffice(http://openoffice.org‐suite.com/);anyslightvariationsincommandsshouldbeobvioustotheuser.Excelisnotdesignedforstatisticalcomputing.Ifyourequirestatisticalanalysisbeyonddatavalidationandmanipulation,tabulation,presentationandcalculationofsummarystatistics,youareadvisedtouseabespokestatisticalpackagesuchasMinitaborSPSS.

ExcelhasanAnalysisToolpakoptional“add‐in”facilitythatincludesmacrosforcarryingoutmanyelementarystatisticalanalyses.Theinstructionsforinstallationofthisadd‐invarywiththeversionofExcel—usetheHelpfacilityinExcelforfurtherinformationonthis.Thisadd‐infacilityisnotusedinthisleaflet.Therearetworeasonswhythisadd‐inshouldbeusedwithcare:• Unlikeotherspreadsheetfunctionality,whichensuresthatcalculationsautomaticallyupdateinthe

lightofchangeselsewhereintheworkbook,theoutputfromtheadd‐inisnotdynamicallylinkedtothesourcedata.Henceifanyofthedatachangetheadd‐inmustberunagaintoobtainupdatedoutput.

• Outputfromtheadd‐incanbemisleading(seehttp://support.microsoft.com/kb/829252forexample).Thereareothercommerciallyavailableadd‐insthatmakeuseofExcel’sfamiliaruserinterfacebutsupplementitsstatisticalfunctionality.Examplesinclude:

Analyse‐it® http://www.analyse‐it.com/

R‐Excel http://rcom.univie.ac.at/

Unistat http://www.unistat.com/

XLSTAT http://www.xlstat.com/en/home/

StatTools http://www.palisade.com/stattools/

UsingthisleafletSupposeyouhaveasampleofthreedata,10.4,11.2and16.4,thatyouhaveenteredintocellsA2:A4onaworksheet.InExcelafunction,e.g.SUM,canbeappliedtothesedatainoneoffourways:

=SUM(10.4,11.2,16.4)=SUM(A2,A3,A4)=SUM(A2:A4)

=SUM(x) wherexisthenameattachedtorangeA2:A4.

Inthisleaflet,forsimplicity,wehavechosentorefertonamedranges.Tonamearange,simplyhighlighttherangeofcells,clickintheNameBoxonthefarleftoftheFormulaBar,typeintherequiredname,e.g.x,thenpressEnter.InExcel2007namescanbemanagedviaFormulas>NameManager.Ifyouprefernottousenamestheninwhatfollowssimplyreplacethenameoftherange,e.g.x,bytherangeaddress,e.g.A2:A4.

DescriptiveStatisticsAssumingasampleofdatainrangex

Sampletotal,Σx =SUM(x)Samplesize,n =COUNT(x)Samplemean,Σx/n =AVERAGE(x)

Samplevariance,s2 =VAR(x)Samplestandarddeviation,s =STDEV(x)Meansquareddeviation =VARP(x)

Rootmeansquareddeviation =STDEVP(x)Correctedsumofsquares,Sxx =DEVSQ(x)Rawsumofsquares,Σx2 =SUMSQ(x)

Minimumvalue =MIN(x)Maximumvalue =MAX(x)

Range =MAX(x)‐MIN(x)LowerQuartile,Q1* =QUARTILE(x,1)Median,Q2 =MEDIAN(x)

UpperQuartile,Q3* =QUARTILE(x,3)Interquartilerange,IQR =QUARTILE(x,3)‐QUARTILE(x,1)KthPercentile =PERCENTILE(x,K%) whereKisanumberbetween0and100

Mode =MODE(x)*Note:Thereareseveraldifferentdefinitionsfortheupperandlowerquartiles,sothevaluescalculatedby

Excelmaynotagreewithyourtextbookorotherstatisticalcalculationtools.Boxplot Seehttp://www.coventry.ac.uk/ec/~nhunt/boxplot.htm

GroupedFrequencyDataAssumingafrequencydistributionwithclassmidpointsstoredinrangexandfrequenciesinrangef:

Samplesize,n =SUM(f)Sampletotal,Σfx =SUMPRODUCT(f,x)

Samplemean,Σfx/n =SUMPRODUCT(f,x)/SUM(f)

Correctedsumofsquares,Sxx =SUMPRODUCT(f,x,x)‐SUMPRODUCT(f,x)^2/SUM(f)Samplevariance,s2 =(SUMPRODUCT(f,x,x)‐SUMPRODUCT(f,x)^2/SUM(f))/(SUM(f)‐1)

Samplestandarddeviation,s =SQRT(Samplevariance)

GraphicalRepresentationsExceloffersawiderangeofcharttypesfordisplayingdata.Manyoftheseareover‐elaborate.In

particular,3‐Deffectscanbemisleadingandshouldbeavoided.

InExcel2007toconstructachartforyourdata:1. Selecttherangecontainingyourdata,includinganyroworcolumnlabels.2. Onthemainribbon,clickontheInserttab.

3. UndertheChartsgroupoficons,selectthecharttyperequired,thenthepreferredchartsubtype.4. UnderChartToolsonthemainribbon,usetheDesign,LayoutandFormattabstocustomisethechart.

InearlierversionsofExcel,selectthedatarangeandthenInsert>CharttoinvoketheChartWizard.

PermutationsandCombinationsNumberofdifferentcombinationsofmobjectsselectedfromnobjectsnCm =COMBIN(n,m)

NumberofdifferentpermutationsofmobjectsselectedfromnobjectsnPm =PERMUT(n,m)

StandardProbabilityDistributionsAssumingarandomvariableXandconstantsaandbBinomial Bin(n,p)P(X=a) =BINOMDIST(a,n,p,FALSE)P(X≤a) =BINOMDIST(a,n,p,TRUE)

Geometric Geom(p)P(X=a) =BINOMDIST(1,a,p,FALSE)/aP(X≤a) =1‐BINOMDIST(0,a,p,FALSE)

Poisson Po(λ )P(X=a) =POISSON(a,lambda,FALSE)P(X≤a) =POISSON(a,lambda,TRUE)

Pascal Pasc(n,p)P(X=a) =NEGBINOMDIST(a‐n,n,p)P(X≤a) =BETADIST(p,n,a‐n+1)/BETADIST(1,n,a‐n+1)

Normal N(µ ,σ2)f(a) =NORMDIST(a,mu,sigma,FALSE)P(X≤a) =NORMDIST(a,mu,sigma,TRUE)P(a≤X≤b) =NORMDIST(b,mu,sigma,TRUE) ‐NORMDIST(a,mu,sigma,TRUE)P(X≥b) =1‐NORMDIST(b,mu,sigma,TRUE)

Exponential Expon(θ )f(a) =EXPONDIST(a,theta,FALSE)P(X≤a) =EXPONDIST(a,theta,TRUE)P(a≤X≤b) =EXP(‐a*theta)‐EXP(‐b*theta)P(X≥b) =EXP(‐b*theta)

Gamma Ga(α ,β )f(a) =GAMMADIST(a,alpha,beta,FALSE)P(X≤a) =GAMMADIST(a,alpha,beta,TRUE)P(a≤X≤b) =GAMMADIST(b,alpha,beta,TRUE) ‐GAMMADIST(a,alpha,beta,TRUE)P(X≥b) =1‐GAMMADIST(b,alpha,beta,TRUE)

TestStatisticsforPopularSignificanceTestsOnesampletestofameanAssumingasampleofdatainrangex,drawnfromapopulationwithmeanµandstandarddeviationσ:

H0:µ=µ0H1:µ≠µ0Teststatistic,z =(AVERAGE(x)‐mu0)/(sigma/SQRT(COUNT(x))) assumingσknownTeststatistic,t =(AVERAGE(x)‐mu0)/(STDEV(x)/SQRT(COUNT(x))) assumingσunknownOnesampletestofavarianceAssumingasampleofdatainrangex,drawnfromapopulationwithmeanµandstandarddeviationσ:

H0:σ2=σ02H1:σ2> σ0

2Teststatistic,χ2 =DEVSQ(x)/sigma0^2TwosampletestofdifferencebetweenmeansAssumingtwosamplesofdatainrangesxandy,drawnfrompopulationswithmeansµ1andµ2andequal

variances:H0:µ1‐µ2=cH1:µ1‐µ2≠cEstimatetheunknowncommonstandarddeviationbythepooledestimate:

s =SQRT((DEVSQ(x)+DEVSQ(y))/(COUNT(x)+COUNT(y)‐2))Teststatistic,t =(AVERAGE(x)‐AVERAGE(y)‐c)/(s*SQRT(1/COUNT(x)+1/COUNT(y)))TwosampletestofratioofvariancesAssumingtwosamplesofdatainrangesxandy,drawnfrompopulationswithvariancesσ1

2andσ22:

H0:σ12=σ2

2H1:σ12>σ2

2Teststatistic,F =VAR(x)/VAR(y)Chi‐squaredtestofassociationAssumingatwo‐waycontingencytableofobservedfrequencies.H0:rowfactorindependentofcolumnfactorH1:someassociationbetweenrowandcolumnfactorsThesuggestedlayoutbelowfora4x2tablecaneasilybemodifiedfortablesofothersizes.

A1: =SUM(C3:D6)A3: =SUM(C3:D3) copydowntoA6C1: =SUM(C3:C6) copyacrosstoD1G3: =$A3*C$1/$A$1 copyintoG3:H6C8: =CHITEST(C3:D6,G3:H6)C9: =(COUNT(A3:A6)‐1)*(COUNT(C1:D1)‐1)C10: =CHIINV(C8,C9)

CriticalValuesandP‐valuesforStatisticalTestsTherearetwoapproachestoconductingsignificancetests.Someanalystsliketocomparetheteststatisticwiththecriticalvalueforagivensignificancelevel;othersprefertocalculatetheP‐valuecorrespondingto

theteststatistic.Excelcanbeusedforeithermethod.Assumingsignificancelevelα,(typicallyα=5%or0.05):

Two‐tailedz‐testUppertailcriticalvalue =NORMSINV(1‐alpha/2)P‐valueforgivenz =2*(1‐NORMSDIST(ABS(z)))Two‐tailedt‐testwithvdegreesoffreedomUppertailcriticalvalue =TINV(alpha,v)P‐valueforgivent =TDIST(ABS(t),v,2)

One‐tailedχ2‐testwithvdegreesoffreedomUppertailcriticalvalue =CHIINV(alpha,v)P‐valueforgivenchisquared=CHIDIST(chisquared,v)One‐tailedF‐testwithv1degreesoffreedominthenumeratorandv2inthedenominatorUppertailcriticalvalue =FINV(alpha,v1,v2)P‐valueforgivenF =FDIST(F,v1,v2)

ConfidenceLimitsAssumingdegreeofconfidence100(1‐α)% (e.g.for95%confidenceα=0.05):

One‐samplestatistics,withdatainrangex

For µ (σ known) Lowerlimit=AVERAGE(x)‐NORMSINV(1‐alpha/2)*sigma/SQRT(COUNT(x))

or =AVERAGE(x)‐CONFIDENCE(alpha,sigma,COUNT(x))

Upperlimit=AVERAGE(x)+NORMSINV(1‐alpha/2)*sigma/SQRT(COUNT(x))or =AVERAGE(x)+CONFIDENCE(alpha,sigma,COUNT(x))

For µ (σ unknown) Lowerlimit=AVERAGE(x)‐TINV(alpha, COUNT(x)‐1)*STDEV(x)/SQRT(COUNT(x))

Upperlimit=AVERAGE(x)+TINV(alpha, COUNT(x)‐1)*STDEV(x)/SQRT(COUNT(x))

For σ2 Lowerlimit=(DEVSQ(x)/CHIINV(alpha/2,COUNT(x))‐1)

Upperlimit=(DEVSQ(x)/CHIINV(1‐alpha/2,COUNT(x))‐1)

Two‐samplestatistics,withdataforthefirstsampleinrangex,andthesecondsampleinrangey

For µ x‐µ y (σ xknown,σ y known)

Lowerlimit=AVERAGE(x)‐AVERAGE(y)‐NORMSINV(1‐alpha/2)*SQRT(sigmax^2/COUNT(x)+sigmay^2/COUNT(y))

Upperlimit=AVERAGE(x)‐AVERAGE(y)+NORMSINV(1‐alpha/2)*SQRT(sigmax^2/COUNT(x)+sigmay^2/COUNT(y))

For µ x‐µ y (σ xandσ y unknownbutassumedequal)

Estimatetheunknowncommonstandarddeviationbythepooledestimate:s =SQRT((DEVSQ(x)+DEVSQ(y))/(COUNT(x)+COUNT(y)‐2))

Lowerlimit=AVERAGE(x)‐AVERAGE(y)‐TINV(alpha,COUNT(x)+COUNT(y)‐2)*s*SQRT(1/COUNT(x)+1/COUNT(y))

Upperlimit=AVERAGE(x)‐AVERAGE(y)+TINV(alpha,COUNT(x)+COUNT(y)‐2)*s*SQRT(1/COUNT(x)+1/COUNT(y))

For σ x2/σ y

2

Lowerlimit=DEVSQ(x)/DEVSQ(y)/FINV(alpha/2,COUNT(x)‐1,COUNT(y)‐1)

Upperlimit(DEVSQ(x)/DEVSQ(y)/FINV(1‐alpha/2,COUNT(x)‐1,COUNT(y)‐1)

SimpleLinearRegressionInExcelVersions5andabove,aregressionline(ortrendline)canbeaddedtoascatterplotbyright‐clicking

ononeoftheplottedpointsandselectingAddTrendlinefromtheshortcutmenu.Bothlinearandavarietyofnon‐linearmodelsmaybefittedtothedata.Theequationofthefittedmodelmaybedisplayed,togetherwiththevalueofthecoefficientofdetermination,R2.Therearealsooptionstoextrapolatethe

trendlineineitherdirection,ortoforcethetrendlinetohaveaspecificintercept.

Thetrendlineapproachispurelygraphical.Tocalculatepredictions,regressionfunctionsmustbeused.

Assumingasampleofvaluesoftheindependentvariableinrangex,andcorrespondingvaluesofthe

dependentvariableinrangey:

Leastsquaresestimateofintercept,a =INTERCEPT(y,x)Leastsquaresestimateofslope,b =SLOPE(y,x)

Sxy =SUMPRODUCT(x,y)‐COUNT(x)*AVERAGE(x)*AVERAGE(y)Sxx =DEVSQ(x)Syy =DEVSQ(y)

Samplecovariance,Cov(x,y) =COVAR(x,y)*COUNT(x)/(COUNT(x)‐1)Estimateofσ,s =STEYX(y,x)

Predictionofyatx=x0,ŷ=a+bx0 =FORECAST(x0,y,x)Estimatedstandarderrorofindividualpredictedyatx=x0

=STEYX(y,x)*SQRT(1+1/COUNT(x)+(x0‐AVERAGE(x))^2/DEVSQ(x))Estimatedstandarderrorofmeanpredictedyatx=x0 =STEYX(y,x)*SQRT(1/COUNT(x)+(x0‐AVERAGE(x))^2/DEVSQ(x))

CorrelationAssumingtwosamplesofpaireddatainrangesxandy:

Pearsonproductmomentcorrelationcoefficient,r =CORREL(x,y)

RankCorrelationAssumingtwosamplesofpaireddatainrangesxandywithnoties:Rankofithvalueinrangex =RANK(INDEX(x,i),x,1)Assumingtwosamplesofpaireddatainrangesxandywithsometiedvalues:Rankofithvalueinrangex =(RANK(INDEX(x,i),x,1)‐RANK(INDEX(x,i),x,0)+COUNT(x)+1)/2Assumingthattherangesrxandrycontaintheranksofthedatainxandyrespectively:Spearmanrankcorrelationcoefficient,rS=CORREL(rx,ry)

Intheexampleabove:

D2: =RANK(B2,$B$2:$B$7,1) copydowntoD7E2: =RANK(B2,$B$2:$B$7,0) copydowntoE7F2: =(D2‐E2+COUNT($B$2:$B$7)+1)/2 copydowntoF7F9: =CORREL(C2:C7,F2:F7) adjustedforties

TimeSeriesTheexamplesbelowrefertothreeyearsofobservedquarterlydata.Forecastsaremadeforafurtherfourquarters(oneextrayear).

Levelonly

Simplemovingaverageperiod5C4: =AVERAGE(B2:B6) copydowntoC11C14: =C$11 copydowntoC17

Centredmovingaverageperiod4D4: =(AVERAGE(B2:B5)+AVERAGE(B3:B6))/2 copydowntoD11D14: =D$11 copydowntoD17

ExponentiallyweightedmovingaverageE2: =B2 initiallevelestimateE3: =$G2*B3+(1‐$G2)*E2 copydowntoE13E14: =E$13 copydowntoE17ThechartwasdrawnbyhighlightingB1:B17andE1:E17thenusingInsert>Charts>Line>2‐DLine.

Levelandconstanttrend

C2: =FORECAST(A2,$B$2:$B$13,$A$2:$A$13) copydowntoC17

Levelandchangingtrend

C2: =B2 initiallevelestimateC3: =$F2*B3+(1‐$F2)*(C2+D2) copydowntoC13D2: =B3‐B2 initialtrendestimateD3: =$G2*(C3‐C2)+(1‐$G2)*D2 copydowntoD13

E3: =C2+D2 copydowntoE13E14: =C$13+(A14‐A$13)*D$13 copydowntoE17

Level,changingtrendandseasonality

C5: =AVERAGE(B2:B5) initiallevelestimateC6: =G$2*B6/E2+(1‐G$2)*(C5+D5) copydowntoC13D5: =(AVERAGE(B6:B9)‐C5)/4 initialtrendestimateD6: =H$2*(C6‐C5)+(1‐H$2)*D5 copydowntoD13

E2: =B2/C$5 copydowntoE5,initialseasonalestimatesE6: =I$2*B6/C6+(1‐I$2)*E2 copydowntoE13F6: =(C5+D5)*E2 copydowntoF13

F14: =(C$13+(A14‐A$13)*D$13)*E10 copydowntoF17

Version9,9June2009