6300p2a (1)

Upload: rohanputhalath

Post on 07-Aug-2018

214 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/20/2019 6300p2a (1)

    1/19

    The Normal Distribution

    The normal distribution is the pattern of the distribution of a set of data that follows a bell shaped curve. It isimportant because:

    • Many variables are distributed approximately normally.

    • It is easy to work with normal distribution as many kinds of statistical tests can be derived for it. These tests work

    very well even if the distribution is only approximately normally distributed.

    This distribution is sometimes called the Gaussian distribution in honor of Carl Friedrich Gauss a famousmathematician.

    !roperties of the normal curve are as follows:

    • The curve has a peak at its center and decreases on either side which shows that extreme values are

    infre"uent.

    • !robabilities associated with the normal curve correspond to areas under the curve. The total area under thecurve is # $i.e. #%%&'. The probability of a particular point is % $since no associated area above a point'.

    • Most of the values lie towards the center of the curve with the arithmetic mean median and mode lyin( at

    its very center.

    • The curve is symmetric about a line that extends up from the mean $plotted on the axis' ) so the area above

    the curve on each side of the line is %.*.

    • The probability of deviations from the mean is comparable in either direction as the curve is symmetric.

    • There is a family of distributions+ they are all bell shaped but vary based on the mean and standard

    deviation.

    • The (reater the standard deviation the flatter is the curve.

    *a ,yp test #

  • 8/20/2019 6300p2a (1)

    2/19

    • The second curve seems to have the (reatest standard deviation amon( the three curves.

    • The third curve has the (reatest mean value.

    Standard Normal Distribution

    -very value in a data set can be expressed as a /value $number of standard deviations from the mean'.

    The standard normal distribution is formed usin( these values and it is a particular normal distributionwith µ =0 and σ =1. !robabilities are usually found by convertin( to / values and then usin( a

    standard normal table or appropriate 01 function.

    The t-distribution

    2hen the sample sie is less than 3% and4or the population standard deviation is unknown the t/distribution is used instead of the standard normal distribution.

    The t/distribution is a family of probability distributions+ each separate t/distribution is determined bya parameter called the de(rees of freedom $df'. The value of the de(rees of freedom is based on thesample sie. 5s the sample sie increases the t/distribution approaches the /distribution $the normalcurve'.

    For Discussion

    5 company pays its employees an avera(e wa(e of 6#7.%% an hour with a standard deviation of 67. If

    the wa(es are approximately normally distributed determine the followin(.

    a. the proportion of the workers (ettin( wa(es between 6#% and 6#8 an hour+

     b. the minimum wa(e of the 7.*& of the employees receivin( the lowest salary+

    c. the minimum wa(e of the 7.*& of the employees receivin( the hi(hest salary.

    *a ,yp test 7

    9

     X −

  • 8/20/2019 6300p2a (1)

    3/19

    d. 2hat is the salary ran(e for almost all employees

    Is the Distribution Normal?

    Many inferential procedures assume that the data is normal. To ud(e whether your data is normal trythe followin(.

    #. ;eview the

    8. Consider how the data compares with the -mpirical ;ule>• 5bout B& of the observations within mean D# standard deviation

    • 5t least E*& & of the observations within mean D7 standard deviations

    • 5t least EE& of the observations within mean D3 standard deviations

    The followin( chart shows how the =alary -xperience data relates to the -mpirical ;ule.

    *a ,yp test 3

  • 8/20/2019 6300p2a (1)

    4/19

    *. -xamine a normal probability plot. 5 probability plot shows the relationship between the (ivendata set and the /values of the normal distribution.

    • If all the points fall on a line the distribution is normal.

    • If the relationship is not linear the distributions have different shapes.

    Interpreting Non-linear Patterns for a Probability Plot

    *a ,yp test 8

    The points on this plot form anearly linear pattern whichindicates that the normaldistribution is a reasonablemodel for this data set.

  • 8/20/2019 6300p2a (1)

    5/19

    Introduction to Hypothesis Testing

    5 hypothesis test involves a process of makin( statement$s' $i.e. hypotheses' about the parameter$s' of one or more populations and the use of statistical theory and techni"ues to ud(e the validity of thestatement$s'. It enables us to draw conclusions or make decisions re(ardin( a population parameter

    from a sample statistic.

    Parametric vs Non-parametric Tests

    ,ypothesis test methodolo(ies may be (rouped into two broad cate(ories ) parametric vs. non/ parametric techni"ues.

    • !arametric methods assume that you are testin( the value of one or more population parameters

    sample data is measured on an interval or ratio scale and there is an associated samplin(distribution related to the parameter bein( tested.

    •  on/parametric methods are applied if one or more of the above assumptions does not hold e.(.

    .your data is ordinal.

    !"ample# $ne Sample Hypothesis Test Summary % &T'o- tailed( &  )**+(

    =ome mana(ers believe that the salaries of the employees have chan(ed since the previous yearHssalary of 6BE%%%. To check this view a random sample of #*% employees is obtained and thefollowin( sample statistics are found.

     9 6@7%%%+ n 9 #*% and s9 6##%%%

    5 hypothesis test is conducted to see if this data provides evidence of a chan(e in salary.

    #. The first step in a hypothesis test is a statement of a null hypothesis and an alternate hypothesis.

    • 5 null hypothesis is a statement about the population ) used as a basis for ar(ument / it has not been

     proved. In fact we are typically lookin( to find evidence that disputes the null.

    • The alternate hypothesis is a lo(ical alternative to the null. It is a statement of what the test is set up to

    establish and (enerally a statement of the anticipated outcome of the test  

    In this case the hypothesis would be the followin(.

    • The null hypothesis is:

    ,o: 9 6BE%%% $The mean salary of the population of employees is 6 BE%%%.'

    • The alternate hypothesis is:

    ,a: J 6BE%%% $The mean salary of the population of employees is not 6 BE%%%.'

    *a ,yp test *

  • 8/20/2019 6300p2a (1)

    6/19

    7. 5 test statistic is calculated.

    5 common test statistic is a t/value. In this case we are testin( the value of one population mean

    and the t/value $or sometimes used / a value' is a measure of Khow farL the sample mean is fromthe population mean that is specified in the null hypothesis. That is: we want to know how manystandard deviations $called the standard error' the sample mean is from the population mean.

    For this data we (et the followin( output from 01.

    3.  ow we must make a statistical decision. In this case I am askin( the "uestion:if the null hypothesis is really true is it likely that I would (et a sample mean that is more than 3standard deviations from the mean>

    The decision rule of acceptance or reection of null hypothesis is determined usin( a critical valueor the p-value. This value is a measure of how much evidence we have a(ainst the null hypothesis.It tells you precisely the followin(: the probability that you would obtain these or more extremeresults assumin( that the null hypothesis is true.

    The standard for makin( the decision is called the level of si(nificance and is desi(nated by theletter . $ is the probability of reectin( a true null hypothesis'. This value is set by the researcher.

    Nsually researchers will reect a hypothesis if the p/value is less than 9 %.%*.

    =ometimes researchers will use a stricter cut/off $e.(. 9 %.%#' or sometimes researchers will

    use a more liberal cut/off $e.(. 9 %.#%' The (eneral rule is that a small p/value is evidence a(ainst the null hypothesis while a lar(e p/

    value means little or no evidence a(ainst the null hypothesis.

    =o the sample mean is 3.%# standard deviations from the hypothesied population mean.

    =ince: p 9 %.%%3%3 O %.%* →  reect ,o

    2e are sayin( that / if the population mean is really 6BE%%% it is unlikely that we would (et asample mean of 6@7%%%. The probability that we are reectin( a true null hypothesis is %.%%3

    8. Conclude the startin( salaries have chan(ed. In fact by lookin( at the data we can note that theyhave increased.

    *a ,yp test B

    In -xcel use f$x':

    T.?I=T.7T$meandf'

  • 8/20/2019 6300p2a (1)

    7/19

    If we (et a p/value that is (reater than our α 9 %.%* we must stick with the null ) pendin( more

    evidence to reect. Pbserve that we have not proved ,o we ust canHt reect it as a possibility. ote: In statistics a result is called statistically significant if it is unlikely to have occurred bychance. 5 statistically si(nificant differenceQ simply means there is statistical evidence to say that there

    is a difference+ It does not mean the difference is necessarily lar(e important or si(nificant in the common

    meanin( of the word.

    For Discussion

    #. Reta Corporation is a company has a stable workforce with little turnover. They have been in

     business for *% years. It has more than #%%%% employees.

    The company has always promoted the idea that its employees stay with them for a very lon( time

    and it has used the followin( line in its recruitment  brochures: QThe avera(e tenure of our

    employees is 7% years.Q

    =ince Reta isnSt "uite sure if that statement is still true a random sample of #%% employees is taken

    and the avera(e a(e turns out to be #E years with a standard deviation of 8 years. Can Reta

    continue to make its claim or does it need to make a chan(e>

    a. =tate the hypotheses.

     b. 2hat is the test statistic> $ote that the sample is randomly selected from a population that is assumed

    to be normally distributed and4or we have a lar(e sample'

    c. =pecify the si(nificance level

    d. Calculations: The calculated value is /7.*

    It has p 9 %.%%B7

    ;eect or fail to reect the null>

    e. Conclusion>

    7. 5 bank tests the null hypothesis that the mean a(e of the bankSs mort(a(e holders is e"ual to 8*

    years versus an alternative that the mean a(e is not 8* years. The bankHs reps take a sample and

    *a ,yp test @

    In 01 use:

     P;M.=.?I=T$/7.*T;N-' or

    http://www.referenceforbusiness.com/knowledge/Recruitment.htmlhttp://www.referenceforbusiness.com/knowledge/Recruitment.htmlhttp://www.referenceforbusiness.com/knowledge/Statistical_significance.htmlhttp://www.referenceforbusiness.com/knowledge/Statistical_significance.htmlhttp://www.referenceforbusiness.com/knowledge/Statistical_significance.htmlhttp://www.referenceforbusiness.com/knowledge/Recruitment.html

  • 8/20/2019 6300p2a (1)

    8/19

    calculate a p-value of %.%7%7. =hould you reect the null hypothesis> 5t what level are the results

    si(nificant>

    3. 5n airline company would like to know if the avera(e number of passen(ers on a fli(ht in

     ovember is different than the avera(e number of passen(ers on a fli(ht in ?ecember. The results

    of a hypothesis test are below. 2hat are the appropriate hypotheses> If p O %.%%# would you reect

    the null hypothesis usin( α 9 %.%#.

    t 9 /8.B7 critical value of t O /7.33 p 9 %.%%E

    !rrors in Hypothesis Testing

    =ometimes you will see a reference to ,ypothesis Test errors. They are classified as Type I and Type IIerrors.

     ote: ou set the value of α. The value of β depends on the difference between the hypothesied mean

    and the true value of the population mean. $To reduce β, you could increase the sample sie or increase

    α.'

    For Discussion5 mail/order catalo( claims that customers will receive their product within 8 days of orderin(. 5

    competitor believes that this claim is not accurate $an underestimate'.

    #. =tate the appropriate null and alternative hypotheses to be tested by the competitor.

    7. ?escribe the Type I error for this problem.

    3. ?escribe the Type II error for this problem.

    *a ,yp test

  • 8/20/2019 6300p2a (1)

    9/19

    5=2-;: 5 Type I error would occur if the competitor would reect the null hypothesis when in fact it is true i.e. if the competitor wouldconclude the mean time until the product is received is more than 8 days when in fact it is not.

    5 Type II error would occur if the competitor would not reect the null hypothesis when in fact it is false i.e. if the competitor would not concludethe mean time until the product is received is more than 8 days when in fact it is.

    *a ,yp test E

  • 8/20/2019 6300p2a (1)

    10/19

    ,ssumptions

    The validity of the results of a hypothesis test depend on certain assumptions and the specifichypothesis test should not be use if these assumptions are not satisfied

    $ne Sample Hypothesis Test

    Use to see if the population

    mean is a specified value or

     standard.

    /test 5ssumptions• Nnderlyin( distribution is normal or the C1T can beassumed to hold $There is a note on the C1T at the end ofthis section.'

    • ;andom sample

    • !opulation standard deviation is known $rarely known'

    t/ test 5ssumptions• Nnderlyin( distribution is normal or sample sie is lar(e so

    the C1T can be assumed to hold

    • ;andom sample

    T'o Independent Samples

    Hypothesis Test

    Use to compare two population

    means from independent

     samples

    /test 5ssumptions• Nnderlyin( distribution is normal or the C1T can be

    assumed to hold

    • ;andom U independently selected samples from 7 populations

    • !opulation standard deviation is known

    t/ test 5ssumptions• Nnderlyin( distribution is normal or the C1T can be

    assumed to hold

    • ;andom U independently selected samples from 7

     populations

    • The variability of the measurements in the two populations

    is the same and can be measured by a common variance.$-xcel offers a t/test that does not make this assumption '

    !"ample# T'o-Sample Hypothesis using the t-distribution

    Problem: For the =alary V -xperience data set test to see if there is a difference in salaries foremployees who selection into their current position was from within the company $internal employees'versus who were selected into their position from outside the company $external employees'.

      Null Hypothesis

    ,ο: µΑ = µ Β (or  µΑ - µ< 9 %'  &The salaries are e"ual.'

    ,lternate Hypothesis

    ,a: µ Α≠ µ Β (or  µΑ - µ< J %'  $The salaries are not e"ual.'

    The 01 output is as follows.

    *a ,yp test #%

  • 8/20/2019 6300p2a (1)

    11/19

    Comparin( =alaries $x6#%%%'

    Test Statistic

    7#

    7

    ##.

    %#

    nn s

     x xt 

     p  −

    −−

    =

      = @.**

     The critical value of t with df 9 n# n7 ) 7 9 #8 is #.E@B.

     ote the calculated value of t is (reater than the critical value of t.

    @.* W #.E@B or use: p O %.%%#

    Thus reect the null hypothesis $with p O %.%%#'. Conclude that the salaries are different. -xternalemployees have a (reater salary than internal employees.

    hec. the assumptions of the t-test

     ote that the t/test for the difference in means assumes that the samples were randomly and

    independently selected. ou also assume the populations are normally distributed with e"ualvariances. It is (enerally held that for lar(e samples the t/test is robust $not sensitive' to moderatedepartures from the assumption of normality. 5 non/parametric test is used if you cannot assume e"ualvariances. -xcel offers a t test that does not assume e"ual variances.

    5n examination of the followin( plots is an informal way to check the assumption of normality and e"uavariances. In this case there is no clear contradiction to this assumption.

    *a ,yp test ##

    In 01 use: ?ata4?ata 5nalysis4 t/test:Two/=ample 5ssumin( -"ual Xariances.

    df stands for de(rees of freedom and isrelated to the sample sie.

     p 9 %.%%% is usually reported as p O %.%%#

    The reference distribution is a t/distribution: tn#n7  ) 7

  • 8/20/2019 6300p2a (1)

    12/19

    40 50 60 70 80 90 100 110

    Salary by Origin

    It is possible to use a statistical test to check the assumption of e"ual variances. The F/test for thedifferences in variances should be used to confirm this observation $shown in the followin('. If variancesare not e"ual use t/Test: Two/=ample 5ssumin( Nne"ual Xariances $or use a nonparametric test'.

    T'o-Sample Hypothesis for !/ual 0ariances

    The F/test can be used to check the assumption of e"ual variances.

    The null and alternate hypotheses are as follows.

    ,%: σ12 = σ22 The population variances are e"ual.

    ,a: σ 12 > σ 22 The population variances are not e"ual.

    F = s12/s2

    2

    5ssumin( α = 0.05

     p W %.%*+ thus do not reect ,% and conclude that the variability is the same.

    *a ,yp test #7

  • 8/20/2019 6300p2a (1)

    13/19

    T'o Dependent Samples Hypothesis Test

    This test is used when you take repeated measures from the same individual4 items or you matchindividuals4items by a certain characteristic. our interest is in the difference between the two relatedmeasures.

    The assumption for this test is that the differences between pairs are approximately normally

    distributed. Note that the test is considered to be robust to violations ofnormality,

    !"ample of a Paired Difference t-Test

    5ssume you send your salespeople to a Kcustomer serviceL trainin( workshop. ,as the trainin( madea difference in the number of complaints> ou collect the followin( data. $Test at the %.%# level'.

    For Discussion

    =uppose you want to know if a trainin( pro(ram helped improve participant scores. ou administer thetest to a random sample and enroll them in trainin(. ou then re/administer an alternate form of thetest. ou record the chan(e score.

    • =tate the hypothesis that you could test.

    • If the mean chan(e score was #% points with a standard error of 7 points+ what percenta(e of

    the participants raised their score by at least #7 points>

    *a ,yp test #3

     

  • 8/20/2019 6300p2a (1)

    14/19

     ote:In testin( population means we apply the entral 1imit Theorem $C1T'. The C1T specifies atheoretical distribution that can be thou(ht of as havin( been formulated by the selection of all possible random samples of a fixed sie n and calculatin( the sample mean for each sample.

    The mean of the distribution of sample means is e"ual to the mean of the population from

    which the samples were drawn.

    The variance of the distribution is σ divided by the s"uare root of n. $it is referred to as the

    standard error.'

    5s the sample sie (ets lar(e the distribution of sample means is approximately normally

    distributed re(ardless of the distribution of the population.

    *a ,yp test #8

  • 8/20/2019 6300p2a (1)

    15/19

    $ne 2ay ,nalysis of 0ariance &,N$0,( Summary

    Nse the one/way 5PX5 to compare multiple means $three or more'. It tests for a difference in the population means / comparin( more than 7 (roups.

    • Compare test scores based on three different instructional methods.

    • Compare stren(th of concrete developed usin( B different forms.

    • Compare the yield of beans based on four different varieties.

    • Compare sales volume for five different locations.

    Pne 2ay 5PX51 5ssume the population means of the (roups are e"ual:

    ,o:  µ1 ! µ " ! # µ r $all (roup means are e"ual'

    7. 2rite an alternative to the null hypothesis ) its lo(ical alternative:

    ,a: not all mean are e"ual. $5t least one (roup mean is not e"ual to another.' ote that this test will tell you if there are any differences in the population means+ it does notspecify which means are different. $There are post hoc tests that will check ) available in 01 add/in.'

    3. Complete the analysis usin( one of the followin(:

    • 01 / ?ata4?ata 5nalysis 45PX5 =in(le Factor 

    8. Compare p to 9 %.%*. If pO %.%* then reect the null hypothesis and conclude that there is at least

    one difference.

    *. If there is a si(nificant difference in the means use the Tukey Yramer procedure to determinewhich (roups differ. $ote that there are other possible procedures.'

    B. =ummarie your results.

    5ssumptions

    • r  samples of sie n1 , n2 ,…,nr  samples are independently and randomly selected.

    •  ormality: values in each (roup are normally distributed $the model is robust relative to this

    assumption i.e. modest departures will not adversely affect the results.';eview the boxplot and review previous sections of the material which presented ways to checknormality.

    • Independence of -rror: ;esiduals are independent for each value. ;esiduals are the differences between the observed and predicted value $difference between an observation and the mean of the(roup for that observation'.Check usin( a probability plot of residuals.

    • ,omo(eneity of Xariance: The variance within each population should be e"ual for all populations

     $σ#7 9 σ77 9 Z9 σr 7' ote: This assumption is important to the 5PX5.

    ou can use 1eveneHs test to check.

    *a ,yp test #*

  • 8/20/2019 6300p2a (1)

    16/19

    Initial Investigation

    • Calculate and look at the means for each treatment (roup.

    • Compare the variances for each (roup $are they approximately the same>'

    • -xamine side/by/side dot plot by treatment ) pay attention to the spread to see if

    variances are approximately e"ual. 5lso consider / ?o they overlap>

    !"ample for Discussion

    Nse the =alary4-xperience data set. 5t the %.%* si(nificance level is there a difference in mean salaryamon( the experience (roups of the employees>

    ,i(h 1evel -xperience: employees with #% or more years of experience in thefield

    Moderate -xperience: employees with B to E years of experience in the field

    1ow -xperience: employees with * or less years of experience in the fieldIt is always (ood to look over the data ) the descriptive statistics for the (roups / and the boxplot.

    Salary &"34***( by !"perience

    ;eview the boxplot of salaries by experience. Consider spread and center.

    *a ,yp test #B

  • 8/20/2019 6300p2a (1)

    17/19

    40 50 60 70 80 90 100 110

    Salary by Experience

    The test would be developed as follows.

    1. ,%: μ1 = μ2 = μ35 ,a: not all population means are e"ual

    3. In 01 Nse ?ata4 ?ata 5nalysis4 5PX5: =in(le Factor to obtain the followin( output.

    8. p 9 %.%%%%* O %.%%%* ;eect ,o and conclude there is at least one difference.

    *. Tukey Yramer $developed with an 01 add/on' is a post hoc test. ou have established that at leasttwo (roups differ in salary. Nse this test to determine which experience level (roups differ in

    salary.

    *a ,yp test #@

    In 01 use:5nova =in(le/Factor.

  • 8/20/2019 6300p2a (1)

    18/19

     ote: 1ook at the boxplot to see if the assumption of e"ual variances is viable. ou can test thishypothesis usin( the 1eveneHs Test $available from 01 add/on'.

    1evene6s Test for 0ariances#

    Tests the followin(

    ,o: σ21 = σ2

    2 = σ2

    3  

    ,a: ot all  σ2 j are e"ual  

    ?o not reect ,o with p 9 %.E#7.  

    The followin( chart is shows some of the available alternatives in hypothesis testin(.

    *a ,yp test #

  • 8/20/2019 6300p2a (1)

    19/19

    *a ,yp test #E