Introduction to Probability and Statistics Nagarajan Krishnamurthy Introduction to Business Statistics for EPGP 2015-16 batch Indian Institute of Management Indore Thanks to Prof. Arun Kumar and Prof. Ravindra Gokhale, co-instructors of QT1, AY 2012-13

Upload: sourav-sharma

Post on 15-Sep-2015

  • Introduction to Probability and Statistics

    Nagarajan Krishnamurthy

    Introduction to Business Statistics for EPGP 2015-16 batch, Indian Institute of Management Indore

    Thanks to Prof. Arun Kumar and Prof. Ravindra Gokhale, co-instructors of QT1, AY 2012-13

  • Part 1: Summarizing and Visualizing a Data Set

  • Types of Data

    Quantitative Data: Data for which arithmetic operations make sense. E.g.: Age, Salary, Length.

    Categorical Data: Data obtained by putting individuals in different categories. E.g.: Gender, States of a country.

  • Visualization

    Quantitative Data: Histogram, Stem-Leaf plot, Box plot

    Categorical Data: Pie Chart, Bar chart

    *Discuss Cafe data (using Excel)

  • Interpreting a Histogram

    Shape: symmetric, skewed; unimodal, bimodal, ...; leptokurtic, platykurtic, mesokurtic

    Center: mean, median

    Spread: range, standard deviation, inter-quartile range

  • Measure of the central tendency of a data set

    Mean: If we have a data set $x_1, \ldots, x_n$, then the mean of the data set is $\frac{x_1 + \cdots + x_n}{n}$.

    Notation: $\bar{x}$

  • Mean: Example

    The mean of 0,5,1,1,3 is 2.

  • Measure of the Central Tendency of a Data Set

    Median: Middle number in a sorted data set. When the number of observations (sample size) is even, there are two middle numbers. In that case, we take the average of the two middle numbers to obtain the median.

    Notation: $\tilde{x}$

  • Median: Example 1

    For example, the median of 0,5,1,1,3 is 1 because 1 is the middle number of the sorted data, i.e. 0,1,1,3,5.

  • Median: Example 2

    The median of 3,2,5,6,4,5,3,5 is 4.5 because 4.5 is the average of the two middle numbers of the sorted data, i.e. 2,3,3,4,5,5,5,6.

  • Measure of the Central Tendency of a Data Set

    Mode: Observation in the data set with the largest frequency. Note that we can have more than one mode for a data set.

  • Mode: Example

    For example the mode of 0,5,1,1,3 is 1.

  • Effect of an Outlier

    Calculate mean, median, and mode of 0,5,1,1,3,100.

    mean=18.33, median=2, mode=1.

  • Effect of an Outlier

    An outlier pulls the mean towards itself but may not affect the median and mode.
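
    The comparison above can be reproduced with a minimal Python sketch. It uses the standard `statistics` module; the helper name `summarize` is only an illustration and does not come from the slides.

```python
import statistics

def summarize(values):
    """Return (mean, median, mode) for a list of numbers."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.mode(values),  # first of the most frequent values
    )

data = [0, 5, 1, 1, 3]
with_outlier = data + [100]  # same data plus the outlier 100

mean, median, mode = summarize(data)
print(f"without outlier: mean={mean:.2f}, median={median}, mode={mode}")

mean_o, median_o, mode_o = summarize(with_outlier)
print(f"with outlier:    mean={mean_o:.2f}, median={median_o}, mode={mode_o}")
```

    Adding 100 moves the mean from 2 to about 18.33, while the median only moves from 1 to 2 and the mode stays at 1.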

  • Identifying Relation Between Mean and Median from Histogram

    Symmetric: mean $\approx$ median

    Left skewed: Mean < Median < Mode (in general)

    Right skewed: Mean > Median > Mode (in general)

  • Kurtosis

    Leptokurtic

    Platykurtic

    Mesokurtic

  • Measure of the Spread of a Data Set

    Range: max-min

    Ex: 0,5,1,1,3; what is the range?

    Range = 5 - 0 = 5.

  • Measure of the Spread of a Data Set

    Variance: $\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

    Standard deviation: $\sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$

  • Variance and Standard Deviation: Example

    What are the variance and the standard deviation of 3,3,3,3,3?

    Ans. variance = 0, standard deviation = 0

    What are the variance and the standard deviation of 1,2,3,4,5?

    Ans. variance = 2.5, standard deviation = 1.58
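
    These answers can be checked with a short Python sketch that computes the sample variance directly, dividing by n - 1. The function names `sample_variance` and `sample_std` are illustrative choices, not part of the course material.

```python
import math

def sample_variance(values):
    """Sample variance: sum of squared deviations divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_std(values):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(values))

for data in ([3, 3, 3, 3, 3], [1, 2, 3, 4, 5], [3, 3, 3, 3, 100]):
    print(data, round(sample_variance(data), 2), round(sample_std(data), 2))
```

    The third data set adds a single outlier to show how strongly one extreme value inflates the standard deviation. The standard library functions `statistics.variance` and `statistics.stdev` use the same n - 1 denominator.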

  • Standard Deviation

    Standard deviation is always greater than or equal to zero.

  • Does Standard Deviation Get Affected by Outliers?

    What is the standard deviation for the data 3,3,3,3,100?

    Ans. 43.38

  • Is Standard Deviation Always a Good Measure of the Spread of a Data Set?

    Not a good measure when data is skewed or has outliers.

  • Quartiles

    First quartile: 25th percentile

    Notation: Q1

  • Quartiles

    Third quartile: 75th percentile

    Notation: Q3

  • Exercise

    Find the first and third quartile of 8,7,1,4,6,6,4,5,7,6,3,0.

    Ans. The sorted data is 0,1,3,4,4,5,6,6,6,7,7,8. The median of the lower half of the data (0,1,3,4,4,5) is 3.5 (Q1) and the median of the upper half of the data (6,6,6,7,7,8) is 6.5 (Q3).

  • Quartiles

    Median is the second quartile (Q2).

  • Measure of the Spread of a Data Set

    Inter Quartile Range (IQR): Q3 - Q1

    *IQR is a robust measure of spread. IQR does not get affected much by skewness or outliers.

  • Exercise

    Find IQR of 8,7,1,4,6,6,4,5,7,6,3,0.

    Q3-Q1=6.5-3.5=3.
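
    The quartile and IQR calculation can be sketched in Python using the median-of-halves approach from the exercise. The helper `quartiles` is hypothetical, and leaving out the middle value when the sample size is odd is an assumption, since the slides only show an even-sized example. Library routines such as `statistics.quantiles` interpolate differently and may return slightly different quartiles.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) as medians of the lower half, all data, upper half."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]        # smallest half of the observations
    upper = ordered[n - half:]    # largest half (middle value dropped if n is odd)
    return (
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
    )

data = [8, 7, 1, 4, 6, 6, 4, 5, 7, 6, 3, 0]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
```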

  • Five Number Summary

    Minimum

    First quartile

    Median

    Third quartile

    Maximum

  • Boxplot

    *We will create a box plot for the Cafe data set.

  • Interpreting a Box Plot

    Shape:

    Outliers: Any observation not in the range [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR] is considered an outlier (Informal Rule).
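
    A minimal Python sketch of this informal rule follows, applied to the data set 0,5,1,1,3,100 from the outlier slides. The function names are illustrative, and the quartiles are computed as medians of the lower and upper halves, which is one of several conventions.

```python
import statistics

def outlier_fences(values, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = statistics.median(ordered[:half])       # median of the lower half
    q3 = statistics.median(ordered[n - half:])   # median of the upper half
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, k=1.5):
    """Return the observations that fall outside the fences."""
    low, high = outlier_fences(values, k)
    return [x for x in values if x < low or x > high]

data = [0, 5, 1, 1, 3, 100]
print(outlier_fences(data))
print(find_outliers(data))
```

    Here Q1 = 1 and Q3 = 5, so the fences are -5 and 11, and only the value 100 is flagged.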

  • Why Do We Need Box Plot?

    To compare two or more data sets.

    Visualization of summary statistics.

  • Categorical Data Visualization

    *Bar Chart

    *Pie Chart

    Show billionaires data.

  • Part 2: Introduction to Probability

  • Describing Shape of a Bar Graph

    Proportion of observations in a particular category.

  • Describing Shape of a Histogram

    Proportion of observations in a particular class interval.

  • Probability

    Proportion ↔ sample

    Probability ↔ population

  • Example

    Workforce distribution in the United States.

    Industry                            Probability
    Agriculture                         0.130
    Construction                        0.147
    Finance, Insurance, Real Estate     0.059
    Manufacturing                       0.042
    Mining                              0.002
    Services                            0.419
    Trade                               0.159
    Transportation, Public Utilities    0.042

  • Sample Space

    Def: Set of all possible outcomes.

    E.g.: $\Omega$ = {Agriculture, Construction, . . . , Services, Trade, Transportation and Public Utilities}

  • Simple Events

    Simple event: An event in the finest partition of the sample space.

    Example: $\omega_1$ = Agriculture, $\omega_2$ = Construction.

  • Event

    Def: Any subset of the sample space

    E.g.: {Agriculture, Construction}

  • Exercise

    A bowl contains three red and two yellow balls. Two balls are randomly selected and their colors recorded. Use a tree diagram to list the 20 simple events in the experiment, keeping in mind the order in which the balls are drawn.
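
    The count of 20 can be checked by enumeration. The sketch below assumes the five balls are treated as distinguishable (the labels R1, R2, R3, Y1, Y2 are made up for illustration) and that the two balls are drawn without replacement.

```python
from collections import Counter
from itertools import permutations

# Hypothetical labels: three red balls and two yellow balls.
balls = ["R1", "R2", "R3", "Y1", "Y2"]

# Ordered draws of two different balls: 5 * 4 = 20 simple events.
simple_events = list(permutations(balls, 2))

# Group the simple events by the recorded colour sequence, e.g. "RY".
color_counts = Counter(first[0] + second[0] for first, second in simple_events)

print(len(simple_events))
print(color_counts)
```

    Grouping by colour shows that the 20 equally likely simple events split into 6 red-red, 6 red-yellow, 6 yellow-red, and 2 yellow-yellow outcomes.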

  • Other Approaches for Calculating Probabilities

    Classical Approach: Assuming all outcomes to be equally likely, the probability of an event is the number of favourable outcomes divided by the total number of outcomes. E.g. rolling a die.

    Subjective Approach: Assigning probability to an event based on one's experience.

  • Probability

    P(Agriculture) = 0.13

    P(Either Agriculture or Construction or both) = $P(\text{Agriculture} \cup \text{Construction})$ = 0.13 + 0.147 = 0.277.

    P(Agriculture and Construction) = $P(\text{Agriculture} \cap \text{Construction})$ = 0.

    P(Not in Agriculture) = $P(\text{Agriculture}^c)$ = 1 - 0.13 = 0.87.
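
    These calculations can be sketched in Python by treating an event as a set of industries and adding the probabilities of its simple events. The dictionary `workforce` copies the table from the example; the helper `prob` is an illustrative name.

```python
# Workforce distribution from the example slide.
workforce = {
    "Agriculture": 0.130,
    "Construction": 0.147,
    "Finance, Insurance, Real Estate": 0.059,
    "Manufacturing": 0.042,
    "Mining": 0.002,
    "Services": 0.419,
    "Trade": 0.159,
    "Transportation, Public Utilities": 0.042,
}

def prob(event):
    """Probability of an event given as a set of industries."""
    return sum(workforce[industry] for industry in event)

agriculture = {"Agriculture"}
construction = {"Construction"}

p_union = prob(agriculture | construction)   # either one or both
p_both = prob(agriculture & construction)    # empty set, so probability 0
p_not_agriculture = 1 - prob(agriculture)    # complement rule

print(round(p_union, 3), p_both, round(p_not_agriculture, 3))
```

    Because each worker belongs to exactly one industry, the intersection of two different industries is the empty set.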

  • Compound Events

    If A and B are two events then

    Union event is $A \cup B$

    Intersection event is $A \cap B$

    Complement event is $A^c$

  • Venn Diagram Representation

    [Figure: Venn diagrams in a sample space S showing the intersection $A \cap B$, disjoint events A and B, the union $A \cup B$, and mutually exclusive and exhaustive events A, B, C, and D]

  • Probability Rules

    1. $P(A \cup B) = P(A) + P(B) - P(A \cap B)$

    2. $P(A^c) = 1 - P(A)$

  • Mutually Exclusive

    Def: Two events are mutually exclusive if they do not have any common outcome.

    E.g.: Agriculture and Construction are mutually exclusive events.

  • Mutually Exclusive

    If A and B are mutually exclusive, then $P(A \cap B) = 0$.

    This implies that for mutually exclusive events A and B, $P(A \cup B) = P(A) + P(B)$.

  • Pizza Venn Diagram

  • What is the sample space?

    Sample space = {Tomato only, Fish only, Mushroom-Tomato, Mushroom-Tomato-Fish, Mushroom-Fish, No toppings}.

  • Probability of the events in the sample space

    P(Tomato only) = 2/8; P(Fish only) = 1/8.

    P(Mushroom-Tomato) = 2/8 = 1/4; P(Mushroom-Tomato-Fish) = 1/8.

    P(Mushroom-Fish) = 1/8; P(No toppings) = 1/8.

  • Union Rule

    What is the probability that your slice will have tomato or mushroom?

    Ans. 6/8=3/4

  • Intersection Rule

    What is the probability that your slice will have tomato and mushroom?

    Ans. 3/8

  • Complement Rule

    What is the probability that your slice will not have tomato?

    Ans. 3/8

  • Conditional Probability

    You have pulled out a slice of pizza that has tomato on it. What is the probability that your slice will have mushrooms?

    Ans. 3/5.
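
    The pizza answers can be reproduced with a small Python sketch. It assumes the pizza has 8 equally likely slices, which is how the probabilities k/8 on the slides are read here; the names `slices` and `prob` are illustrative.

```python
from fractions import Fraction

# Number of slices (out of 8) carrying each combination of toppings.
slices = {
    frozenset({"tomato"}): 2,
    frozenset({"fish"}): 1,
    frozenset({"mushroom", "tomato"}): 2,
    frozenset({"mushroom", "tomato", "fish"}): 1,
    frozenset({"mushroom", "fish"}): 1,
    frozenset(): 1,  # no toppings
}
total = sum(slices.values())

def prob(condition):
    """Probability that a randomly chosen slice satisfies the condition."""
    favourable = sum(n for toppings, n in slices.items() if condition(toppings))
    return Fraction(favourable, total)

p_union = prob(lambda t: "tomato" in t or "mushroom" in t)
p_both = prob(lambda t: "tomato" in t and "mushroom" in t)
p_no_tomato = prob(lambda t: "tomato" not in t)
p_tomato = prob(lambda t: "tomato" in t)

# Conditional probability: restrict attention to the slices with tomato.
p_mushroom_given_tomato = p_both / p_tomato

print(p_union, p_both, p_no_tomato, p_mushroom_given_tomato)
```

    The conditional probability is the share of tomato slices (5 of 8) that also carry mushroom (3 of 8), which gives 3/5.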

  • Conditional Probability

    Def: Probability of event A within event B. That is, the probability that event A occurs given that B occurs.

    Notation: P(A|B)

  • Multiplication rule

    $P(A \cap B) = P(A) P(B|A)$

    $P(A \cap B) = P(B) P(A|B)$

  • Statistical Independence

    Two events are said to be independent if the occurrence of one has no effect on the chance of occurrence of the other.

  • Statistical Independence

    Two events A and B are considered independent when P(A|B) = P(A).

  • Exercise 1

    Is gender related to whether someone voted in the last mayoral election? Answer the question using the joint probabilities given in the table below.

                                          Gender
    Voted in the last mayoral election    Female    Male
    Yes                                   0.25      0.18
    No                                    0.33      0.24

  • Statistical Independence

    If two events A and B are independent then

    1. $P(A \cap B) = P(A) P(B)$
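
    One way to apply this product rule to the voting table of Exercise 1 is sketched below in Python: compute the marginal probabilities and compare each joint probability with the product of its marginals. The names `joint` and `marginal` are illustrative, and how small a gap counts as "independent" is a judgment call, since the table entries are rounded to two decimals.

```python
# Joint probabilities from the Exercise 1 table: (voted, gender) -> probability.
joint = {
    ("Yes", "Female"): 0.25,
    ("Yes", "Male"): 0.18,
    ("No", "Female"): 0.33,
    ("No", "Male"): 0.24,
}

def marginal(index, value):
    """Marginal probability: index 0 is voting status, index 1 is gender."""
    return sum(p for key, p in joint.items() if key[index] == value)

max_gap = 0.0
for (voted, gender), p in joint.items():
    product = marginal(0, voted) * marginal(1, gender)
    max_gap = max(max_gap, abs(p - product))
    print(f"{voted:>3} {gender:<6} joint={p:.4f} product={product:.4f}")

print(f"largest gap: {max_gap:.4f}")
```

    Every joint probability is within about 0.001 of the product of its marginals, so at the precision of the table the two variables behave as (approximately) independent.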

  • Law of Total Probability

    Given a set of events $S_1, S_2, \ldots, S_k$ that are mutually exclusive and exhaustive, and an event A, the probability of the event A can be expressed as

    $P(A) = P(S_1) P(A|S_1) + P(S_2) P(A|S_2) + P(S_3) P(A|S_3) + \cdots + P(S_k) P(A|S_k)$

  • Exercise 2

    A business group owns three five-star hotels (say, A, B, and C) in India. By studying the past behavior of the revenue obtained from the three hotels month by month, it has been observed that the probability of increase in revenue of either B or C or both of them is 0.5. If A's revenue increases in a given month, the probability of increase in B's revenue is 0.7, the probability of increase in C's revenue is 0.6, and the probability of increase in both B and C's revenue is 0.5. However, if A's revenue does not increase in a given month, the probability of increase in B's revenue is 0.2, the probability of increase in C's revenue is 0.3, and the probability of increase in both B and C's revenue is 0.1. What is the probability that the revenue of all the three hotels, A, B, and C, increases in a given month?
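
    One possible way to set up this exercise is sketched below in Python, using exact fractions. It is an illustration under the reading that all stated probabilities refer to the same month; the variable names are made up. The union rule gives the probability of "B or C" under each condition on A, the law of total probability is then solved for P(A), and the multiplication rule gives the final answer.

```python
from fractions import Fraction as F

p_b_or_c = F("0.5")  # P(B or C), unconditional

# Conditional probabilities given that A's revenue increases.
p_b_a, p_c_a, p_bc_a = F("0.7"), F("0.6"), F("0.5")
# Conditional probabilities given that A's revenue does not increase.
p_b_na, p_c_na, p_bc_na = F("0.2"), F("0.3"), F("0.1")

# Union rule applied within each condition.
p_union_a = p_b_a + p_c_a - p_bc_a        # P(B or C | A)
p_union_na = p_b_na + p_c_na - p_bc_na    # P(B or C | not A)

# Law of total probability:
#   P(B or C) = P(A) * P(B or C | A) + (1 - P(A)) * P(B or C | not A)
# solved for P(A).
p_a = (p_b_or_c - p_union_na) / (p_union_a - p_union_na)

# Multiplication rule: P(A and B and C) = P(A) * P(B and C | A).
p_abc = p_a * p_bc_a

print(p_a, p_abc)
```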

  • Exercise 3

    You are a physician. You think it is quite likely that one of your patients has strep throat, but you are not sure. You take some swabs from the throat and send them to a lab for testing. The test is (like nearly all lab tests) not perfect. If the patient has strep throat, then 70% of the time the lab says YES but 30% of the time it says NO. If the patient does not have strep throat, then 90% of the time the lab says NO but 10% of the time it says YES. You send five successive swabs to the lab, from the same patient. You get back these results, in order: YNYNY. What do you conclude?

    These results are worthless.

    It is likely that the patient does not have strep throat.

    It is slightly more likely than not that the patient does have strep throat.

    It is very much more likely than not that the patient does have strep throat.

  • Bayes Rule

    Let $S_1, S_2, \ldots, S_k$ represent k mutually exclusive and exhaustive sub-populations with prior probabilities $P(S_1), P(S_2), \ldots, P(S_k)$. If an event A occurs, the posterior probability of $S_i$ given A is the conditional probability

    $P(S_i|A) = \frac{P(S_i) P(A|S_i)}{\sum_{j=1}^{k} P(S_j) P(A|S_j)}$
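
    Bayes' rule can be sketched in Python and applied to the strep throat exercise. Two assumptions are not in the source: the five swab results are treated as independent given the patient's true condition, and the prior probability of strep throat is set to 0.5 purely for illustration (the exercise only says "quite likely"). The function names are hypothetical.

```python
def posterior(priors, likelihoods):
    """Bayes' rule: posterior P(S_i | A) for each sub-population S_i."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)  # law of total probability gives P(A)
    return [j / total for j in joint]

def sequence_likelihood(results, p_yes):
    """Probability of a Y/N sequence, assuming independent swabs."""
    prob = 1.0
    for r in results:
        prob *= p_yes if r == "Y" else 1 - p_yes
    return prob

results = "YNYNY"
like_strep = sequence_likelihood(results, 0.7)      # lab says YES 70% of the time
like_no_strep = sequence_likelihood(results, 0.1)   # lab says YES 10% of the time

prior_strep = 0.5  # hypothetical prior, not given in the exercise
post_strep, post_no_strep = posterior(
    [prior_strep, 1 - prior_strep],
    [like_strep, like_no_strep],
)

print(f"likelihood ratio: {like_strep / like_no_strep:.1f}")
print(f"posterior P(strep | YNYNY): {post_strep:.3f}")
```

    With these assumptions the observed sequence is about 38 times more likely under strep throat than without it, and the posterior probability is about 0.974. A different prior would change the posterior but not the likelihood ratio.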

  • Exercise

    Strep Throat Exercise

  • Bibliography

    An Introduction to Probability and Inductive Logic, by Ian Hacking

    Introduction to Probability and Statistics, by William Mendenhall, Robert J. Beaver, and Barbara M. Beaver

    Practice of Business Statistics, by David S. Moore, George P. McCabe, William M. Duckworth, and Stanley L. Sclove

    "That Was Venn, This Is Now", by Bradley A. Warner, David Pendergraft, and Timothy Webb, Journal of Statistics Education, Volume 6, Number 1, 1998