Introduction to Probability and Statistics
Nagarajan Krishnamurthy
Introduction to Business Statistics for EPGP 2015-16 batch, Indian Institute of Management Indore
Thanks to Prof. Arun Kumar and Prof. Ravindra Gokhale, co-instructors of QT1, AY 2012-13
-
Part 1: Summarizing and Visualizing a Data Set
-
Types of Data
Quantitative Data: Data for which arithmetic operations make sense. E.g.: Age, Salary, Length.
Categorical Data: Data obtained by putting individuals in different categories. E.g.: Gender, States of a country
-
Visualization
Quantitative Data: Histogram, Stem-Leaf plot, Box plot
Categorical Data: Pie Chart, Bar chart
*Discuss Cafe data (using Excel)
-
Interpreting a Histogram
Shape: symmetric, skewed; unimodal, bimodal, ...; leptokurtic, platykurtic, mesokurtic
Center: mean, median
Spread: range, standard deviation, inter-quartile range
-
Measure of the central tendency of a data set
Mean: If we have a data set $x_1, \dots, x_n$, then the mean of the data set is $\frac{x_1 + \cdots + x_n}{n}$.
Notation: $\bar{x}$
-
Mean: Example
The mean of 0,5,1,1,3 is 2.
-
Measure of the Central Tendency of a Data Set
Median: Middle number in a sorted data set. When the number of observations (sample size) is an even number, there are two middle numbers. In that case, we take the average of the two middle numbers to obtain the median.
Notation: $\tilde{x}$
-
Median: Example 1
For example, the median of 0,5,1,1,3 is 1 because 1 is the middle number of the sorted data, i.e. 0,1,1,3,5.
-
Median: Example 2
The median of 3,2,5,6,4,5,3,5 is 4.5 because 4.5 is the average of the two middle numbers of the sorted data, i.e. 2,3,3,4,5,5,5,6.
-
Measure of the Central Tendency of a Data Set
Mode: Observation in the data set with the largest frequency. Note that we can have more than one mode for a data set.
-
Mode: Example
For example the mode of 0,5,1,1,3 is 1.
-
Effect of an Outlier
Calculate mean, median, and mode of 0,5,1,1,3,100.
mean=18.33, median=2, mode=1.
-
Effect of an Outlier
An outlier pulls the mean towards it but may not affect the median and the mode.
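
The effect of the outlier can be checked numerically. The following is a minimal sketch using Python's standard `statistics` module; the helper name `summarize` is an illustrative choice, not part of the course material.

```python
import statistics

data = [0, 5, 1, 1, 3]
with_outlier = data + [100]  # same data plus the outlier 100


def summarize(values):
    """Return (mean, median, mode) of a list of numbers."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.mode(values),
    )


mean, median, mode = summarize(data)
print("without outlier:", round(mean, 2), median, mode)

mean, median, mode = summarize(with_outlier)
print("with outlier:   ", round(mean, 2), median, mode)
```

The mean moves from 2 to about 18.33, while the median only moves from 1 to 2 and the mode stays at 1.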
-
Identifying the Relation Between Mean and Median from a Histogram
Symmetric: mean ≈ median
Left skewed: Mean < Median < Mode (in general)
Right skewed: Mean > Median > Mode (in general)
-
Kurtosis
Leptokurtic
Platykurtic
Mesokurtic
-
Measure of the Spread of a Data Set
Range: max-min
Ex: 0,5,1,1,3; what is the range?
Range = 5 - 0 = 5.
-
Measure of the Spread of a Data Set
Variance: $\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
Standard deviation: $\sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
-
Variance and Standard Deviation: Example
What are the variance and the standard deviation of 3,3,3,3,3?
Ans. variance = 0, standard deviation = 0
What are the variance and the standard deviation of 1,2,3,4,5?
Ans. variance = 2.5, standard deviation = 1.58
-
Standard Deviation
Standard deviation is always greater than or equal to zero.
-
Does Standard Deviation Get Affected by Outliers?
What is the standard deviation for the data 3,3,3,3,100?
Ans. 43.38
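
The variance and standard deviation answers on the preceding slides can be reproduced with a short sketch. It implements the sample formulas with divisor $n-1$ directly; the function names `sample_variance` and `sample_std` are illustrative.

```python
import math


def sample_variance(values):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_std(values):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(values))


for data in ([3, 3, 3, 3, 3], [1, 2, 3, 4, 5], [3, 3, 3, 3, 100]):
    print(data, round(sample_variance(data), 2), round(sample_std(data), 2))
```

A single observation of 100 raises the standard deviation from 0 to about 43.38, which illustrates how sensitive this measure is to outliers.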
-
Is Standard Deviation Always a Good Measure of the Spread of a Data Set?
Not a good measure when data is skewed or has outliers.
-
Quartiles
First quartile: 25th percentile
Notation: Q1
-
Quartiles
Third quartile: 75th percentile
Notation: Q3
-
Exercise
Find the first and third quartile of 8,7,1,4,6,6,4,5,7,6,3,0.
Ans. The sorted data is 0,1,3,4,4,5,6,6,6,7,7,8. The median of the lower half of the data (0,1,3,4,4,5) is 3.5 (Q1) and the median of the upper half of the data (6,6,6,7,7,8) is 6.5 (Q3).
-
Quartiles
Median is the second quartile (Q2).
-
Measure of the Spread of a Data Set
Inter-Quartile Range (IQR): Q3 - Q1
*IQR is a robust measure of spread. IQR does not get affected much by skewness or outliers.
-
Exercise
Find IQR of 8,7,1,4,6,6,4,5,7,6,3,0.
Q3-Q1=6.5-3.5=3.
-
Five Number Summary
Minimum
First quartile
Median
Third quartile
Maximum
-
Boxplot
*We will create a box plot for the Cafe data set.
-
Interpreting a Box Plot
Shape:
Outliers: Any observation not in the range [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR] is considered an outlier (informal rule).
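
The quartiles, five number summary, and outlier range can be sketched in code using the "median of each half" method from the quartile exercise. This is one convention among several: the sketch assumes that for an odd sample size the middle observation is left out of both halves, and library quantile functions may use other conventions and return slightly different values. All function names are illustrative.

```python
import statistics


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    s = sorted(values)
    half = len(s) // 2
    lower = s[:half]            # lower half of the sorted data
    upper = s[len(s) - half:]   # upper half (skips the middle value when n is odd)
    return statistics.median(lower), statistics.median(s), statistics.median(upper)


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, med, q3 = quartiles(values)
    return min(values), q1, med, q3, max(values)


def outlier_fences(values):
    """Return the range [Q1 - 1.5*IQR, Q3 + 1.5*IQR] of the informal outlier rule."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


data = [8, 7, 1, 4, 6, 6, 4, 5, 7, 6, 3, 0]
print(five_number_summary(data))
low, high = outlier_fences(data)
print([x for x in data if x < low or x > high])  # observations flagged as outliers
```

For the exercise data this gives Q1 = 3.5 and Q3 = 6.5, so the range is [-1, 11] and no observation is flagged.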
-
Why Do We Need Box Plot?
To compare two or more data sets.
Visualization of summary statistics.
-
Categorical Data Visualization
*Bar Chart
*Pie Chart
Show billionaires data.
-
Part 2: Introduction to Probability
-
Describing Shape of a Bar Graph
Proportion of observations in a particular category.
-
Describing Shape of a Histogram
Proportion of observations in a particular class interval.
-
Probability
Proportion: refers to a sample
Probability: refers to the population
-
Example
Workforce distribution in the United States.
Industry                            Probability
Agriculture                         0.130
Construction                        0.147
Finance, Insurance, Real Estate     0.059
Manufacturing                       0.042
Mining                              0.002
Services                            0.419
Trade                               0.159
Transportation, Public Utilities    0.042
-
Sample Space
Def: Set of all possible outcomes.
E.g.: $\Omega$ = {Agriculture, Construction, . . . , Services, Trade, Transportation and Public Utilities}
-
Simple Events
Simple event: An event in the finest partition of the sample space.
Example: $\omega_1$ = Agriculture, $\omega_2$ = Construction.
-
Event
Def: Any subset of the sample space
E.g.: {Agriculture, Construction}
-
Exercise
A bowl contains three red and two yellow balls. Two balls are randomly selected and their colors recorded. Use a tree diagram to list the 20 simple events in the experiment, keeping in mind the order in which the balls are drawn.
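
The 20 simple events can also be enumerated programmatically as a check on the tree diagram. The sketch below assumes the balls are distinguishable and drawn without replacement; the labels R1, R2, R3, Y1, Y2 are hypothetical names introduced for illustration.

```python
from collections import Counter
from itertools import permutations

# Hypothetical labels: three red balls and two yellow balls.
balls = ["R1", "R2", "R3", "Y1", "Y2"]

# Ordered draws of two different balls (without replacement).
simple_events = list(permutations(balls, 2))
print(len(simple_events))

# Group the simple events by the recorded colors (first letter of each label).
color_counts = Counter((first[0], second[0]) for first, second in simple_events)
print(color_counts)
```

Grouping by color shows how the 20 ordered draws split into the four color sequences red-red, red-yellow, yellow-red, and yellow-yellow.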
-
Other Approaches for Calculating Probabilities
Classical Approach: Assuming all outcomes to be equally likely, the probability of an event is the number of favourable outcomes divided by the total number of outcomes. E.g. rolling a die.
Subjective Approach: Assigning probability to an event based on one's experience.
-
Probability
P(Agriculture) = 0.13
P(Either Agriculture or Construction or both) = $P(\text{Agriculture} \cup \text{Construction})$ = 0.13 + 0.147 = 0.277.
P(Agriculture and Construction) = $P(\text{Agriculture} \cap \text{Construction})$ = 0.
P(Not in Agriculture) = $P(\text{Agriculture}^c)$ = 1 - 0.13 = 0.87.
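
These calculations can be reproduced from the workforce table. The sketch below stores the table as a dictionary and treats an event as a set of industries; because the industries are simple events, the probability of an event is the sum of the probabilities of its members. The names `workforce` and `prob` are illustrative.

```python
import math

# Workforce distribution from the example table.
workforce = {
    "Agriculture": 0.130,
    "Construction": 0.147,
    "Finance, Insurance, Real Estate": 0.059,
    "Manufacturing": 0.042,
    "Mining": 0.002,
    "Services": 0.419,
    "Trade": 0.159,
    "Transportation, Public Utilities": 0.042,
}


def prob(event):
    """Probability of an event given as a set of industries (simple events)."""
    return sum(workforce[industry] for industry in event)


# The probabilities of all simple events should add up to 1.
assert math.isclose(sum(workforce.values()), 1.0)

print(round(prob({"Agriculture", "Construction"}), 3))  # union of two simple events
print(round(1 - prob({"Agriculture"}), 3))              # complement of Agriculture
```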
-
Compound Events
If A and B are two events then
Union event is $A \cup B$
Intersection event is $A \cap B$
Complement event is $A^c$
-
Venn Diagram Representation
[Figure: four Venn diagrams drawn in the sample space S: disjoint events A and B; the intersection $A \cap B$; the union $A \cup B$; and mutually exclusive and exhaustive events A, B, C, and D.]
-
Probability Rules
1. $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
2. $P(A^c) = 1 - P(A)$
-
Mutually Exclusive
Def: Two events are mutually exclusive if they do not have any common outcome.
E.g.: Agriculture and Construction are mutually exclusive events.
-
Mutually Exclusive
A and B are mutually exclusive if $P(A \cap B) = 0$.
This implies that for mutually exclusive events A and B, $P(A \cup B) = P(A) + P(B)$.
-
Pizza Venn Diagram
-
What is the sample space?
Sample space = {Tomato only, Fish only, Mushroom-Tomato, Mushroom-Tomato-Fish, Mushroom-Fish, No toppings}.
-
Probability of the events in the sample space
P(Tomato only) = 2/8 = 1/4; P(Fish only) = 1/8.
P(Mushroom-Tomato) = 2/8 = 1/4; P(Mushroom-Tomato-Fish) = 1/8.
P(Mushroom-Fish) = 1/8; P(No toppings) = 1/8.
-
Union Rule
What is the probability that your slice will have tomato or mushroom?
Ans. 6/8=3/4
-
Intersection Rule
What is the probability that your slice will have tomato and mushroom?
Ans. 3/8
-
Complement Rule
What is the probability that your slice will not have tomato?
Ans. 3/8
-
Conditional Probability
You have pulled out a slice of pizza that has tomato on it. What is the probability that your slice will have mushrooms?
Ans. 3/5.
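
The pizza answers (union, intersection, complement, and conditional probability) can be verified by counting slices. The sketch below assumes eight equally likely slices whose toppings are inferred from the probabilities listed for the sample space (for example, 2/8 for tomato only); the names `slices`, `prob`, and `has` are illustrative.

```python
from fractions import Fraction

# One set of toppings per slice (assumed: 8 equally likely slices).
slices = [
    {"tomato"}, {"tomato"},                              # Tomato only: 2/8
    {"fish"},                                            # Fish only: 1/8
    {"mushroom", "tomato"}, {"mushroom", "tomato"},      # Mushroom-Tomato: 2/8
    {"mushroom", "tomato", "fish"},                      # Mushroom-Tomato-Fish: 1/8
    {"mushroom", "fish"},                                # Mushroom-Fish: 1/8
    set(),                                               # No toppings: 1/8
]


def prob(event, given=lambda s: True):
    """P(event | given) by counting slices; `given` defaults to the whole pizza."""
    pool = [s for s in slices if given(s)]
    return Fraction(sum(1 for s in pool if event(s)), len(pool))


def has(topping):
    """Build an event: the slice carries the named topping."""
    return lambda s: topping in s


p_union = prob(lambda s: "tomato" in s or "mushroom" in s)
p_intersection = prob(lambda s: "tomato" in s and "mushroom" in s)
p_no_tomato = 1 - prob(has("tomato"))
p_mushroom_given_tomato = prob(has("mushroom"), given=has("tomato"))

print(p_union, p_intersection, p_no_tomato, p_mushroom_given_tomato)
```

Conditioning on tomato simply shrinks the pool of slices from eight to the five that carry tomato, which is why the denominator changes from 8 to 5.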
-
Conditional Probability
Def: Probability of event A within event B. That is, the probability that event A occurs given that B occurs.
Notation: $P(A \mid B)$
-
Multiplication rule
$P(A \cap B) = P(A)\,P(B \mid A)$
$P(A \cap B) = P(B)\,P(A \mid B)$
-
Statistical Independence
Two events are said to be independent if the occurrence of one has no effect on the chance of occurrence of the other.
-
Statistical Independence
Two events A and B are considered independent when $P(A \mid B) = P(A)$.
-
Exercise 1
Is gender related to whether someone voted in the last mayoral election? Answer the question using the joint probabilities given in the table below.

                                      Gender
Voted in the last mayoral election    Female    Male
Yes                                   0.25      0.18
No                                    0.33      0.24
-
Statistical Independence
If two events A and B are independent then
1. $P(A \cap B) = P(A)\,P(B)$
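
One way to apply the product rule to the joint probabilities in Exercise 1 is to compare each joint probability with the product of its marginals. The sketch below does this; the names `joint`, `marginal`, and `independence_gaps` are illustrative, and since the table values are rounded to two decimals, small gaps are to be expected even under independence.

```python
# Joint probabilities from Exercise 1, keyed by (voted, gender).
joint = {
    ("Yes", "Female"): 0.25,
    ("Yes", "Male"): 0.18,
    ("No", "Female"): 0.33,
    ("No", "Male"): 0.24,
}


def marginal(position, label):
    """Marginal probability: sum the joint table over the other variable."""
    return sum(p for key, p in joint.items() if key[position] == label)


def independence_gaps():
    """Joint probability minus product of marginals, for every cell."""
    return {
        key: p - marginal(0, key[0]) * marginal(1, key[1])
        for key, p in joint.items()
    }


for key, gap in independence_gaps().items():
    print(key, round(gap, 4))
```

Every gap is well below the rounding precision of the table, which is consistent with voting and gender being (approximately) independent in this data.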
-
Law of Total Probability
Given a set of events $S_1, S_2, \dots, S_k$ that are mutually exclusive and exhaustive, and an event $A$, the probability of the event $A$ can be expressed as
$$P(A) = P(S_1)\,P(A \mid S_1) + P(S_2)\,P(A \mid S_2) + P(S_3)\,P(A \mid S_3) + \cdots + P(S_k)\,P(A \mid S_k)$$
-
Exercise 2
A business group owns three five-star hotels (say, A, B, and C) in India. By studying the past behavior of the revenue obtained from the three hotels month by month, it has been observed that the probability of an increase in the revenue of either B or C or both of them is 0.5. If A's revenue increases in a given month, the probability of an increase in B's revenue is 0.7, the probability of an increase in C's revenue is 0.6, and the probability of an increase in both B's and C's revenue is 0.5. However, if A's revenue does not increase in a given month, the probability of an increase in B's revenue is 0.2, the probability of an increase in C's revenue is 0.3, and the probability of an increase in both B's and C's revenue is 0.1. What is the probability that the revenue of all three hotels, A, B, and C, increases in a given month?
-
Exercise 3
You are a physician. You think it is quite likely that one of your patients has strep
throat, but you are not sure. You take some swabs from the throat and send them to
a lab for testing. The test is (like nearly all lab tests) not perfect. If the patient has
strep throat, then 70% of the time the lab says YES but 30% of the time it says NO.
If the patient does not have strep throat, then 90% of the time the lab says NO but
10% of the time it says YES. You send five successive swabs to the lab, from the same
patient. You get back these results, in order: YNYNY. What do you conclude?
These results are worthless.
It is likely that the patient does not have strep throat.
It is slightly more likely than not that the patient does have strep throat.
It is very much more likely than not that the patient does have strep throat.
-
Bayes Rule
Let $S_1, S_2, \dots, S_k$ represent $k$ mutually exclusive and exhaustive sub-populations with prior probabilities $P(S_1), P(S_2), \dots, P(S_k)$. If an event $A$ occurs, the posterior probability of $S_i$ given $A$ is the conditional probability
$$P(S_i \mid A) = \frac{P(S_i)\,P(A \mid S_i)}{\sum_{j=1}^{k} P(S_j)\,P(A \mid S_j)}$$
-
Exercise
Strep Throat Exercise
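
A minimal sketch of how Bayes' rule could be applied to the strep throat exercise follows. Two assumptions are not in the exercise and are made here only for illustration: the five swab results are treated as conditionally independent given the patient's true state, and the prior probability of strep throat is set to a hypothetical 0.5 (the exercise only says it is "quite likely"). The function names are illustrative.

```python
def sequence_likelihood(results, p_yes):
    """Probability of a YES/NO sequence if each test says YES with probability p_yes.

    Assumes the test results are independent given the patient's true state.
    """
    likelihood = 1.0
    for result in results:
        likelihood *= p_yes if result == "Y" else 1 - p_yes
    return likelihood


def posterior(priors, likelihoods):
    """Bayes' rule: posterior P(S_i | A) for each sub-population S_i."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)  # law of total probability gives P(A)
    return [j / total for j in joint]


results = "YNYNY"
lik_strep = sequence_likelihood(results, 0.7)     # lab says YES 70% of the time if strep
lik_no_strep = sequence_likelihood(results, 0.1)  # lab says YES 10% of the time otherwise

prior_strep = 0.5  # hypothetical prior, not given in the exercise
post_strep, post_no_strep = posterior(
    [prior_strep, 1 - prior_strep], [lik_strep, lik_no_strep]
)

print(round(lik_strep / lik_no_strep, 1))  # likelihood ratio in favour of strep
print(round(post_strep, 3))
```

The observed sequence is roughly 38 times more probable under strep throat than without it, so under these assumptions the posterior is high for any prior that is not very small; changing `prior_strep` shows how the conclusion depends on the physician's initial belief.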
-
Bibliography
An Introduction to Probability and Inductive Logic, by Ian Hacking
Introduction to Probability and Statistics, by William Mendenhall, Robert J. Beaver, and Barbara M. Beaver
Practice of Business Statistics, by David S. Moore, George P. McCabe, William M. Duckworth, and Stanley L. Sclove
Bradley A. Warner, David Pendergraft, and Timothy Webb, "That Was Venn, This Is Now", Journal of Statistics Education, Volume 6, Number 1, 1998