Download - Basic Statistics.pdf
-
8/14/2019 Basic Statistics.pdf
1/55
RESEARCH METHODS
Dr. M. Shakaib Akram
Email: [email protected]
BASIC STATISTICS
Raw Data
Raw data have not been manipulated or treated in any way beyond their original collection.
Ex: speeds (mph) of 105 vehicles
Frequency Distribution
A table that divides the data values into classes and shows the number of observed values that fall into each class.
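Frequency Distribution
Relative frequency distribution: the proportion or percentage of data values that fall within each class.
Cumulative frequency distribution: the number of observations that are within or below each of the classes.
Cumulative relative frequency distribution: the proportion or percentage of observations that are within or below each of the classes.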
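A minimal Python sketch of these tables, assuming equal-width classes. The speeds, the class width, and the names `frequency_table`, `class_width`, and `start` are hypothetical and are not the 105-vehicle data from the slide.

```python
def frequency_table(values, class_width, start):
    """Build frequency, relative, and cumulative columns for equal-width classes.

    Each class covers [lower, lower + class_width). Classes that contain no
    values are simply omitted from the result.
    """
    n = len(values)
    counts = {}
    for v in values:
        # Lower boundary of the class that v falls into.
        lower = start + ((v - start) // class_width) * class_width
        counts[lower] = counts.get(lower, 0) + 1

    rows = []
    cumulative = 0
    for lower in sorted(counts):
        cumulative += counts[lower]
        rows.append({
            "class": (lower, lower + class_width),
            "frequency": counts[lower],
            "relative": counts[lower] / n,
            "cumulative": cumulative,
            "cumulative_relative": cumulative / n,
        })
    return rows


# Hypothetical speeds in mph, for illustration only.
speeds = [52, 55, 58, 61, 63, 64, 66, 67, 68, 71]
table = frequency_table(speeds, class_width=5, start=50)
for row in table:
    print(row)
```

The last row's cumulative relative frequency is always 1, because every observation lies within or below the highest class.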
Histogram
The histogram describes a frequency distribution by using a series of adjacent rectangles, each of which has a length that is proportional to the frequency of the observations within the range of values it represents.
The Stem-and-Leaf Display
A variant of the frequency distribution that uses a subset of the original digits as class descriptors.
Ex: The raw data are the numbers of Congressional bills vetoed during the administrations of seven U.S. presidents, from Johnson to Clinton.
[Figure: stem-and-leaf diagram]
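As a rough illustration, a stem-and-leaf display for two-digit values can be built by using the tens digit as the stem and the units digit as the leaf. The counts below are hypothetical, because the veto figures from the slide did not survive extraction, and `stem_and_leaf` is an illustrative name.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group non-negative integers by tens digit (stem), listing units digits (leaves)."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        groups[stem].append(leaf)
    return dict(groups)


# Hypothetical counts, not the veto data from the slide.
counts = [12, 15, 21, 24, 28, 33, 37, 39, 41]
display = stem_and_leaf(counts)
for stem in sorted(display):
    print(f"{stem} | {' '.join(str(leaf) for leaf in display[stem])}")
```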
Bar Chart
A bar chart represents frequencies according to the relative lengths of a set of rectangles.
Bar chart vs. histogram:
Histogram: quantitative/continuous data
Bar chart: qualitative/categorical data
Adjacent rectangles in the histogram share a common side, while those in the bar chart have a gap between them.
The Scatterplot
A scatter diagram is a two-dimensional plot of data representing values of two quantitative variables.
x, the independent variable, on the horizontal axis
y, the dependent variable, on the vertical axis
Four ways in which two variables can be related:
1. Direct
2. Inverse
3. Curvilinear
4. No relationship
Coefficient of correlation (r)
Direction of the relationship: direct (r > 0) or inverse (r < 0).
Strength of the relationship: When r is close to -1 or +1, the linear relationship between x and y is strong. When r is close to 0, the linear relationship between x and y is weak. When r = 0, there is no linear relationship between x and y.
Coefficient of determination (r^2): the proportion (or, multiplied by 100, the percent) of total variation in y that is explained by variation in x.
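A short sketch of how r and r^2 can be computed, assuming the standard Pearson product-moment formula. The slides do not show a formula, so that choice, the function name `pearson_r`, and the data are assumptions made for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: one direct and one perfectly inverse relationship.
x = [1, 2, 3, 4, 5]
y_direct = [2, 4, 5, 4, 5]
y_inverse = [10, 8, 6, 4, 2]

r_direct = pearson_r(x, y_direct)
r_inverse = pearson_r(x, y_inverse)
print(r_direct, r_direct ** 2)    # r > 0: direct relationship
print(r_inverse, r_inverse ** 2)  # r = -1: perfect inverse linear relationship
```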
The Center: Mean
The Center: Median
To find the median:
1. Put the data in an ordered array (sorted from smallest to largest).
2A. If the data set has an ODD number of values, the median is the middle value.
2B. If the data set has an EVEN number of values, the median is the AVERAGE of the middle two values.
(Note that the median of an even set of data values is not necessarily a member of the set of values.)
The median is particularly useful if there are outliers in the data set, which otherwise tend to influence the value of an arithmetic mean.
The Center: Mode
The mode is the most frequent value.
While there is just one value for the mean and one value for the median, there may be more than one value for the mode of a data set.
The mode tends to be less frequently used than the mean or the median.
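The three measures of center can be compared with Python's standard `statistics` module. The scores are the ten Team A quiz scores from the standard deviation example in this deck. `multimode` (Python 3.8+) is used because this data set has more than one mode.

```python
import statistics

# Team A quiz scores from the standard deviation example.
scores = [72, 76, 80, 80, 81, 83, 84, 85, 85, 89]

mean = statistics.mean(scores)        # sum of values / number of values
median = statistics.median(scores)    # even count: average of the two middle values
modes = statistics.multimode(scores)  # every value tied for most frequent

print(mean, median, modes)
```

Here the median (82) is not itself one of the scores, and there are two modes (80 and 85).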
The Spread: Range
The range is the distance between the smallest and the largest data value in the set.
Range = largest value - smallest value
Sometimes range is reported as an interval, anchored between the smallest and largest data value, rather than the actual width of that interval.
The Spread: Variance
Variance measures how far a set of numbers is spread out.
Variance is one of the most frequently used measures of spread.
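For reference, the usual definitions of variance are given below. The slide's own formula did not survive extraction, so the notation is an assumption: $\mu$ and $N$ are the population mean and size, and $\bar{x}$ and $n$ are the sample mean and size. The sample form, with its $n - 1$ divisor, is the one used in the worked example later in this deck.

```latex
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2
```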
The Spread: Standard Deviation
A measure of the dispersion of a set of data from its mean
Mathematically: the square root of variance
for a population, $\sigma = \sqrt{\sigma^2}$
for a sample, $s = \sqrt{s^2}$
Example: Standard Deviation
Two classes took a recent quiz. There were 10 students in each class, and each class had an average score of 81.5.
Since the averages are the same, can we assume that the students in both classes all did pretty much the same on the exam?
Example: Standard Deviation
The answer is no.
The average (mean) does not tell us anything about the distribution or variation in the grades.
Example: Standard Deviation
Here are dot plots of the grades in each class:
Example: Standard Deviation
So, we need to come up with some way of measuring not just the average, but also the spread of the distribution of our data.
Example: Standard Deviation
Why not just give an average and the range of data (the highest and lowest values) to describe the distribution of the data?
Example: Standard Deviation
Well, for example, let's say that for a set of data the average is 17.95 and the range is 23.
But what if the data looked like this:
Example: Standard Deviation
Here is the average.
And here is the range.
But really, most of the numbers are in this area, and are not evenly distributed throughout the range.
Example: Standard Deviation
The Standard Deviation is a number that measures how far, on average, the numbers in a set of data are from their mean.
Example: Standard Deviation
If the Standard Deviation is large, it means the numbers are spread out from their mean.
If the Standard Deviation is small, it means the numbers are close to their mean.
Example: Standard Deviation
Here are the scores on the math quiz for Team A:
72, 76, 80, 80, 81, 83, 84, 85, 85, 89
Average: 81.5
Example: Standard Deviation
The Standard Deviation measures how far away each number in a set of data is from their mean.
For example, start with the lowest score, 72. How far away is 72 from the mean of 81.5?
72 - 81.5 = -9.5
Example: Standard Deviation
Or, start with the highest score, 89. How far away is 89 from the mean of 81.5?
89 - 81.5 = 7.5
Example: Standard Deviation
So, the first step to finding the Standard Deviation is to find all the distances from the mean.
Score   Distance from Mean
72      -9.5
76      -5.5
80      -1.5
80      -1.5
81      -0.5
83      1.5
84      2.5
85      3.5
85      3.5
89      7.5
Example: Standard Deviation
Next, you need to square each of the distances to turn them all into positive numbers.
Score   Distance from Mean   Distance Squared
72      -9.5                 90.25
76      -5.5                 30.25
80      -1.5                 2.25
80      -1.5                 2.25
81      -0.5                 0.25
83      1.5                  2.25
84      2.5                  6.25
85      3.5                  12.25
85      3.5                  12.25
89      7.5                  56.25
Example: Standard Deviation
Add up all of the squared distances:
90.25 + 30.25 + 2.25 + 2.25 + 0.25 + 2.25 + 6.25 + 12.25 + 12.25 + 56.25 = 214.5
Example: Standard Deviation
Divide by (n - 1), where n represents the number of values you have:
214.5 / (10 - 1) ≈ 23.8
Example: Standard Deviation
Finally, take the square root of the average squared distance:
√23.8 ≈ 4.88
Example: Standard Deviation
This is the Standard Deviation: 4.88
Example: Standard Deviation
Now find the Standard Deviation for the other class grades.
Score   Distance from Mean   Distance Squared
57      -24.5                600.25
65      -16.5                272.25
83      1.5                  2.25
94      12.5                 156.25
95      13.5                 182.25
96      14.5                 210.25
98      16.5                 272.25
93      11.5                 132.25
71      -10.5                110.25
63      -18.5                342.25
Sum: 2280.5
2280.5 / (10 - 1) ≈ 253.4
√253.4 ≈ 15.92
Example: Standard Deviation
Now, let's compare the two classes again:
                      Team A   Team B
Average on the Quiz   81.5     81.5
Standard Deviation    4.88     15.92
Example: Standard Deviation
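The steps of the worked example can be retraced in a few lines of Python. This is a sketch of the sample (n - 1) calculation shown on the slides; the function and variable names are illustrative.

```python
import math


def sample_standard_deviation(scores):
    """Follow the slide's steps: distances, squares, sum, divide by n - 1, square root."""
    n = len(scores)
    mean = sum(scores) / n
    distances = [s - mean for s in scores]  # step 1: distance of each score from the mean
    squared = [d ** 2 for d in distances]   # step 2: square each distance
    total = sum(squared)                    # step 3: add up the squared distances
    variance = total / (n - 1)              # step 4: divide by (n - 1)
    return math.sqrt(variance)              # step 5: take the square root


team_a = [72, 76, 80, 80, 81, 83, 84, 85, 85, 89]
team_b = [57, 65, 83, 94, 95, 96, 98, 93, 71, 63]

sd_a = sample_standard_deviation(team_a)
sd_b = sample_standard_deviation(team_b)
print(sd_a, sd_b)
```

Both classes have the same mean of 81.5, but the second standard deviation is more than three times the first (about 15.9 against about 4.9), which is the difference the dot plots show.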
Relative Position - Quartiles
Quartiles divide the values of a data set into four subsets of equal size, each comprising 25% of the observations.
To find the first, second, and third quartiles:
1. Arrange the N data values from smallest to largest.
2. First quartile, Q1 = data value at position (N + 1)/4
3. Second quartile, Q2 = data value at position 2(N + 1)/4
4. Third quartile, Q3 = data value at position 3(N + 1)/4
Interquartile range = Q3 - Q1
Finding the median, quartiles and inter-quartile range.
Example 1: Find the median and quartiles for the data below.
12, 6, 4, 9, 8, 4, 9, 8, 5, 9, 8, 10
Order the data:
4, 4, 5, 6, 8, 8, 8, 9, 9, 9, 10, 12
Lower quartile, Q1 = 5
Median, Q2 = 8
Upper quartile, Q3 = 9
Inter-quartile range = 9 - 5 = 4
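The position rule k(N + 1)/4 often lands between two data values. The sketch below handles that by interpolating linearly between the two neighbours. That is one common convention and an assumption here, since the slides do not say how fractional positions are treated. The helper name `value_at_position` is hypothetical.

```python
def value_at_position(sorted_values, position):
    """Return the value at a 1-based position in sorted data.

    When the position is not a whole number, interpolate linearly
    between the two neighbouring values.
    """
    lower = int(position)        # whole part of the position
    fraction = position - lower  # fractional part of the position
    if lower < 1:
        return sorted_values[0]
    if lower >= len(sorted_values):
        return sorted_values[-1]
    below = sorted_values[lower - 1]
    above = sorted_values[lower]
    return below + fraction * (above - below)


data = [12, 6, 4, 9, 8, 4, 9, 8, 5, 9, 8, 10]
ordered = sorted(data)
n = len(ordered)

q1 = value_at_position(ordered, (n + 1) / 4)      # position 3.25
q2 = value_at_position(ordered, 2 * (n + 1) / 4)  # position 6.5
q3 = value_at_position(ordered, 3 * (n + 1) / 4)  # position 9.75
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

With interpolation, Q1 comes out as 5.25 rather than 5. Reading off the data value at the nearest whole position instead (the 3rd and 10th values) gives Q1 = 5 and Q3 = 9, as in Example 1. Different textbooks and software packages use different quartile conventions, so small discrepancies like this are normal.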
Normality
A normal distribution is assumed by many statistical procedures.
Normal distributions take the form of a symmetric bell-shaped curve.
The standard normal distribution is one with a mean of 0 and a standard deviation of 1.
Standard scores, also called z-scores or standardized data, are scores which have had the mean subtracted and which have been divided by the standard deviation to yield scores which have a mean of 0 and a standard deviation of 1.
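A brief sketch of standardizing scores, using the Team A quiz scores from the earlier example. It assumes the sample standard deviation (`statistics.stdev`, n - 1 divisor); the function name `z_scores` is illustrative.

```python
import statistics


def z_scores(values):
    """Subtract the mean from each value and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    return [(v - mean) / sd for v in values]


scores = [72, 76, 80, 80, 81, 83, 84, 85, 85, 89]
z = z_scores(scores)
print([round(value, 2) for value in z])
```

After standardizing, the z-scores have a mean of 0 and a standard deviation of 1, up to floating-point rounding.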
Skewness
Skewness measures the deviation of the distribution from symmetry.
If the skewness is clearly different from 0, then that distribution is asymmetrical, while normal distributions are perfectly symmetrical.
Positive skew (left-leaning, with a long tail to the right): the mean is greater than the median
Negative skew (right-leaning, with a long tail to the left): the median is greater than the mean
Distribution Shape and Measures of Central Tendency
If mean = median = mode, the shape of the distribution is symmetric.
If mean < median < mode, the shape of the distribution is negatively skewed.
If mode < median < mean, the shape of the distribution is positively skewed.
Kurtosis
Kurtosis is the peakedness of a distribution.
A common rule-of-thumb test for normality is to run descriptive statistics to get skewness and kurtosis, then use the criterion that kurtosis should be within the +2 to -2 range when the data are normally distributed (here kurtosis is measured relative to the normal distribution, which has a value of 0).
Positive kurtosis indicates a relatively peaked distribution.
Negative kurtosis indicates a relatively flat distribution.
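Skewness and kurtosis can be sketched with the simple moment-based formulas below. These are an assumption: statistics packages usually apply small-sample corrections, so their output will differ slightly. The kurtosis here is "excess" kurtosis, with 3 subtracted so that a normal distribution scores 0. All names and data are hypothetical.

```python
def central_moment(values, k):
    """k-th central moment: the average of (value - mean) ** k."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)


def skewness(values):
    """Third central moment divided by the variance raised to the power 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth central moment divided by the squared variance, minus 3."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


# Hypothetical data sets.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tailed), excess_kurtosis(right_tailed))
```

The symmetric set has zero skewness and negative excess kurtosis (a flat shape). The right-tailed set has positive skewness, and its mean (about 3.17) is greater than its median (2).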
Outliers
Simple outlier: cases with extreme values with respect to a single variable; cases which are more than plus or minus three standard deviations from the mean of the variable.
Outliers can radically alter the outcome of analysis and are also violations of normality.
Multivariate outlier: cases with extreme values with respect to multiple variables.
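The three-standard-deviation rule for simple outliers can be sketched as follows. The data and the name `simple_outliers` are hypothetical. With very small samples no value can reach three sample standard deviations from the mean, so the example uses 21 cases.

```python
import statistics


def simple_outliers(values, threshold=3.0):
    """Return the values more than `threshold` sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / sd > threshold]


# Hypothetical variable: twenty typical cases plus one extreme case.
values = [48, 49, 50, 51, 52] * 4 + [120]
outliers = simple_outliers(values)
print(outliers)
```

Note that the extreme case also inflates the mean and the standard deviation it is judged against, which is one way an outlier can alter the outcome of an analysis.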
Types of Variable
Nominal: A variable can be treated as nominal when its values represent categories with no intrinsic ranking (for example, the department of the company in which an employee works).
Ex: region, zip code, and religious affiliation
Ordinal: A variable can be treated as ordinal when its values represent categories with some intrinsic ranking (for example, levels of service satisfaction from highly dissatisfied to highly satisfied).
Ex: attitude scores representing degree of satisfaction or confidence and preference rating scores
Scale: A variable can be treated as scale (continuous) when its values represent ordered categories with a meaningful metric, so that distance comparisons between values are appropriate.
Ex: age in years and income in thousands of dollars
Statistical Significance (p-values)
The statistical significance (p-value) of a result is the probability that a relationship (e.g., between variables) or a difference (e.g., between means) at least as strong as the one observed in a sample would occur by pure chance ("luck of the draw") if, in the population from which the sample was drawn, no such relationship or difference existed.
The statistical significance of a result tells us something about the degree to which the result is "true" (in the sense of being "representative of the population").
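One way to make the "luck of the draw" idea concrete is an exact permutation test. If group labels carried no information, every way of splitting the pooled values into two groups would be equally likely, and the p-value is the share of splits whose difference in means is at least as large as the observed one. This technique is brought in for illustration and is not taken from the slides; the data and names are hypothetical.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference between two means."""
    pooled = group_a + group_b
    n_a = len(group_a)
    n_b = len(group_b)
    total = sum(pooled)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    as_extreme = 0
    splits = 0
    # Try every way of choosing which pooled values form the first group.
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        difference = abs(sum_a / n_a - (total - sum_a) / n_b)
        splits += 1
        if difference >= observed - 1e-12:  # small tolerance for float rounding
            as_extreme += 1
    return as_extreme / splits


# Hypothetical scores for two small groups.
group_a = [1, 2, 3, 4]
group_b = [5, 6, 7, 8]
p = permutation_p_value(group_a, group_b)
print(p)
```

Only 2 of the 70 possible splits separate the groups this cleanly, so p = 2/70, or about 0.029.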
Estimation Methods for Replacing Missing Values
Series mean. Replaces missing values with the mean for the entire series.
Mean of nearby points. Replaces missing values with the mean of valid surrounding values. The span of nearby points is the number of valid values above and below the missing value used to compute the mean.
Median of nearby points. Replaces missing values with the median of valid surrounding values. The span of nearby points is the number of valid values above and below the missing value used to compute the median.
Linear interpolation. Replaces missing values using a linear interpolation. The last valid value before the missing value and the first valid value after the missing value are used for the interpolation. If the first or last case in the series has a missing value, the missing value is not replaced.
Linear trend at point. Replaces missing values with the linear trend for that point. The existing series is regressed on an index variable scaled 1 to n. Missing values are replaced with their predicted values.
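Two of these methods, series mean and linear interpolation, are sketched below in plain Python. Missing values are represented by `None`. The function names and the series are hypothetical, and this is an illustration of the descriptions above rather than any particular package's implementation.

```python
def series_mean(series):
    """Replace each missing value with the mean of all valid values in the series."""
    valid = [v for v in series if v is not None]
    fill = sum(valid) / len(valid)
    return [fill if v is None else v for v in series]


def linear_interpolation(series):
    """Replace each missing value by interpolating between its nearest valid neighbours.

    Missing values at the start or end of the series are left unreplaced.
    """
    result = list(series)
    n = len(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        # Index of the last valid value before i and the first valid value after i.
        before = next((j for j in range(i - 1, -1, -1) if series[j] is not None), None)
        after = next((j for j in range(i + 1, n) if series[j] is not None), None)
        if before is None or after is None:
            continue  # no valid value on one side: leave the gap
        step = (series[after] - series[before]) / (after - before)
        result[i] = series[before] + step * (i - before)
    return result


# Hypothetical series with gaps.
series = [10.0, None, 14.0, None, None, 20.0, None]
print(series_mean(series))
print(linear_interpolation(series))
```

In the interpolated result the final value stays missing, because there is no valid value after it to interpolate toward.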