
8/14/2019 Basic Statistics.pdf

1/55

RESEARCH METHODS

Dr. M. Shakaib Akram
Email: [email protected]


2/55

BASIC STATISTICS


3/55

Raw Data

Raw data have not been manipulated or treated in any way beyond their original collection.

Ex: speeds (mph) of 105 vehicles


4/55

Frequency Distribution

A table that divides the data values into classes and shows the number of observed values that fall into each class


5/55

Frequency Distribution

Relative frequency distribution: the proportion or percentage of data values that fall within each category

Cumulative relative frequency distribution: the proportion or percentage of observations that are within or below each of the classes
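The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `frequency_table`, the class boundaries, and the ten speed values are all hypothetical (the 105 vehicle speeds from the raw-data slide are not reproduced here).

```python
from collections import Counter

def frequency_table(values, low, width, n_classes):
    """Return one row per class:
    (class_low, class_high, frequency, relative freq, cumulative relative freq).
    Classes are half-open intervals [class_low, class_high)."""
    counts = Counter()
    for v in values:
        idx = int((v - low) // width)      # index of the class this value falls in
        if 0 <= idx < n_classes:
            counts[idx] += 1
    total = sum(counts.values())
    rows, running = [], 0
    for i in range(n_classes):
        running += counts[i]               # observations within or below this class
        rows.append((low + i * width, low + (i + 1) * width,
                     counts[i], counts[i] / total, running / total))
    return rows

# Hypothetical speeds in mph, for illustration only
speeds = [52, 57, 61, 63, 64, 66, 68, 71, 73, 78]
table = frequency_table(speeds, low=50, width=10, n_classes=3)
for row in table:
    print(row)
```

The cumulative relative frequency of the last class is always 1, which is a quick sanity check on any such table.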


6/55

Histogram

The histogram describes a frequency distribution by using a series of adjacent rectangles, each of which has a length that is proportional to the frequency of the observations within the range of values it represents


7/55

The Stem-and-Leaf Display

A variant of the frequency distribution that uses a subset of the original digits as class descriptors.

Ex: The raw data are the numbers of Congressional bills vetoed during the administrations of seven U.S. presidents, from Johnson to Clinton

Stem-and-leaf diagram
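As a rough sketch of how such a display is built, the snippet below uses the tens digit as the stem and the units digit as the leaf. The values are hypothetical two-digit counts chosen for illustration; they are not the veto counts from the slide, which are not reproduced in this text.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group two-digit values by stem (tens digit); leaves are the units digits."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(sorted(stems.items()))

# Hypothetical data, for illustration only
counts = [21, 30, 43, 66, 31, 78, 44, 37]
display = stem_and_leaf(counts)
for stem, leaves in display.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Unlike a plain frequency table, the display keeps every original value recoverable from its stem and leaf.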


8/55

Bar Chart

A bar chart represents frequencies according to the relative lengths of a set of rectangles.

Bar chart vs. histogram:

Histogram: quantitative/continuous data

Bar chart: qualitative/categorical data

Adjacent rectangles in the histogram share a common side, while those in the bar chart have a gap between them.


9/55

The Scatterplot

A scatter diagram is a two-dimensional plot of data representing values of two quantitative variables.

x, the independent variable, on the horizontal axis

y, the dependent variable, on the vertical axis

Four ways in which two variables can be related:

1. Direct

2. Inverse

3. Curvilinear

4. No relationship


10/55

The Scatterplot


11/55

Coefficient of correlation (r)

Coefficient of correlation, r

Direction of the relationship: direct (r > 0) or inverse (r < 0)

Strength of the relationship: When r is close to -1 or +1, the linear relationship between x and y is strong. When r is close to 0, the linear relationship between x and y is weak. When r = 0, there is no linear relationship between x and y.

Coefficient of determination, r^2

The proportion (often reported as a percent) of total variation in y that is explained by variation in x.
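A minimal sketch of how r and r^2 can be computed, assuming the usual Pearson product-moment definition (the slides do not state a formula, so the function `pearson_r` and the small data sets are illustrative assumptions).

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y scaled by their individual spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: one perfectly direct and one perfectly inverse relationship
x = [1, 2, 3, 4, 5]
y_direct = [2, 4, 6, 8, 10]
y_inverse = [10, 8, 6, 4, 2]

r_direct = pearson_r(x, y_direct)     # close to +1
r_inverse = pearson_r(x, y_inverse)   # close to -1
r_squared = r_direct ** 2             # share of variation in y explained by x
print(r_direct, r_inverse, r_squared)
```

Note that r only captures linear association: a strong curvilinear relationship can still produce an r near 0.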



12/55

Coefficient of correlation


13/55



14/55



15/55

The Center: Mean


16/55

The Center: Median

To find the median:

1. Put the data in an ordered array (smallest to largest).

2A. If the data set has an ODD number of values, the median is the middle value.

2B. If the data set has an EVEN number of values, the median is the AVERAGE of the middle two values.

(Note that the median of an even set of data values is not necessarily a member of the set of values.)

The median is particularly useful if there are outliers in the data set, which otherwise tend to influence the value of an arithmetic mean.


17/55

The Center: Mode

The mode is the most frequent value.

While there is just one value for the mean and one value for the median, there may be more than one value for the mode of a data set.

The mode tends to be less frequently used than the mean or the median.
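The three measures of center can be checked quickly with Python's standard `statistics` module. The sketch below uses the ten quiz scores that appear later in these slides; `multimode` (Python 3.8+) is used because, as noted above, a data set can have more than one mode.

```python
import statistics

# The ten Team A quiz scores used later in these slides
scores = [72, 76, 80, 80, 81, 83, 84, 85, 85, 89]

mean = statistics.mean(scores)        # arithmetic average
median = statistics.median(scores)    # average of the two middle values (n is even)
modes = statistics.multimode(scores)  # every value tied for most frequent

print(mean, median, modes)
```

Here the median (82) is not itself one of the scores, and the data set has two modes (80 and 85).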


18/55

The Spread: Range

The range is the distance between the smallest and the largest data value in the set.

Range = largest value - smallest value

Sometimes range is reported as an interval, anchored between the smallest and largest data value, rather than the actual width of that interval.


19/55

The Spread: Variance

Variance measures how far a set of numbers is spread out.

Variance is one of the most frequently used measures of spread.


20/55

The Spread: Standard Deviation

A measure of the dispersion of a set of data from its mean

Mathematically: the square root of variance

for a population, $\sigma = \sqrt{\sigma^2}$

for a sample, $s = \sqrt{s^2}$
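For reference, the variances under those square roots are usually written as follows. The symbols are the conventional ones and are an assumption here, since the slide's formula images did not survive: $N$ and $\mu$ are the population size and mean, $n$ and $\bar{x}$ the sample size and mean.

```latex
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2
\qquad\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2
```

The sample version divides by $n - 1$ rather than $n$, which is the divisor used in the worked quiz-score example in these slides.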


21/55

Example: Standard Deviation

Two classes took a recent quiz. There were 10 students in each class, and each class had an average score of 81.5


22/55

Since the averages are the same, can we assume that the students in both classes all did pretty much the same on the exam?



23/55

The answer is No.

The average (mean) does not tell us anything about the distribution or variation in the grades.


24/55

Here are dot plots of the grades in each class:



25/55

Mean


26/55

So, we need to come up with some way of measuring not just the average, but also the spread of the distribution of our data.



27/55

Why not just give an average and the range of data (the highest and lowest values) to describe the distribution of the data?



28/55

Well, for example, let's say from a set of data, the average is 17.95 and the range is 23.

But what if the data looked like this:



29/55

Here is the average.

And here is the range.

But really, most of the numbers are in this area, and are not evenly distributed throughout the range.



30/55

The Standard Deviation is a number that measures how far away each number in a set of data is from their mean.



31/55

If the Standard Deviation is large, it means the numbers are spread out from their mean.

If the Standard Deviation is small, it means the numbers are close to their mean.



32/55

Here are the scores on the math quiz for Team A:

72, 76, 80, 80, 81, 83, 84, 85, 85, 89

Average: 81.5



33/55

The Standard Deviation measures how far away each number in a set of data is from their mean.

For example, start with the lowest score, 72. How far away is 72 from the mean of 81.5?

72 - 81.5 = -9.5



34/55

Or, start with the highest score, 89. How far away is 89 from the mean of 81.5?

89 - 81.5 = 7.5



35/55

So, the first step to finding the Standard Deviation is to find all the distances from the mean.

Score   Distance from Mean
72      -9.5
76      -5.5
80      -1.5
80      -1.5
81      -0.5
83       1.5
84       2.5
85       3.5
85       3.5
89       7.5



37/55

Next, you need to square each of the distances to turn them all into positive numbers.

Score   Distance from Mean   Distance Squared
72      -9.5                 90.25
76      -5.5                 30.25
80      -1.5                  2.25
80      -1.5                  2.25
81      -0.5                  0.25
83       1.5                  2.25
84       2.5                  6.25
85       3.5                 12.25
85       3.5                 12.25
89       7.5                 56.25



39/55

Add up all of the squared distances.

Sum: 214.5



40/55

Divide by (n - 1), where n represents the number of values you have.

214.5 / (10 - 1) = 23.8



41/55

Finally, take the square root of the average squared distance.

Square root of 23.8 = 4.88



42/55

This is the Standard Deviation: 4.88
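The steps of this worked example can be traced in Python. This is only a sketch of the procedure shown on the slides (distances, squares, sum, divide by n - 1, square root); the variable names are illustrative.

```python
import math

team_a = [72, 76, 80, 80, 81, 83, 84, 85, 85, 89]

mean = sum(team_a) / len(team_a)              # 81.5
distances = [score - mean for score in team_a]
squared = [d ** 2 for d in distances]         # squaring removes the signs
total = sum(squared)                          # sum of squared distances
variance = total / (len(team_a) - 1)          # divide by (n - 1)
std_dev = math.sqrt(variance)                 # back to the original units

print(mean, total, round(variance, 1), round(std_dev, 2))
```

Squaring is what keeps the negative and positive distances from cancelling; the raw distances from the mean always sum to zero.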



43/55

Now find the Standard Deviation for the other class grades.

Score   Distance from Mean   Distance Squared
57      -24.5                600.25
65      -16.5                272.25
83        1.5                  2.25
94       12.5                156.25
95       13.5                182.25
96       14.5                210.25
98       16.5                272.25
93       11.5                132.25
71      -10.5                110.25
63      -18.5                342.25

Sum: 2280.5

2280.5 / (10 - 1) = 253.4

Square root of 253.4 = 15.92



44/55

Now, let's compare the two classes again:

                      Team A   Team B
Average on the Quiz   81.5     81.5
Standard Deviation    4.88     15.92
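The comparison can be reproduced with the standard library. `statistics.stdev` divides by n - 1, matching the procedure used in these slides (`statistics.pstdev` would divide by n instead); the dictionary layout is just an illustrative choice.

```python
import statistics

team_a = [72, 76, 80, 80, 81, 83, 84, 85, 85, 89]
team_b = [57, 65, 83, 94, 95, 96, 98, 93, 71, 63]

summary = {
    name: (statistics.mean(scores), statistics.stdev(scores))
    for name, scores in {"Team A": team_a, "Team B": team_b}.items()
}

for name, (mean, sd) in summary.items():
    print(f"{name}: mean = {mean}, standard deviation = {sd:.2f}")
```

Both classes share the same mean, but Team B's standard deviation is more than three times larger, which is exactly the difference the dot plots show.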



45/55

Relative Position - Quartiles

Quartiles divide the values of a data set into four subsets of equal size, each comprising 25% of the observations.

To find the first, second, and third quartiles:

1. Arrange the N data values from smallest to largest.

2. First quartile, Q1 = data value at position (N + 1)/4

3. Second quartile, Q2 = data value at position 2(N + 1)/4

4. Third quartile, Q3 = data value at position 3(N + 1)/4

Interquartile range = Q3 - Q1
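A small sketch of the position rule above. The slide does not say what to do when the position is not a whole number; linear interpolation between the two neighbouring values is assumed here, and both the function `quartile` and the data are hypothetical.

```python
def quartile(sorted_values, k):
    """Value at 1-based position k * (N + 1) / 4 in an ordered list.
    Fractional positions are linearly interpolated (an assumption)."""
    n = len(sorted_values)
    pos = k * (n + 1) / 4
    lower = int(pos)            # whole part of the position
    frac = pos - lower          # fractional part of the position
    if lower < 1:
        return sorted_values[0]
    if lower >= n:
        return sorted_values[-1]
    below = sorted_values[lower - 1]
    above = sorted_values[lower]
    return below + frac * (above - below)

# Hypothetical data with N = 11, so the positions are 3, 6 and 9
data = sorted([12, 5, 9, 2, 18, 7, 14, 4, 11, 15, 8])
q1, q2, q3 = (quartile(data, k) for k in (1, 2, 3))
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

Several quartile conventions are in use, so software packages and hand methods can give slightly different values for the same data.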


46/55

Finding the median, quartiles and inter-quartile range.

Example 1: Find the median and quartiles for the data below.

12, 6, 4, 9, 8, 4, 9, 8, 5, 9, 8, 10

Order the data:

4, 4, 5, 6, 8, 8, 8, 9, 9, 9, 10, 12

Lower quartile (Q1) = 5.5 (median of the lower half: 4, 4, 5, 6, 8, 8)

Median (Q2) = 8

Upper quartile (Q3) = 9 (median of the upper half: 8, 9, 9, 9, 10, 12)

Inter-Quartile Range = 9 - 5.5 = 3.5


47/55

Normality

A normal distribution is assumed by many statistical procedures.

Normal distributions take the form of a symmetric bell-shaped curve.

The standard normal distribution is one with a mean of 0 and a standard deviation of 1.

Standard scores, also called z-scores or standardized data, are scores which have had the mean subtracted and which have been divided by the standard deviation to yield scores which have a mean of 0 and a standard deviation of 1.
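Standardizing is a two-step transformation, sketched below on the Team A quiz scores from earlier in these slides. The sample standard deviation (n - 1 divisor) is assumed, to stay consistent with the worked example; the function name `z_scores` is illustrative.

```python
import statistics

def z_scores(values):
    """Subtract the mean, then divide by the (sample) standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

scores = [72, 76, 80, 80, 81, 83, 84, 85, 85, 89]
z = z_scores(scores)

print([round(value, 2) for value in z])
print(statistics.mean(z), statistics.stdev(z))  # approximately 0 and 1
```

Standardizing changes the scale of the data, not its shape: a skewed variable stays skewed after conversion to z-scores.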


48/55

Normality (Cont.)


49/55

Skewness

Skewness measures the deviation of the distribution from symmetry.

If the skewness is clearly different from 0, then the distribution is asymmetrical, while normal distributions are perfectly symmetrical.

Positive skew: left-leaning (the long tail extends to the right); the mean is greater than the median

Negative skew: right-leaning (the long tail extends to the left); the median is greater than the mean
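A minimal sketch of a skewness calculation, assuming the simple moment-based definition (third central moment divided by the 1.5 power of the second). Statistical packages typically add a small-sample correction, so their values differ slightly. The data sets are hypothetical.

```python
import statistics

def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical data sets, for illustration only
right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]     # one large value pulls the mean up
symmetric = [1, 2, 3, 4, 5]
left_tailed = [-v for v in right_tailed]     # mirror image of the first set

for name, data in [("right-tailed", right_tailed),
                   ("symmetric", symmetric),
                   ("left-tailed", left_tailed)]:
    print(name, round(skewness(data), 3),
          statistics.mean(data), statistics.median(data))
```

In the right-tailed set the mean (3.5) exceeds the median (3), and the ordering reverses in its mirror image.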



50/55

Distribution Shape and Measures of Central Tendency

If mean = median = mode, the shape of the distribution is symmetric.

If mean < median < mode, the shape of the distribution is negatively skewed.

If mode < median < mean, the shape of the distribution is positively skewed.


51/55

Kurtosis

Kurtosis is the peakedness of a distribution.

A common rule-of-thumb test for normality is to run descriptive statistics to get skewness and kurtosis, then use the criterion that kurtosis should be within the +2 to -2 range when the data are normally distributed.

Positive kurtosis indicates a relatively peaked distribution.

Negative kurtosis indicates a relatively flat distribution.
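The sign convention above (positive for peaked, negative for flat) corresponds to excess kurtosis, which is 0 for a normal distribution. The sketch below assumes the plain moment-based version, fourth central moment over the squared second moment, minus 3; packages usually apply small-sample corrections. The data are hypothetical.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3

# Hypothetical data sets, for illustration only
flat = [1, 2, 3, 4, 5, 6]                    # values spread evenly
peaked = [5, 5, 5, 5, 5, 5, 5, 5, 1, 9]      # most values pile up at the center

for name, data in [("flat", flat), ("peaked", peaked)]:
    k = excess_kurtosis(data)
    print(name, round(k, 3), "within +/-2" if -2 <= k <= 2 else "outside +/-2")
```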


52/55

Outliers

Simple outlier: cases with extreme values with respect to a single variable; commonly, cases which are more than plus or minus three standard deviations from the mean of the variable.

Simple outliers can radically alter the outcome of analysis and are also violations of normality.

Multivariate outlier: cases with extreme values with respect to multiple variables.
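The three-standard-deviation rule for simple outliers can be sketched as follows. The function name, the threshold parameter, and the data are hypothetical; the sample standard deviation is assumed.

```python
import statistics

def simple_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > threshold * sd]

# Hypothetical data: nineteen values near 50 and one extreme case
values = [50, 51, 49, 52, 48, 50, 51, 49, 50, 52,
          48, 51, 49, 50, 50, 51, 49, 50, 52, 120]

outliers = simple_outliers(values)
print(outliers)
```

One caveat of this rule: the extreme case itself inflates the mean and the standard deviation, and in samples of ten or fewer values no point can lie three sample standard deviations from the mean, so the rule is only informative for larger samples.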


53/55

Types of Variable

Nominal: A variable can be treated as nominal when its values represent categories with no intrinsic ranking (for example, the department of the company in which an employee works).

Ex: region, zip code, and religious affiliation

Ordinal: A variable can be treated as ordinal when its values represent categories with some intrinsic ranking (for example, levels of service satisfaction from highly dissatisfied to highly satisfied).

Ex: attitude scores representing degree of satisfaction or confidence and preference rating scores

Scale: A variable can be treated as scale (continuous) when its values represent ordered categories with a meaningful metric, so that distance comparisons between values are appropriate.

Ex: age in years and income in thousands of dollars


54/55

Statistical Significance (p-values)

The statistical significance of a result (its p-value) is the probability of observing a relationship (e.g., between variables) or a difference (e.g., between means) at least as strong as the one found in the sample purely by chance ("luck of the draw"), assuming that in the population from which the sample was drawn, no such relationship or difference exists.

The statistical significance of a result tells us something about the degree to which the result is "true" (in the sense of being "representative of the population"): the smaller the p-value, the less plausible it is that chance alone produced the result.
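One way to make the "luck of the draw" idea concrete is an exact permutation test, sketched below. This technique is not mentioned in the slides; it is swapped in purely as an illustration, and the function name and the two tiny groups are hypothetical. It counts how often a random relabelling of the observations produces a difference in means at least as large as the observed one.

```python
from itertools import combinations

def permutation_p_value(group_a, group_b):
    """Share of all possible regroupings whose absolute difference in means
    is at least as large as the observed difference."""
    pooled = group_a + group_b
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    extreme = total = 0
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        diff = abs(sum(a) / len(a) - sum(b) / len(b))
        if diff >= observed - 1e-12:   # small tolerance for float rounding
            extreme += 1
        total += 1
    return extreme / total

# Hypothetical groups, for illustration only
p = permutation_p_value([1, 2, 3], [4, 5, 6])
print(p)
```

With only 20 possible regroupings, just 2 are as extreme as the observed split, giving p = 0.1: even a perfectly separated pair of groups is not very convincing when the samples are this small.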



55/55

Estimation Methods for Replacing Missing Values

Series mean. Replaces missing values with the mean for the entire series.

Mean of nearby points. Replaces missing values with the mean of valid surrounding values. The span of nearby points is the number of valid values above and below the missing value used to compute the mean.

Median of nearby points. Replaces missing values with the median of valid surrounding values. The span of nearby points is the number of valid values above and below the missing value used to compute the median.

Linear interpolation. Replaces missing values using a linear interpolation. The last valid value before the missing value and the first valid value after the missing value are used for the interpolation. If the first or last case in the series has a missing value, the missing value is not replaced.

Linear trend at point. Replaces missing values with the linear trend for that point. The existing series is regressed on an index variable scaled 1 to n. Missing values are replaced with their predicted values.
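Two of these methods, series mean and linear interpolation, are sketched below in plain Python. This is an illustration of the descriptions above, not the implementation of any particular statistics package; missing values are represented by `None`, and the function names and the series are hypothetical.

```python
def fill_series_mean(series):
    """Replace every missing value with the mean of all valid values."""
    valid = [v for v in series if v is not None]
    mean = sum(valid) / len(valid)
    return [mean if v is None else v for v in series]

def fill_linear(series):
    """Interpolate between the last valid value before a gap and the first
    valid value after it. Leading and trailing gaps are left unreplaced."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled

# Hypothetical series with gaps at the start, in the middle and at the end
series = [None, 10, None, 14, None, None, 20, None]

by_mean = fill_series_mean(series)
by_line = fill_linear(series)
print(by_mean)
print(by_line)
```

The series mean flattens every gap to the same value, while interpolation follows the local direction of the series, which usually matters for time-ordered data.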
