g7-quantitative
QUANTITATIVE DATA ANALYSIS
INSTRUCTOR: DR. TUNG NGUYEN
GROUP 7
MEMBERS: Ly Ngoc Tra An, Ngo Huong Giang, Tran Nhu Hanh, Tran Thi My Hanh, Nguyen Thi Hong Tham, Nguyen Thi Thao Tien
OUTLINE
Data analysis
• Central tendency: mean, median, mode
• Spread of distribution: range, variance, standard deviation
Experimental
• Paired t-test
• ANOVA
CENTRAL TENDENCY
The term central tendency refers to the "middle" value or perhaps a typical value of the data, and is measured using the mean, median, or mode. Each of these measures is calculated differently, and the one that is best to use depends upon the situation.
In statistics, the term central tendency relates to the way in which quantitative data tend to cluster around some value
In the simplest cases, the measure of central tendency is an average of a set of measurements, the word average being variously construed as mean, median, or other measure of location, depending on the context.
Both "central tendency" and "measure of central tendency" apply to either statistical populations or to samples from a population.
MEASURES OF CENTRAL TENDENCY
Arithmetic mean (or simply, mean): the sum of all measurements divided by the number of observations in the data set
The mean is the most commonly-used measure of central tendency. When we talk about an "average", we usually are referring to the mean. The mean is simply the sum of the values divided by the total number of items in the set. The result is referred to as the arithmetic mean. Sometimes it is useful to give more weighting to certain data points, in which case the result is called the weighted arithmetic mean.
The mean is valid only for interval data or ratio data. Since it uses the values of all of the data points in the population or sample, the mean is influenced by outliers that may be at the extremes of the data set.
MEDIAN: THE MIDDLE VALUE THAT SEPARATES THE HIGHER HALF FROM THE LOWER HALF OF THE DATA SET
The median is determined by sorting the data set from lowest to highest values and taking the data point in the middle of the sequence. There is an equal number of points above and below the median. For example, in the data set {1,2,3,4,5} the median is 3; there are two data points greater than this value and two data points less than this value. In this case, the median is equal to the mean. But consider the data set {1,2,3,4,10}. In this data set, the median is still 3, but the mean is equal to 4. If there is an even number of data points in the set, then there is no single point at the middle and the median is calculated by taking the mean of the two middle points.
The median can be determined for ordinal data as well as interval and ratio data. Unlike the mean, the median is not influenced by outliers at the extremes of the data set. For this reason, the median often is used when there are a few extreme values that could greatly influence the mean and distort what might be considered typical. This often is the case with home prices and with income data for a group of people, which often is very skewed. For such data, the median often is reported instead of the mean. For example, in a group of people, if the salary of one person is 10 times the mean, the mean salary of the group will be higher because of the unusually large salary. In this case, the median may better represent the typical salary level of the group.
MODE (STATISTICS): THE MOST FREQUENT VALUE IN THE DATA SET
The mode is the most frequently occurring value in the data set. For example, in the data set {1,2,3,4,4}, the mode is equal to 4. A data set can have more than a single mode, in which case it is multimodal. In the data set {1,1,2,3,3} there are two modes: 1 and 3.
The mode can be very useful for dealing with categorical data. For example, if a sandwich shop sells 10 different types of sandwiches, the mode would represent the most popular sandwich. The mode also can be used with ordinal, interval, and ratio data. However, in interval and ratio scales, the data may be spread thinly with no data points having the same value. In such cases, the mode may not exist or may not be very meaningful.
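The three measures can be checked on the small data sets used above. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative only.

```python
from statistics import mean, median, multimode

symmetric = [1, 2, 3, 4, 5]
skewed = [1, 2, 3, 4, 10]   # one extreme value pulls the mean upward

print(mean(symmetric), median(symmetric))  # 3 3
print(mean(skewed), median(skewed))        # 4 3
print(median([1, 2, 3, 4]))                # 2.5 (mean of the two middle points)

# multimode returns every value tied for most frequent
print(multimode([1, 2, 3, 4, 4]))          # [4]
print(multimode([1, 1, 2, 3, 3]))          # [1, 3]
```

The outlier changes the mean from 3 to 4 but leaves the median at 3, which is why the median is preferred for skewed data.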
WHEN TO USE MEAN, MEDIAN, AND MODE
Measurement Scale        Best Measure of the "Middle"
Nominal (Categorical)    Mode
Ordinal                  Median
Interval                 Symmetrical data: Mean; Skewed data: Median
Ratio                    Symmetrical data: Mean; Skewed data: Median
A RANGE, A VARIANCE, AND A STANDARD DEVIATION
RANGE
Range = The range indicates the distance between the two most extreme scores in a distribution
>>> Range = highest score – lowest score
VARIANCE AND STANDARD DEVIATION
• The variance and standard deviation are two measures of variability that indicate how much the scores are spread out around the mean
• We use the mean as our reference point since it is at the center of the distribution
Variance = how spread out the numbers are from the mean, measured as the average squared distance from the mean
Standard Deviation = loosely defined as the average amount a number differs from the mean
We will use the following sample data set to explain the range, variance, and standard deviation:
4, 6, 3, 7, 9, 4, 2, 1, 4, 2
SAMPLE DATA : 4, 6, 3, 7, 9, 4, 2, 1, 4, 2
Range:
R = maximum score - minimum score
In order to figure out the range, A) arrange your data set in order from lowest to highest and B) subtract the lowest number from the highest number.
A) When arranged in order, 4, 6, 3, 7, 9, 4, 2, 1, 4, 2 becomes: 1, 2, 2, 3, 4, 4, 4, 6, 7, 9
B) The lowest number is 1 and the highest number is 9. Therefore, R = 9-1 = 8
The Computational Formula:
S^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N}
From the above formula:
S^2 = variance
Σ = sigma = the sum of (add up all the numbers)
X = the numbers from your data set
X^2 = the numbers from your data set squared
N = the total number of numbers you have in your data set
The easiest way to compute variance with the computational formula is as follows:
A) List each of the numbers in your data set vertically & get the sum of that column
B) Figure out n (count how many numbers you have in your data set)
C) Square each number in your data set and get the sum of that column
A): X      C): X^2
4          4^2 = 16
6          6^2 = 36
3          3^2 = 9
7          7^2 = 49
9          9^2 = 81
4          4^2 = 16
2          2^2 = 4
1          1^2 = 1
4          4^2 = 16
2          2^2 = 4
Σ = 42     Σ = 232
B): N = 10
Now use the sums from parts A) and C), as well as the value for N which you found in part B), to fill in the formula:
S^2 = \frac{232 - \frac{42^2}{10}}{10} = \frac{232 - 176.4}{10} = \frac{55.6}{10}
Do the math and S^2 = 5.56
The Conceptual Formula:
S^2 = \frac{\sum (X - M)^2}{N}
From the above formula:
S^2 = variance
Σ = sigma = the sum of (add up all the numbers)
X = the numbers from your data set
M = the mean
N = the total number of numbers you have in your data set
The easiest way to compute variance with the conceptual formula is as follows:
A) List each of the numbers in your data set vertically & get the sum of that column
B) Figure out n (count how many numbers you have in your data set)
C) Figure out M
D) Subtract M from each number in your data set (Notice how the sum is zero)
E) Square the numbers you got for part D) and get the sum of that column
A): X      D): (X - M)          E): (X - M)^2
4          (4 - 4.2) = -0.2     (-0.2)^2 = 0.04
6          (6 - 4.2) = 1.8      (1.8)^2 = 3.24
3          (3 - 4.2) = -1.2     (-1.2)^2 = 1.44
7          (7 - 4.2) = 2.8      (2.8)^2 = 7.84
9          (9 - 4.2) = 4.8      (4.8)^2 = 23.04
4          (4 - 4.2) = -0.2     (-0.2)^2 = 0.04
2          (2 - 4.2) = -2.2     (-2.2)^2 = 4.84
1          (1 - 4.2) = -3.2     (-3.2)^2 = 10.24
4          (4 - 4.2) = -0.2     (-0.2)^2 = 0.04
2          (2 - 4.2) = -2.2     (-2.2)^2 = 4.84
Σ = 42     Σ = 0                Σ = 55.6
B): N = 10
C): M = 42/10 = 4.2
Now use the sum from part E), as well as the value for N which you found in part B), to fill in the formula:
S^2 = \frac{55.6}{10}
Do the math and S^2 = 5.56
STANDARD DEVIATION:
Standard deviation is simply the square root of the variance. Therefore, it does not matter if you use the computational formula or the conceptual formula to compute variance.
For our sample data set, our variance came out to be 5.56, regardless of the formula used. The standard deviation for our data set then becomes: S = \sqrt{5.56} ≈ 2.36
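The range, variance, and standard deviation worked out above can be reproduced in a few lines. This is a minimal sketch in plain Python; the variable names are illustrative, and both variance formulas divide by N, as in the worked example.

```python
import math

data = [4, 6, 3, 7, 9, 4, 2, 1, 4, 2]
n = len(data)

# Range: highest score minus lowest score
data_range = max(data) - min(data)

# Computational formula: (sum of X^2 - (sum of X)^2 / N) / N
sum_x = sum(data)
sum_x2 = sum(x * x for x in data)
var_computational = (sum_x2 - sum_x ** 2 / n) / n

# Conceptual formula: sum of squared deviations from the mean, divided by N
m = sum_x / n
var_conceptual = sum((x - m) ** 2 for x in data) / n

# Standard deviation: square root of the variance
sd = math.sqrt(var_conceptual)

print(data_range, round(var_computational, 2), round(var_conceptual, 2), round(sd, 2))
# 8 5.56 5.56 2.36
```

Both formulas agree because they are algebraically equivalent. Dividing by N - 1 instead of N (the usual sample estimate) would give a slightly larger variance for the same data.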
INDEPENDENT SAMPLES
• The independent samples t-test is used when two separate sets of independent and identically distributed samples are obtained, one from each of the two populations being compared.
• E.g: suppose we are evaluating the effect of a medical treatment, and we enroll 100 subjects into our study, then randomize 50 subjects to the treatment group and 50 subjects to the control group. In this case, we have two independent samples and would use the unpaired form of the t-test. The randomization is not essential here—if we contacted 100 people by phone and obtained each person's age and gender, and then used a two-sample t-test to see whether the mean ages differ by gender, this would also be an independent samples t-test, even though the data are observational.
INDEPENDENT DATA ANALYSIS
Calculations:
a. Equal sample sizes, equal variance
b. Unequal sample sizes, equal variance
c. Unequal sample sizes, unequal variance
A. EQUAL SAMPLE SIZES, EQUAL VARIANCE
This test is only used when both:
the two sample sizes (that is, the number, n, of participants of each group) are equal;
it can be assumed that the two distributions have the same variance.
B. UNEQUAL SAMPLE SIZES, EQUAL VARIANCE
This test is used only when it can be assumed that the two distributions have the same variance.
C. UNEQUAL SAMPLE SIZES, UNEQUAL VARIANCE
This test, also known as Welch's t-test, is used only when the two population variances are assumed to be different (the two sample sizes may or may not be equal) and hence must be estimated separately.
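The pooled-variance and Welch versions of the test can be sketched as follows. The function names and the two small data sets are hypothetical, chosen only for illustration; the formulas are the standard ones for each case. Case (a) is the pooled test with n1 = n2.

```python
import math
from statistics import mean, variance  # variance() divides by n - 1

def pooled_t(a, b):
    """Independent samples t-test assuming equal variances (cases a and b)."""
    n1, n2 = len(a), len(b)
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (mean(a) - mean(b)) / se, n1 + n2 - 2

def welch_t(a, b):
    """Welch's t-test: variances estimated separately (case c)."""
    n1, n2 = len(a), len(b)
    v1, v2 = variance(a) / n1, variance(b) / n2
    t = (mean(a) - mean(b)) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

group_a = [5, 7, 9]          # hypothetical data
group_b = [1, 2, 3, 4, 5]    # hypothetical data

t_pooled, df_pooled = pooled_t(group_a, group_b)
t_welch, df_welch = welch_t(group_a, group_b)
print(round(t_pooled, 3), df_pooled)
print(round(t_welch, 3), round(df_welch, 2))
```

With unequal sample sizes and variances the two tests give different t values and different degrees of freedom, which is why the choice of test should be made before looking at the result.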
WORKED EXAMPLE
• A study of the effect of caffeine on muscle metabolism used eighteen male volunteers who each underwent arm exercise tests. Nine of the men were randomly selected to take a capsule containing pure caffeine one hour before the test. The other men received a placebo capsule. During each exercise the subject's respiratory exchange ratio (RER) was measured. (RER is the ratio of CO2 produced to O2 consumed and is an indicator of whether energy is being obtained from carbohydrates or fats).
• Question: whether, on average, caffeine changes RER.
• Populations: "men who have not taken caffeine" and "men who have taken caffeine". (If caffeine has no effect on RER the two sets of data can be regarded as having come from the same population.)
Placebo          Caffeine
105              96
119              99
100              94
97               89
96               96
101              93
94               88
95               105
98               88
Mean = 100.56    Mean = 94.22
SD = 7.70        SD = 5.61
• The means show that, on average, caffeine appears to have altered RER from about 100.6% to 94.2%, a change of 6.4%. However, there is a great deal of variation between the data values in both samples and considerable overlap between them.
• Is the difference between the two means simply due to sampling variation, or do the data provide evidence that caffeine does, on average, reduce RER? The p-value answers this question.
• The t-test tests the null hypothesis that the mean of the caffeine treatment equals the mean of the placebo versus the alternative hypothesis that the mean of the caffeine treatment is not equal to the mean of the placebo treatment.
• Computer output obtained for the RER data gives the sample means and the 95% confidence interval for the difference between the means.
COMPUTER OUTPUT
The p-value is 0.063 and, therefore, the difference between the two means is not statistically significantly different from zero at the 5% level of significance. There is an estimated change of 6.4% (SE = 3.17%). However, there is insufficient evidence (p = 0.063) to suggest that caffeine does change the mean RER.
Alternative suggestion
It could be argued, however, that the researcher might only be interested in whether 'caffeine reduces RER'. That is, the researcher is looking for a specific direction for the difference between the two population means. This is an example of a one-tailed t-test as opposed to the two-tailed t-test outlined above.
SPSS only performs a 2-tailed test (the non-directional alternative hypothesis) and to obtain the p-value for the directional alternative hypothesis (one-tailed test) the p-value should be halved. Hence, in this example, p = 0.032.
Report: The mean RER in the caffeine group (94.2 ± 1.9) was significantly lower (t = 1.99, 16 df, one-tailed t-test, p = 0.032) than the mean of the placebo group (100.6 ± 2.6).
Note: It is important to decide whether a one- or two-tailed test is being carried out before the analysis takes place. Otherwise it might be tempting to see what the p-value is before making your decision!
A suitable null hypothesis in both cases is
H0: On average, caffeine has no effect on RER, with an alternative (or experimental) hypothesis,
H1: On average, caffeine changes RER (2-tail test), or H1: On average, caffeine reduces RER (1-tail case).
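The figures reported for the caffeine example can be reproduced from the raw data. This is a sketch using only the standard library: the helper names `t_pdf` and `two_tailed_p` are hypothetical, and the p-value is approximated by numerically integrating the t density rather than by calling a statistics package.

```python
import math
from statistics import mean, stdev

placebo = [105, 119, 100, 97, 96, 101, 94, 95, 98]
caffeine = [96, 99, 94, 89, 96, 93, 88, 105, 88]

n1, n2 = len(placebo), len(caffeine)
diff = mean(placebo) - mean(caffeine)
# With equal sample sizes this equals the pooled standard error
se = math.sqrt(stdev(placebo) ** 2 / n1 + stdev(caffeine) ** 2 / n2)
t = diff / se
df = n1 + n2 - 2

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on the density over [0, |t|]."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)

p_two = two_tailed_p(t, df)
p_one = p_two / 2   # directional alternative: halve the two-tailed p-value

print(round(se, 2), round(t, 2), df)   # 3.17 1.99 16
print(round(p_two, 3), round(p_one, 3))
```

The two printed p-values should land close to the 0.063 and 0.032 quoted above; small differences in the last digit can arise from rounding.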
2. ONE SAMPLE T-TEST
Compare the mean score of a sample to a known value. Usually, the known value is a population mean.
Assumption:
The dependent variable is normally distributed.
In testing the null hypothesis that the population mean is equal to a specified value μ, use the statistic:
t = \frac{\bar{x} - \mu}{S / \sqrt{n}}
\bar{x}: sample mean
S: sample standard deviation
n: sample size
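The one-sample statistic is the difference between the sample mean and μ, divided by the standard error S/√n, with n - 1 degrees of freedom. A minimal sketch follows; the function name `one_sample_t` and the scores are hypothetical.

```python
import math
from statistics import mean, stdev

def one_sample_t(sample, mu):
    """Return (t, df) for testing whether the population mean equals mu."""
    n = len(sample)
    standard_error = stdev(sample) / math.sqrt(n)  # stdev() divides by n - 1
    return (mean(sample) - mu) / standard_error, n - 1

scores = [2, 4, 6]                      # hypothetical sample
t_stat, dof = one_sample_t(scores, mu=3)
print(round(t_stat, 3), dof)            # 0.866 2
```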
2. PAIRED SAMPLES T-TEST
What it does:
• compares the means of two variables
• computes the difference between the two variables for each case, and tests whether the average difference is significantly different from zero
Assumption: Both variables should be normally distributed.
Hypothesis:
• Null: There is no significant difference between the means of the two variables.
• Alternate: There is a significant difference between the means of the two variables.
Difference between a paired samples t-test and an independent samples t-test?
Both tests are used to find significant differences between groups, but the independent samples t-test assumes the groups are not related to each other, while the dependent samples t-test or paired samples t-test assumes the groups are related to each other.
A dependent samples t-test or paired samples t-test would be used to find differences within groups, while the independent samples t-test would be used to find differences between groups.
Independent variable and dependent variable: The independent and dependent variables are the same in both the dependent samples t-test and the independent samples t-test. The measured variable of interest is the dependent variable, and the grouping variable is the independent variable.
The most common use of the dependent samples t-test is in a pretreatment vs. posttreatment scenario where the researcher wants to test the effectiveness of a treatment.
1. The participants are tested pretreatment, to establish some kind of a baseline measure
2. The participants are then exposed to some kind of treatment
3. The participants are then tested posttreatment, for the purposes of comparison with the pretreatment scores
For this equation, the differences between all pairs must be calculated. The pairs are either one person's pre-test and post-test scores or between pairs of persons matched into meaningful groups. The average and standard deviation of those differences are used in the equation. The degree of freedom used is n − 1.
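The calculation described above can be sketched as follows: take the difference for each pair, then divide the mean difference by its standard error, with n - 1 degrees of freedom. The function name `paired_t` and the pre/post scores are hypothetical and are not the data behind the SPSS output.

```python
import math
from statistics import mean, stdev

def paired_t(before, after):
    """Return (t, df) for a paired samples t-test on before - after."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1

pre = [60, 62, 70, 58]    # hypothetical pre-test scores
post = [63, 66, 70, 63]   # hypothetical post-test scores

t_paired, df_paired = paired_t(pre, post)
print(round(t_paired, 3), df_paired)   # -2.777 3
```

The sign of t depends on the order of subtraction: here the pre-test minus post-test difference is negative because post-test scores are higher.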
EXAMPLE: SPSS OUTPUT
We compared the mean test scores before (pre-test) and after (post-test) the subjects completed a test preparation course.
We want to see if our test preparation course improved people's score on the test
The post-test mean scores are higher.
There is a strong positive correlation. People who did well on the pre-test also did well on the post-test.
Remember, this test is based on the difference between the two variables. Under "Paired Differences" we see the descriptive statistics for the difference between the two variables.
The T value = -2.171
We have 11 degrees of freedom
Our significance is .053
If the significance value is less than .05, there is a significant difference. If the significance value is greater than .05, there is no significant difference.
Conclusion: Because .053 is greater than .05, there is no statistically significant difference between pre- and post-test scores. We cannot conclude that our test preparation course helped.
ANOVA
PRESENTER: TRAN NHU HANH
WHAT IS ANOVA?
• ANOVA is an analysis of the variation present in an experiment. It is a test of the hypothesis that the variation in an experiment is no greater than that due to normal variation of individuals' characteristics and error in their measurement.
• ANOVA is a technique from statistical inference that allows us to deal with several populations
TYPES OF ANOVA
1. One-way ANOVA
2. Two-way ANOVA
ONE-WAY ANOVA DEFINITION
• A One-way ANOVA is used when comparing two or more group means on a continuous dependent variable. In other words, one-way ANOVA techniques can be used to study the effect of k(>2) levels of a single factor.
• The independent T-Test is a special case of the One-way ANOVA for situations where there are only two group means
MAJOR CONCEPTS:
1. CALCULATING SUMS OF SQUARES
• The One-way ANOVA separates the total variance in the continuous dependent variable into two components: variability between the groups and variability within the groups
• Variability between the groups is calculated by first obtaining the sums of squares between groups (SSb), or the sum of the squared differences between each individual group mean and the grand mean
• Variability within the groups is calculated by first obtaining the sums of squares within groups (SSw), or the sum of the squared differences between each individual score and that individual's group mean
TYPES OF VARIABLES FOR ONE-WAY ANOVA
• The IV (Independent Variable) is categorical. The categorical IV can be two groups or it can have more than two groups.
• The DV (Dependent Variable) is continuous
• Data are collected on both variables for each person in the study.
EXAMPLES OF RESEARCH QUESTIONS FOR ONE-WAY ANOVA
1. Is there a significant difference in student attitudes toward the course between students who pass or fail a course?
• Student attitude is continuous
• Passing a course is categorical (pass/fail)
Because the IV has only 2 groups, we can use independent T-Test
2. Does student satisfaction significantly differ by location of institution (rural, urban, suburban)?
• Student satisfaction is continuous
• Institution location is categorical
The linear model, conceptually, is:
SSt = SSb + SSw
SSt: total sums of squares
SSb: sums of squares between groups
SSw: sums of squares within groups
ONE-WAY ANOVA AS A RATIO OF VARIANCES:
Formula for variance:
Numerator: a sum of squared values (or a sums of squares)
Denominator: degrees of freedom
• The ANOVA analyzes the ratio of the variance between groups to the variance within the groups
• In ANOVA, these variances, formerly known to us as S^2, are referred to as mean squares (MS). Mean squares are calculated by dividing each sum of squares by the degrees of freedom associated with it.
• Thus, a mean square between is simply the variance between groups, obtained by dividing a sum of squares by its degrees of freedom
• Likewise, a mean square within is simply the variance within groups, obtained by dividing a sum of squares by its degrees of freedom
FACTORS THAT AFFECT SIGNIFICANCE
F -ratio: the variation due to an experimental treatment or effect divided by the variation due to experimental error. The null hypothesis is this ratio equals 1.0, or the treatment effect is the same as the experimental error. This hypothesis is rejected if the F-ratio is significantly large enough that the possibility of it equaling 1.0 is smaller than some pre-assigned criteria such as 0.05 (one in twenty)
MSb is then divided by MSw to obtain the F ratio for hypothesis testing
DISTRIBUTION OF F - RATIO
• F distribution is positively skewed
• If F statistic falls near 1.0, then most likely the null is true
• If F statistic is large, expect null is false. Thus, significant F ratios will be in the tail of the F distribution
P VALUE
In statistical hypothesis testing, the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. One often "rejects the null hypothesis" when the p-value is less than the significance level α, which is often 0.05 or 0.01.
• The larger the value of t, the more likely we are to find significant results
• t is a special case of ANOVA when only two groups comprise the independent variable
• We're familiar with the t distribution as approximately normally distributed (for large df), with positive and negative values. The F statistic, on the other hand, is positively skewed and is composed of squared values. Thus, for any two-group situation, t^2 = F
CALCULATIONS
• dfb = k-1(k: numbers of samples/ groups/ levels)
• dfw = N - k (N: total number of individuals across all groups)
• dfT = N -1
• MSb = SSb/ dfb
• MSw = SSw/ dfw
• F = MSb/ MSw
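The calculations listed above can be sketched as a short function. Here it is applied to the caffeine data from the earlier worked example, so the two-group case can be compared with the t-test result (t = 1.99). The function name `one_way_anova` is hypothetical, and SSb is computed with each group mean weighted by its group size.

```python
def one_way_anova(groups):
    """Return sums of squares, degrees of freedom, mean squares, and F."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between groups: squared distance of each group mean from the grand mean
    ss_b = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within groups: squared distance of each score from its own group mean
    ss_w = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_b, df_w = k - 1, n_total - k
    ms_b, ms_w = ss_b / df_b, ss_w / df_w
    return {"SSb": ss_b, "SSw": ss_w, "dfb": df_b, "dfw": df_w,
            "MSb": ms_b, "MSw": ms_w, "F": ms_b / ms_w}

placebo = [105, 119, 100, 97, 96, 101, 94, 95, 98]
caffeine = [96, 99, 94, 89, 96, 93, 88, 105, 88]

result = one_way_anova([placebo, caffeine])
print(round(result["SSb"], 2), round(result["SSw"], 2))        # 180.5 725.78
print(result["dfb"], result["dfw"], round(result["F"], 2))    # 1 16 3.98
```

With two groups, F is about 3.98, which is the square of the t statistic from the independent samples t-test on the same data, illustrating t^2 = F.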
STEPS IN ONE-WAY ANOVA
STEP 1: STATE HYPOTHESES
To determine if different levels of factor affect measured observations differently, the following hypotheses are tested.
• There is no significant difference among groups in variable X
• There is a significant difference between at least two of the groups in the variable X. In other words, at least one mean will significantly differ.
STEP 2: SET THE CRITERION FOR REJECTING H0
STEP 3: COMPUTE TEST STATISTIC
STEP 4: COMPARE TEST STATISTIC TO CRITERION
STEP 5: MAKE DECISION
• Fail to reject the null hypothesis and conclude that there is no significant difference among the groups: F(dfb, dfw) = insert F statistic, p > insert α
• Reject the null hypothesis and conclude that there is a significant difference among the groups: F(dfb, dfw) = insert F statistic, p < insert α
Difference between one-way and two-way ANOVA
ANOVA Test
TWO-WAY ANOVA
ONE-WAY ANOVA
• One-Way ANOVA has one independent variable (1 factor) with > 2 conditions
– conditions = levels = treatments
– e.g., for a brand of cola factor, the levels are:
Coke, Pepsi, RC Cola
• Independent variables = factors
TWO-WAY ANOVA
• Two-Way ANOVA has 2 independent variables (factors)
– each can have multiple conditions
Example
• Two Independent Variables (IVs)
– IV1: Brand; and IV2: Calories
– Three levels of Brand: Coke, Pepsi, RC Cola
– Two levels of Calories: Regular, Diet
WHEN TO USE
• One-way ANOVA: you have more than two levels (conditions) of a single IV
– EXAMPLE: studying effectiveness of three types of pain reliever: aspirin vs. tylenol vs. ibuprofen
• Two-way ANOVA: you have more than one IV (factor)
– EXAMPLE: studying pain relief based on pain reliever and type of pain
• Factor A: pain reliever (aspirin vs. tylenol)
• Factor B: type of pain (headache vs. back pain)
NOTATION
Two factors: Factor A and Factor B.
a: the number of categories of Factor A
b: the number of categories of Factor B
The total number of groups is ab.
N: the total number of observations
Y_ijk: the response/dependent variable value for each observation, where i is the subject's category for Factor A and j is the subject's category for Factor B. Together, i and j identify a group, and k denotes which individual we're talking about within this particular group.
n: the number of observations in each group, so N = abn.
Example: how the number of hours of TV people watch per week depends on two variables, gender and age. Each person is classified according to gender (male, female) and age (18–24, 25–54, 55+).
There are six groups, one for each combination of gender and age. We randomly sample five people from each group, and each person reports the time, in hours, that he or she watches TV per week. The data are shown in the table below.
          Age 18–24             Age 25–54             Age 55+
Male      20, 27, 20, 22, 28    23, 21, 23, 28, 28    33, 33, 39, 33, 37
Female    25, 19, 27, 32, 31    32, 26, 33, 33, 24    44, 43, 52, 43, 54
TWO-WAY ANOVA TABLE
1. Sums of squares.
2. Degrees of freedom.
3. Mean squares.
There are three main questions that we might ask in two-way ANOVA:
• Does the response variable depend on Factor A?
• Does the response variable depend on Factor B?
• Does the response variable depend on Factor A differently for different values of Factor B, and vice versa?
Whether TV viewing time depends on age and gender.
The third question asks whether TV viewing time depends on gender differently for people of different ages, or whether TV viewing time depends on age differently for men than for women.
(For example, perhaps it’s true that women 55+ watch more TV than men 55+, but women 18–24 watch less TV than men 18–24.)
1. Sums of Squares
Two-way ANOVA involves five different sums of squares:
• The total sum of squares, SS_Tot, measures the total variability in the response variable values. Its formula is
SS_{Tot} = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n} (Y_{ijk} - \bar{Y}_{\bullet\bullet\bullet})^2
• The Factor A sum of squares, SS_A, measures the variability that can be explained by differences in Factor A. Its formula is
SS_A = bn \sum_{i=1}^{a} (\bar{Y}_{i\bullet\bullet} - \bar{Y}_{\bullet\bullet\bullet})^2
In these formulas:
\bar{Y}_{ij\bullet} represents the sample mean of the group in category i of Factor A and category j of Factor B (always an average of n observations).
\bar{Y}_{i\bullet\bullet} represents the sample mean of all the data in category i of Factor A combined (always an average of bn observations).
\bar{Y}_{\bullet j\bullet} represents the sample mean of all the data in category j of Factor B combined (always an average of an observations).
\bar{Y}_{\bullet\bullet\bullet} represents the overall sample mean of all the data from all groups combined (always an average of all abn = N observations).
• The Factor B sum of squares, SS_B, measures the variability that can be explained by differences in Factor B. Its formula is
SS_B = an \sum_{j=1}^{b} (\bar{Y}_{\bullet j\bullet} - \bar{Y}_{\bullet\bullet\bullet})^2
• The interaction sum of squares, SS_AB, measures the variability that can be explained by interaction between the effects of Factors A and B. (We'll talk more about what this means later.) Its formula is
SS_{AB} = n \sum_{i=1}^{a} \sum_{j=1}^{b} (\bar{Y}_{ij\bullet} - \bar{Y}_{i\bullet\bullet} - \bar{Y}_{\bullet j\bullet} + \bar{Y}_{\bullet\bullet\bullet})^2
• The error sum of squares, SS_E, measures the variability of the observations around their group sample means. Its formula is
SS_E = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n} (Y_{ijk} - \bar{Y}_{ij\bullet})^2
• If we call the sample standard deviation within each group s_ij, then another formula for SS_E is
SS_E = (n - 1) \sum_{i=1}^{a} \sum_{j=1}^{b} s_{ij}^2
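The five sums of squares, together with the degrees of freedom, mean squares, and F ratios that make up the two-way ANOVA table, can be sketched for the TV viewing example. The assignment of values to cells below assumes each run of digits in the original table holds five two-digit observations per gender and age group. Variable names are illustrative, and the formulas are the standard ones for a balanced design with n observations per group.

```python
data = {
    ("Male", "18-24"): [20, 27, 20, 22, 28],
    ("Male", "25-54"): [23, 21, 23, 28, 28],
    ("Male", "55+"): [33, 33, 39, 33, 37],
    ("Female", "18-24"): [25, 19, 27, 32, 31],
    ("Female", "25-54"): [32, 26, 33, 33, 24],
    ("Female", "55+"): [44, 43, 52, 43, 54],
}
genders = ["Male", "Female"]        # Factor A
ages = ["18-24", "25-54", "55+"]    # Factor B
a, b, n = len(genders), len(ages), 5

def avg(values):
    values = list(values)
    return sum(values) / len(values)

all_obs = [y for cell in data.values() for y in cell]
grand = avg(all_obs)
cell_mean = {key: avg(cell) for key, cell in data.items()}
a_mean = {g: avg(y for age in ages for y in data[(g, age)]) for g in genders}
b_mean = {age: avg(y for g in genders for y in data[(g, age)]) for age in ages}

ss_tot = sum((y - grand) ** 2 for y in all_obs)
ss_a = b * n * sum((a_mean[g] - grand) ** 2 for g in genders)
ss_b = a * n * sum((b_mean[age] - grand) ** 2 for age in ages)
ss_ab = n * sum((cell_mean[(g, age)] - a_mean[g] - b_mean[age] + grand) ** 2
                for g in genders for age in ages)
ss_e = sum((y - cell_mean[key]) ** 2 for key, cell in data.items() for y in cell)

# Degrees of freedom and mean squares for the ANOVA table
df_a, df_b, df_ab, df_e = a - 1, b - 1, (a - 1) * (b - 1), a * b * (n - 1)
ms_e = ss_e / df_e
f_a = (ss_a / df_a) / ms_e
f_b = (ss_b / df_b) / ms_e
f_ab = (ss_ab / df_ab) / ms_e

for name, ss, df in [("A", ss_a, df_a), ("B", ss_b, df_b),
                     ("AB", ss_ab, df_ab), ("Error", ss_e, df_e)]:
    print(name, round(ss, 2), df)
print("F:", round(f_a, 2), round(f_b, 2), round(f_ab, 2))
```

A useful check is that the four component sums of squares add up to the total sum of squares, and the four degrees of freedom add up to N - 1.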
Degrees of freedom
Mean squares.
ANOVA TABLE
Using statistical software-
TWO-WAY ANOVA HYPOTHESIS TESTS
• Does the response variable depend on Factor A?
• Does the response variable depend on Factor B?
• Does the response variable depend on Factor A differently for different values of Factor B, and vice versa?
Main effects
Interaction
Interaction :
We say that there is interaction if Y depends on Factor A differently for different values of Factor B, and vice versa.
Similarly, we say that there is NO interaction if Y depends on Factor A in the same way for all values of Factor B, and vice versa.
HYPOTHESES
In the test for interaction, the null hypothesis (Ho) is that there is no interaction, while the alternative hypothesis (Ha) is that there is interaction.
There is no interaction on the left. For each age group, women average watching five more hours of TV per week than men. For each gender, the middle age group averages watching six more hours of TV per week than the youngest age group, and the oldest age group averages watching nine more hours of TV per week than the middle age group.
• There is interaction on the right. For each age group, women average watching more TV than men, but how much more varies for the different age groups. Also, for each gender, older people average watching more TV, but how much more varies by gender.
ASSUMPTIONS
The assumptions for the two-way ANOVA F test for interaction are exactly the same as those of the one-way ANOVA F test, with one additional requirement: the number of observations should be the same for all groups.
TEST STATISTIC
P-VALUE
DECISION
• If we believe there is interaction, then we don't bother to ask whether the response depends on Factor A or Factor B separately. The fact that there is interaction means that the response depends on Factor A differently for different values of Factor B, and vice versa. So we stop here and do not perform the tests for main effects (which we'll talk about in the next subsection).
• If we believe it's reasonable that there is no interaction, then that means we can look at the effects of Factor A and Factor B separately, so we proceed to the tests for main effects.