© 2007 Prentice Hall 16-1 Some Preliminaries

Upload: vernon-tucker

Post on 29-Jan-2016


Some Preliminaries


Basics of Analysis

The process of data analysis: Observation -> (Encode) -> Data -> (Analysis) -> Information

Example 1: Gift Catalog Marketer
- Mails 4 times a year to its customers
- Company has 1 million customers on its file


Example 1

- The cataloger would like to know whether new customers buy more than old customers.
- Classify as a new customer anyone who bought within the last twelve months.
- The analyst takes a sample of 100,000 customers and notices the following.


Example 1

5000 orders received in the last month

3000 (60%) were from new customers

2000 (40%) were from old customers

So it looks like the new customers are doing better.


Example 1

Is there a catch here?

Data at this gross level do not discriminate between customers within either group. A customer who bought within the last 11 days is treated exactly the same as a customer who bought within the last 11 months.


Example 1

Can we use some other variable to distinguish between old and new customers?

Answer: actual dollars spent.

What can we do with this variable? Find its mean and variation.

We might find that the average purchase amount for old customers is two or three times larger than the average among new customers.


Numerical Summaries of data

The two basic concepts are the center and the spread of the data.

Center of data:
- Mean, which is given by $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
- Median
- Mode


Numerical Summaries of data

Forms of Variation

- Sum of differences about the mean: $\sum_{i=1}^{n} (x_i - \bar{x})$
- Variance: $\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
- Standard Deviation: square root of the variance
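The summaries above can be sketched in a few lines of Python. The purchase amounts are made-up illustrative values, not data from the catalog example. Note that the sum of differences about the mean is always zero, which is why the variance squares the deviations before adding them up.

```python
import math

# Hypothetical purchase amounts in dollars (illustrative values only).
amounts = [20.0, 35.0, 50.0, 65.0, 80.0]

n = len(amounts)
mean = sum(amounts) / n                       # center of the data

deviations = [x - mean for x in amounts]
sum_of_deviations = sum(deviations)           # always 0 (up to rounding)

# Sample variance with the n - 1 denominator, then its square root.
variance = sum(d ** 2 for d in deviations) / (n - 1)
std_dev = math.sqrt(variance)

print(f"mean={mean}, variance={variance}, sd={std_dev:.2f}")
```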


Confidence Intervals

- In the catalog example, the analyst wants to know the average purchase amount of customers.
- He draws two samples of 75 customers each and finds the means to be $68 and $122.
- Since the difference is large, he draws another 38 samples of 75 each.
- The mean of the means of the 40 samples turns out to be $94.85.
- How confident should he be of this mean of means?


Confidence Intervals

- The analyst calculates the standard deviation of the sample means, called the standard error (SE). It is 12.91.
- Basic premise for confidence intervals: 95 percent of the time, the true mean purchase amount lies within plus or minus 1.96 standard errors of the mean of the sample means.
- C.I. = Mean ± 1.96 × Standard Error
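Plugging the slide's own numbers into this formula gives the interval directly. A minimal sketch; the variable names are illustrative:

```python
# Values quoted on the slides.
mean_of_means = 94.85     # mean of the 40 sample means, in dollars
standard_error = 12.91    # standard deviation of the sample means
z = 1.96                  # multiplier for a 95 percent interval

margin = z * standard_error
lower = mean_of_means - margin
upper = mean_of_means + margin

print(f"95% C.I.: ${lower:.2f} to ${upper:.2f}")
```

With these inputs the interval runs from roughly $69.55 to $120.15.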


Confidence Intervals

- However, if the C.I. is calculated from only one sample, then

  Standard error of the sample mean = (standard deviation of the sample) / $\sqrt{n}$

- Basic premise for confidence intervals with one sample: 95 percent of the time, the true mean lies within plus or minus 1.96 standard errors of the sample mean.
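A one-sample version can be sketched as follows. The ten purchase amounts and the helper name `one_sample_ci` are hypothetical and only there to make the example runnable. The 1.96 multiplier follows the slides; with a sample this small a t-based multiplier would usually be preferred.

```python
import math
import statistics

def one_sample_ci(sample, z=1.96):
    """Return (lower, upper) for the mean using SE = s / sqrt(n)."""
    n = len(sample)
    s = statistics.stdev(sample)          # sample SD (n - 1 denominator)
    standard_error = s / math.sqrt(n)
    center = statistics.mean(sample)
    return center - z * standard_error, center + z * standard_error

# Hypothetical purchase amounts in dollars (illustrative values only).
sample = [52.0, 75.0, 110.0, 98.0, 140.0, 64.0, 88.0, 121.0, 93.0, 79.0]
low, high = one_sample_ci(sample)
print(f"95% C.I.: ${low:.2f} to ${high:.2f}")
```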


Example 2: Confidence Intervals for response rates

- You are the marketing analyst for Online Apparel Company.
- You want to run a promotion for all customers on your database.
- In the past you have run many such promotions.
- Historically you needed a 4.5% response for the promotions to break even.
- You want to test the viability of the current full-scale promotion by running a small test promotion.


Example 2: Confidence Intervals for response rates

- Test 1,000 names selected at random from a new list.
- To break even, the list must be expected to have a response rate of 4.5 percent.
- Confidence Interval = Expected Response ± 1.96 × SE = p ± 1.96 × SE
- In our case, C.I. = 3.22% to 5.78%. Thus any test response between 3.22% and 5.78% is consistent with the hypothesis that the true response rate is 4.5%.
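The 3.22% to 5.78% interval can be reproduced with a short sketch. The slide does not spell out how SE is obtained; the code assumes the usual standard error of a proportion, sqrt(p(1 - p)/n), which does give the quoted bounds for p = 4.5% and n = 1,000. The function name is illustrative.

```python
import math

def response_rate_ci(p, n, z=1.96):
    """Interval p +/- z * SE, assuming SE = sqrt(p * (1 - p) / n)."""
    standard_error = math.sqrt(p * (1 - p) / n)
    return p - z * standard_error, p + z * standard_error

# Break-even response rate and test size from the slides.
low, high = response_rate_ci(p=0.045, n=1000)
print(f"C.I.: {low:.2%} to {high:.2%}")
```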


Example 2: Confidence Intervals for response rates

- The list is mailed and actually pulls in 3.5%. Since this lies inside the interval, the true response rate may still be 4.5%.
- What if the actual rate pulled in were 5%?
- Regression towards the mean: the phenomenon of a test result being different from the true result.
- Give more thought to lists whose cutoff rates lie within the confidence interval.


Frequency Distribution and Cross-Tabulation



Chapter Outline

1) Frequency Distribution

2) Statistics Associated with Frequency Distribution

i. Measures of Location

ii. Measures of Variability

iii. Measures of Shape

3) Cross-Tabulations

i. Two Variable Case

ii. Three Variable Case

iii. General Comments on Cross-Tabulations

4) Statistics for Cross-Tabulation: Chi-Square


Table 15.1: Internet Usage Data

No.  Sex   Familiarity  Internet Usage  Attitude: Internet  Attitude: Technology  Usage: Shopping  Usage: Banking
 1   1.00  7.00         14.00           7.00                6.00                  1.00             1.00
 2   2.00  2.00          2.00           3.00                3.00                  2.00             2.00
 3   2.00  3.00          3.00           4.00                3.00                  1.00             2.00
 4   2.00  3.00          3.00           7.00                5.00                  1.00             2.00
 5   1.00  7.00         13.00           7.00                7.00                  1.00             1.00
 6   2.00  4.00          6.00           5.00                4.00                  1.00             2.00
 7   2.00  2.00          2.00           4.00                5.00                  2.00             2.00
 8   2.00  3.00          6.00           5.00                4.00                  2.00             2.00
 9   2.00  3.00          6.00           6.00                4.00                  1.00             2.00
10   1.00  9.00         15.00           7.00                6.00                  1.00             2.00
11   2.00  4.00          3.00           4.00                3.00                  2.00             2.00
12   2.00  5.00          4.00           6.00                4.00                  2.00             2.00
13   1.00  6.00          9.00           6.00                5.00                  2.00             1.00
14   1.00  6.00          8.00           3.00                2.00                  2.00             2.00
15   1.00  6.00          5.00           5.00                4.00                  1.00             2.00
16   2.00  4.00          3.00           4.00                3.00                  2.00             2.00
17   1.00  6.00          9.00           5.00                3.00                  1.00             1.00
18   1.00  4.00          4.00           5.00                4.00                  1.00             2.00
19   1.00  7.00         14.00           6.00                6.00                  1.00             1.00
20   2.00  6.00          6.00           6.00                4.00                  2.00             2.00
21   1.00  6.00          9.00           4.00                2.00                  2.00             2.00
22   1.00  5.00          5.00           5.00                4.00                  2.00             1.00
23   2.00  3.00          2.00           4.00                2.00                  2.00             2.00
24   1.00  7.00         15.00           6.00                6.00                  1.00             1.00
25   2.00  6.00          6.00           5.00                3.00                  1.00             2.00
26   1.00  6.00         13.00           6.00                6.00                  1.00             1.00
27   2.00  5.00          4.00           5.00                5.00                  1.00             1.00
28   2.00  4.00          2.00           3.00                2.00                  2.00             2.00
29   1.00  4.00          4.00           5.00                3.00                  1.00             2.00
30   1.00  3.00          3.00           7.00                5.00                  1.00             2.00


Frequency Distribution

In a frequency distribution, one variable is considered at a time.

A frequency distribution for a variable produces a table of frequency counts, percentages, and cumulative percentages for all the values associated with that variable.


Table 15.2: Frequency Distribution of Familiarity with the Internet

Value label       Value  Frequency (N)  Percentage  Valid percentage  Cumulative percentage
Not so familiar   1       0               0.0         0.0               0.0
                  2       2               6.7         6.9               6.9
                  3       6              20.0        20.7              27.6
                  4       6              20.0        20.7              48.3
                  5       3              10.0        10.3              58.6
                  6       8              26.7        27.6              86.2
Very familiar     7       4              13.3        13.8             100.0
Missing           9       1               3.3
TOTAL                    30             100.0       100.0
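As a sketch of how such a table is built, the code below tallies the Familiarity column of Table 15.1, treating the code 9 as missing. Variable names are illustrative; the percentages it prints match Table 15.2.

```python
from collections import Counter

# Familiarity column of Table 15.1, respondents 1 to 30.
familiarity = [7, 2, 3, 3, 7, 4, 2, 3, 3, 9, 4, 5, 6, 6, 6,
               4, 6, 4, 7, 6, 6, 5, 3, 7, 6, 6, 5, 4, 4, 3]
MISSING = 9  # code used for a missing response

counts = Counter(familiarity)
total = len(familiarity)
valid_total = total - counts[MISSING]

rows = []
cumulative = 0.0
for value in range(1, 8):
    freq = counts[value]
    pct = 100 * freq / total              # share of all 30 respondents
    valid_pct = 100 * freq / valid_total  # share of the 29 valid answers
    cumulative += valid_pct
    rows.append((value, freq, round(pct, 1),
                 round(valid_pct, 1), round(cumulative, 1)))

for row in rows:
    print(row)
```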


Frequency Histogram

[Fig. 15.1: histogram of Familiarity (x-axis, values 2 to 7) against Frequency (y-axis, 0 to 8)]


Statistics for Frequency Distribution: Measures of Location

The mean, or average value, is the most commonly used measure of central tendency. The mean, $\bar{X}$, is given by

$\bar{X} = \sum_{i=1}^{n} X_i / n$

where
$X_i$ = observed values of the variable X
$n$ = number of observations (sample size)

The mode is the value that occurs most frequently. The mode is a good measure of location when the variable is inherently categorical or has otherwise been grouped into categories.


Statistics for Frequency Distribution: Measures of Location

The median of a sample is the middle value when the data are arranged in ascending or descending order.

If the number of data points is even, the median is the midpoint between the two middle values. The median is the 50th percentile.


Statistics for Frequency Distribution: Measures of Variability

The range measures the spread of the data: it is the difference between the largest and smallest values.

The variance is the mean squared deviation from the mean. The variance can never be negative.

The standard deviation is the square root of the variance.

The coefficient of variation is the ratio of the standard deviation to the mean expressed as a percentage, and is a unitless measure of relative variability.

$CV = s_x / \bar{X}$
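The measures of location and variability can be tried out on the 29 valid Familiarity scores from Table 15.1 (the missing code 9 is dropped). This is a sketch using Python's `statistics` module; `stdev` uses the n - 1 denominator, which is an assumption about how the slides define the sample variance.

```python
import statistics

# Valid Familiarity scores from Table 15.1 (respondent 10, coded 9, removed).
valid = [7, 2, 3, 3, 7, 4, 2, 3, 3, 4, 5, 6, 6, 6, 4,
         6, 4, 7, 6, 6, 5, 3, 7, 6, 6, 5, 4, 4, 3]

mean = statistics.mean(valid)
median = statistics.median(valid)        # middle value of the sorted data
mode = statistics.mode(valid)            # most frequent value
value_range = max(valid) - min(valid)    # largest minus smallest
std_dev = statistics.stdev(valid)        # square root of the sample variance
cv = 100 * std_dev / mean                # coefficient of variation, in percent

print(f"mean={mean:.3f} median={median} mode={mode}")
print(f"range={value_range} sd={std_dev:.3f} CV={cv:.1f}%")
```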


Statistics for Frequency Distribution: Measures of Shape

Skewness is the tendency of the deviations from the mean to be larger in one direction than in the other, that is, the tendency for one tail of the distribution to be heavier than the other.

Kurtosis is a measure of the relative peakedness or flatness of the frequency distribution curve. The kurtosis of a normal distribution is zero.

- If kurtosis > 0, the distribution is more peaked than a normal distribution.
- If kurtosis < 0, the distribution is flatter than a normal distribution.
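The slides do not give formulas for these two measures. The sketch below assumes the simple moment-based definitions (third and fourth central moments scaled by powers of the second, with 3 subtracted so that a normal distribution scores zero). Statistical packages often apply small-sample corrections, so their output can differ slightly. The function name is illustrative.

```python
def shape_measures(values):
    """Return (skewness, excess kurtosis) from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0   # zero for a normal distribution
    return skewness, excess_kurtosis

# Valid Familiarity scores from Table 15.1 (missing code 9 removed).
valid = [7, 2, 3, 3, 7, 4, 2, 3, 3, 4, 5, 6, 6, 6, 4,
         6, 4, 7, 6, 6, 5, 3, 7, 6, 6, 5, 4, 4, 3]

skew, kurt = shape_measures(valid)
print(f"skewness={skew:.3f}, kurtosis={kurt:.3f}")
```

Under these assumptions both values come out negative for the Familiarity data: a slightly heavier left tail and a flatter-than-normal shape.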


Skewness of a Distribution

[Fig. 15.2: two panels, (a) and (b), showing a symmetric distribution and a skewed distribution, each marked with the positions of the mean, median, and mode]


Cross-Tabulation

While a frequency distribution describes one variable at a time, a cross-tabulation describes two or more variables simultaneously.

Cross-tabulation results in tables that reflect the joint distribution of two or more variables with a limited number of categories or distinct values, e.g., Table 15.3.


Table 15.3: Gender and Internet Usage

                  Gender
Internet Usage    Male    Female    Row Total
Light (1)          5      10        15
Heavy (2)         10       5        15
Column Total      15      15


Two Variables Cross-Tabulation

Since two variables have been cross-classified, percentages could be computed either columnwise, based on column totals (Table 15.4), or rowwise, based on row totals (Table 15.5).

The general rule is to compute the percentages in the direction of the independent variable, across the dependent variable. The correct way of calculating percentages is as shown in Table 15.4.


Table 15.4: Internet Usage by Gender

                  Gender
Internet Usage    Male      Female
Light             33.3%     66.7%
Heavy             66.7%     33.3%
Column total      100%      100%


Table 15.5: Gender by Internet Usage

          Internet Usage
Gender    Light     Heavy     Total
Male      33.3%     66.7%     100.0%
Female    66.7%     33.3%     100.0%
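The two ways of percentaging can be sketched from the counts in Table 15.3. The dictionary layout and names are illustrative. Because the counts here are symmetric, both directions produce the same 33.3% and 66.7% figures, which is not true in general.

```python
# Cell counts from Table 15.3: usage level -> gender -> count.
counts = {
    "Light": {"Male": 5, "Female": 10},
    "Heavy": {"Male": 10, "Female": 5},
}
genders = ["Male", "Female"]

# Column-wise: each gender column sums to 100% (as in Table 15.4).
column_totals = {g: sum(counts[u][g] for u in counts) for g in genders}
column_pct = {
    u: {g: round(100 * counts[u][g] / column_totals[g], 1) for g in genders}
    for u in counts
}

# Row-wise: each usage row sums to 100%.
row_pct = {
    u: {g: round(100 * counts[u][g] / sum(counts[u].values()), 1)
        for g in genders}
    for u in counts
}

print("column-wise:", column_pct)
print("row-wise:   ", row_pct)
```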


Introduction of a Third Variable in Cross-Tabulation

[Fig. 15.7: flowchart. The original two variables show either some association or no association. In either case a third variable is introduced. Possible outcomes: refined association between the two variables; no association between the two variables; no change in the initial pattern; some association between the two variables.]


3 Variables Cross-Tab: Refine an Initial Relationship

As can be seen from Table 15.6, 52% (31%) of unmarried (married) respondents fell in the high-purchase category.

Do unmarried respondents purchase more fashion clothing?

A third variable, the buyer's sex, was introduced. As shown in Table 15.7,

- 60% (25%) of unmarried (married) females fell in the high-purchase category.
- 40% (35%) of unmarried (married) males fell in the high-purchase category.

Unmarried respondents are more likely to fall in the high-purchase category than married ones, and this effect is much more pronounced for females than for males.


Table 15.6: Purchase of Fashion Clothing by Marital Status

Purchase of Fashion      Current Marital Status
Clothing                 Married    Unmarried
High                     31%        52%
Low                      69%        48%
Column                   100%       100%
Number of respondents    700        300


Table 15.7: Purchase of Fashion Clothing by Marital Status and Gender

                      Sex
Purchase of Fashion   Male                     Female
Clothing              Married   Not Married    Married   Not Married
High                  35%       40%            25%       60%
Low                   65%       60%            75%       40%
Column totals         100%      100%           100%      100%
Number of cases       400       120            300       180
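As a consistency check, pooling the male and female columns of Table 15.7, weighted by their numbers of cases, recovers the two-variable percentages of Table 15.6. A minimal sketch; the data layout and the function name `pooled_high_share` are illustrative.

```python
# From Table 15.7: (sex, marital status) -> (share in "High", number of cases).
groups = {
    ("Male", "Married"): (0.35, 400),
    ("Male", "Not married"): (0.40, 120),
    ("Female", "Married"): (0.25, 300),
    ("Female", "Not married"): (0.60, 180),
}

def pooled_high_share(marital_status):
    """Share of high purchasers for one marital status, pooled over sex."""
    high = 0.0
    total = 0
    for (_sex, status), (share, cases) in groups.items():
        if status == marital_status:
            high += share * cases
            total += cases
    return high / total

for status in ("Married", "Not married"):
    print(f"{status}: {pooled_high_share(status):.0%} high purchase")
```

The pooled shares come to about 31% for married and 52% for unmarried respondents, the figures shown in Table 15.6.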


3 Variables Cross-Tab: Initial Relationship was Spurious

Table 15.8 shows that 32% (21%) of those with (without) college degrees own an expensive automobile.

Income may also be a factor.

In Table 15.9, when the data for the high-income and low-income groups are examined separately, the association between education and ownership of expensive automobiles disappears.

The initial relationship observed between these two variables was spurious.


Table 15.8: Ownership of Expensive Automobiles by Education Level

Own Expensive       Education
Automobile          College Degree    No College Degree
Yes                 32%               21%
No                  68%               79%
Column totals       100%              100%
Number of cases     250               750


Table 15.9: Ownership of Expensive Automobiles by Education Level and Income Levels

                         Income
                         Low Income                           High Income
Own Expensive Automobile College Degree   No College Degree   College Degree   No College Degree
Yes                      20%              20%                 40%              40%
No                       80%              80%                 60%              60%
Column totals            100%             100%                100%             100%
Number of respondents    100              700                 150              50
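A short sketch shows where the spurious association comes from. Within each income level the ownership rate is identical for both education groups, but college graduates are far more concentrated in the high-income group, so pooling over income reproduces the 32% versus 21% gap of Table 15.8. The data layout and function names are illustrative.

```python
# From Table 15.9: (income, education) -> (share owning, number of respondents).
cells = {
    ("Low", "College"): (0.20, 100),
    ("Low", "No college"): (0.20, 700),
    ("High", "College"): (0.40, 150),
    ("High", "No college"): (0.40, 50),
}

def pooled_ownership(education):
    """Ownership share for one education group, pooled over income."""
    owners = sum(s * n for (_inc, edu), (s, n) in cells.items() if edu == education)
    total = sum(n for (_inc, edu), (_s, n) in cells.items() if edu == education)
    return owners / total

def high_income_share(education):
    """Fraction of an education group that falls in the high-income level."""
    high = cells[("High", education)][1]
    total = high + cells[("Low", education)][1]
    return high / total

for edu in ("College", "No college"):
    print(f"{edu}: {pooled_ownership(edu):.0%} own, "
          f"{high_income_share(edu):.0%} are high income")
```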


3 Variables Cross-Tab: Reveal Suppressed Association

Table 15.10 shows no association between desire to travel abroad and age.

In Table 15.11, sex was introduced as the third variable.

Controlling for the effect of sex, the suppressed association between desire to travel abroad and age is revealed for the separate categories of males and females.

Since the association between desire to travel abroad and age runs in the opposite direction for males and females, the relationship between these two variables is masked when the data are aggregated across sex as in Table 15.10.


Table 15.10: Desire to Travel Abroad by Age

Desire to Travel         Age
Abroad                   Less than 45    45 or More
Yes                      50%             50%
No                       50%             50%
Column totals            100%            100%
Number of respondents    500             500


Desire to Travel Abroad by Age and Gender

Table 15.11


Three Variables Cross-Tabulations: No Change in Initial Relationship

Consider the cross-tabulation of family size and the tendency to eat out frequently in fast-food restaurants as shown in Table 15.12. No association is observed.

When income was introduced as a third variable in the analysis, Table 15.13 was obtained. Again, no association was observed.


Eating Frequently in Fast-Food Restaurants by Family Size

Table 15.12


Eating Frequently in Fast-Food Restaurants by Family Size and Income

Table 15.13


Statistics Associated with Cross-Tab: Chi-Square

- H0: there is no association between the two variables.
- Use the chi-square statistic.
- H0 will be rejected when the calculated value of the test statistic is greater than the critical value of the chi-square distribution.


Statistics Associated with Cross-Tab: Chi-Square

The chi-square statistic ($\chi^2$) compares the observed cell frequencies ($f_o$) to the frequencies to be expected when there is no association between the variables ($f_e$).

The expected frequency for each cell can be calculated by using a simple formula:

$f_e = \frac{n_r n_c}{n}$

where
$n_r$ = total number in the row
$n_c$ = total number in the column
$n$ = total sample size


Statistics for Cross-Tab: Chi-Square

From Table 3 in the Statistical Appendix, the probability of exceeding a chi-square value of 3.841 is 0.05.

The calculated chi-square is 3.333. Since this is less than the critical value of 3.841, the null hypothesis cannot be rejected.

Thus, the association is not statistically significant at the 0.05 level.
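The calculated value of 3.333 can be reproduced from the gender by Internet usage counts in Table 15.3, assuming that is the table the slide refers to. The sketch uses the expected-frequency formula (row total times column total, divided by n) and the standard chi-square sum of (f_o - f_e)^2 / f_e over all cells, which is not shown on the slides as extracted. The critical value 3.841 is taken from the slide and corresponds to one degree of freedom for a 2 x 2 table.

```python
# Observed counts from Table 15.3: rows = Light, Heavy; columns = Male, Female.
observed = [[5, 10],
            [10, 5]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, f_o in enumerate(row):
        f_e = row_totals[i] * col_totals[j] / n   # expected under no association
        chi_square += (f_o - f_e) ** 2 / f_e

CRITICAL_VALUE = 3.841   # 0.05 level, quoted on the slide
reject_h0 = chi_square > CRITICAL_VALUE

print(f"chi-square = {chi_square:.3f}, reject H0: {reject_h0}")
```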