Data Prep and Descriptive Stats

Upload: sunny-ramesh-sadnani

Post on 04-Jun-2018


  • 8/13/2019 Data Prep and Descriptive Stats

    1/51

    Data Preparation

    Steps in Data Preparation

    Editing

    Coding

    Entering Data

    Data Tabulation

    Reviewing Tabulations

    Statistically adjusting the data

    Editing

    Carefully checking survey data for:

    Completeness (no omissions)

    Legibility (non-ambiguous)

    Right informant

    Consistency (e.g. charging something when the person does not own a charge card)

    Accuracy

    The most important purpose is to eliminate, or at least reduce, the number of errors in the raw data.
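    The completeness and consistency checks above can be sketched in a few lines of Python. This is only an illustration: the field names (`id`, `owns_charge_card`, `charged_purchase`) and the sample records are hypothetical, not taken from any real survey.

```python
# Hypothetical field names; a real survey would list its own required items.
REQUIRED = ("id", "owns_charge_card", "charged_purchase")


def edit_checks(record):
    """Return a list of editing problems found in one survey record."""
    problems = []
    # Completeness: every required field must have an answer.
    for field in REQUIRED:
        if record.get(field) in (None, ""):
            problems.append(f"missing: {field}")
    # Consistency: charging a purchase requires owning a charge card.
    if record.get("charged_purchase") == "yes" and record.get("owns_charge_card") == "no":
        problems.append("inconsistent: charged a purchase without a charge card")
    return problems


surveys = [
    {"id": 1, "owns_charge_card": "yes", "charged_purchase": "yes"},
    {"id": 2, "owns_charge_card": "no", "charged_purchase": "yes"},
    {"id": 3, "owns_charge_card": "", "charged_purchase": "no"},
]

# Keep only the records that raised at least one problem.
flagged = {s["id"]: edit_checks(s) for s in surveys if edit_checks(s)}
```

    Records that end up in `flagged` are the ones an editor would review by hand.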

    Solutions

    1. Ideally, re-interview the respondent.

    2. Eliminate all unacceptable surveys (casewise deletion), if the sample is large and few surveys are unacceptable.

    3. In each calculation, consider only the cases with complete responses (pairwise deletion); this means that some statistics will be based on different sample sizes.

    4. Code illegible or missing answers into a "no valid response" category.

    5. Substitute a neutral value, typically the mean response to the variable, so that the mean remains unchanged.
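    As a rough illustration of casewise deletion, pairwise deletion, and mean substitution, here is a minimal Python sketch. The question names `q1` and `q2`, the use of `None` as the missing marker, and the data are all assumptions made for the example.

```python
from statistics import mean

MISSING = None  # hypothetical marker for an answer that was not given

# Hypothetical survey: one dict per respondent.
responses = [
    {"q1": 4, "q2": 5},
    {"q1": 3, "q2": MISSING},
    {"q1": MISSING, "q2": 2},
    {"q1": 5, "q2": 4},
]


def casewise_delete(rows):
    """Casewise deletion: keep only respondents who answered every question."""
    return [r for r in rows if all(v is not MISSING for v in r.values())]


def pairwise_mean(rows, question):
    """Pairwise deletion: use every respondent who answered this one question."""
    return mean(r[question] for r in rows if r[question] is not MISSING)


def mean_substitute(rows, question):
    """Replace missing answers to one question with that question's mean."""
    fill = pairwise_mean(rows, question)
    return [{**r, question: fill if r[question] is MISSING else r[question]} for r in rows]


complete = casewise_delete(responses)       # two respondents remain
q1_mean = pairwise_mean(responses, "q1")    # based on three answers
filled = mean_substitute(responses, "q1")   # four answers, same mean as q1_mean
```

    Note that `q1_mean` is based on three respondents while a mean for `q2` would be based on a different three, which is the "different sample sizes" point made in item 3.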

    Coding

    The process of systematically and consistently assigning each response a numerical score.

    The key to a good coding system is for the coding categories to be mutually exclusive and the entire system to be collectively exhaustive.

    To be mutually exclusive, every response must fit into only one category.

    To be collectively exhaustive, all possible responses must fit into one of the categories.

    Exhaustive means that you have covered the entire range of the variable with your measurement.

    Coding

    Coding Missing Numbers: When respondents fail to complete portions of the survey.

    Whatever the reason for incomplete surveys, you must indicate that there was no response provided by the respondent.

    For single-digit responses code as 9; for two-digit responses code as 99.

    Coding Open-Ended Questions: When open-ended questions are used, you must create categories.

    All responses must fit into a category.

    Similar responses should fall into the same category.

    e.g. Who services your car? ______________

    Possible categories: self, garage, husband, wife, friend, relative, etc.

    To make it collectively exhaustive, add an "other" or "none of the above" category.

    Only a few responses (i.e. < 10%) should fit into this category.
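    One way to apply such a category scheme, and to check the 10% guideline for the catch-all category, is sketched below. The numeric codes and the sample answers are hypothetical; only the category names come from the example above.

```python
# Hypothetical numeric codes for "Who services your car?".
CATEGORY_CODES = {"self": 1, "garage": 2, "husband": 3, "wife": 4, "friend": 5, "relative": 6}
OTHER_CODE = 7  # catch-all that makes the scheme collectively exhaustive


def code_response(answer):
    """Return the numeric code for one free-text answer."""
    return CATEGORY_CODES.get(answer.strip().lower(), OTHER_CODE)


def other_share(answers):
    """Fraction of answers that fell into the catch-all category."""
    codes = [code_response(a) for a in answers]
    return codes.count(OTHER_CODE) / len(codes)


answers = ["Self", "garage", " Garage ", "wife", "friend", "self", "relative",
           "husband", "garage", "self", "neighbour's kid", "garage"]

share = other_share(answers)
# If 10% or more of the answers land in "other", the scheme needs more categories.
needs_new_category = share >= 0.10
```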

    Precoded Questionnaires: Sometimes you can place codes on the actual questionnaire, which simplifies data entry.

    This:

    Are you: Male Female

    How satisfied are you with our product?

    ___Very Satisfied

    ___Somewhat Satisfied

    ___Somewhat Dissatisfied

    ___Very Dissatisfied

    ___No opinion

    Becomes this:

    Are you: (1) Male (2) Female

    How satisfied are you with our product?

    _1__Very Satisfied

    _2__Somewhat Satisfied

    _3__Somewhat Dissatisfied

    _4__Very Dissatisfied

    _5__No opinion

    1. Are you solely responsible for taking care of your automotive service needs? ___ Yes ___ No

    2. If No, who performs the simple maintenance? ___________

    3. If scheduled maintenance is done on your automobile, how do you keep track of what has been done?

    Not tracked

    Auto dealer records

    Mental recollection

    Other

    4. How often is your automobile serviced?

    Once per month

    Once every three months

    Once every six months

    Once per year

    Other _______________

    Code Book

    Col. No.  Question No.  Question Des.                    Range of permissible values
    1-3       N/A           ID #                             001-200
    4         1             Responsible for maintenance      0=no, 1=yes, 9=blank
    5         2             Perform simple maintenance       0=husband, 1=boyfriend, 2=father, 3=mother, 4=relative, 5=friend, 6=other, 9=blank
    5         3             How maintenance tracked          0=not tracked, 1=auto dealer records, 2=personal records, 3=mental recollection, 4=other, 9=blank
    6         4             How often maintenance performed  Once per: 0=month, 1=3 months, 2=6 months, 3=year, 4=other, 9=blank
    7         4             Other for how often

    In questions that permit multiple responses, each possible response option should be assigned a separate column.

    6. Which magazines do you read? Choose all that apply.

    Time    National Geographic

    Readers Digest    Chatelaine

    MacLean's

    Col. No.  Question No.  Question Des.   Range of permissible values
    15        6             Time            0=read, 1=not read
    16        6             Readers Dig.    0=read, 1=not read
    17        6             MacLean's       0=read, 1=not read
    18        6             National Geo.   0=read, 1=not read
    19        6             Chatelaine      0=read, 1=not read
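    The one-column-per-option rule can be illustrated with a short Python sketch. The function name and the example respondent are hypothetical; the column order and the 0 = read, 1 = not read convention follow the code book above.

```python
# Column order follows the code book (columns 15 to 19).
MAGAZINES = ["Time", "Readers Digest", "MacLean's", "National Geographic", "Chatelaine"]
READ, NOT_READ = 0, 1  # the code book's convention: 0 = read, 1 = not read


def code_multiple_response(ticked):
    """Return one code per magazine, in column order, for one respondent."""
    ticked = set(ticked)
    unknown = ticked - set(MAGAZINES)
    if unknown:
        raise ValueError(f"not in the code book: {sorted(unknown)}")
    return [READ if magazine in ticked else NOT_READ for magazine in MAGAZINES]


# A hypothetical respondent who ticked two of the five options.
row = code_multiple_response(["Time", "Chatelaine"])
```

    Each respondent always produces five codes, so the data file keeps a fixed width no matter how many boxes were ticked.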

    For rank-order questions, separate columns are also needed.

    7. Please rank the following brands of toothpaste in order of preference (1-5).

    Crest    Colgate

    Aquafresh    Arm & Hammer

    Pepsodent

    Col. No.  Question No.  Question Des.    Range of permissible values
    20        7             Crest rank       0=blank, 1=most important, 2=2nd most important, 3=third, 4=fourth, 5=fifth
    21        7             Colgate rank     0=blank, 1=most important, 2=2nd most important, 3=third, 4=fourth, 5=fifth
    22        7             Aquafresh rank   0=blank, 1=most important, 2=2nd most important, 3=third, 4=fourth, 5=fifth
    23        7             A & H rank       0=blank, 1=most important, 2=2nd most important, 3=third, 4=fourth, 5=fifth
    25        7             Pepsodent rank   0=blank, 1=most important, 2=2nd most important, 3=third, 4=fourth, 5=fifth

    Entering Data

    Problems can occur during data entry, such as transposing numbers and inputting an infeasible code (e.g. out of range).

    E.g. if a score is on a range of 1-5, then 0, 6, 7, and 8 are unacceptable or out of range (possibly due to a transcription error).

    Always check the data-entry work.
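    A simple way to check data-entry work is to compare every entered code against its permissible values. The sketch below assumes a hypothetical two-column code book, with 9 used as the blank code; the column names and records are made up for illustration.

```python
# Hypothetical code book: column name -> set of permissible codes (9 = blank).
PERMISSIBLE = {
    "satisfaction": {1, 2, 3, 4, 5, 9},
    "responsible": {0, 1, 9},
}


def out_of_range(records):
    """Return (row index, column, value) for every code the code book does not allow."""
    problems = []
    for i, record in enumerate(records):
        for column, value in record.items():
            if value not in PERMISSIBLE[column]:
                problems.append((i, column, value))
    return problems


entered = [
    {"satisfaction": 4, "responsible": 1},
    {"satisfaction": 6, "responsible": 0},  # 6 is outside the 1-5 scale
    {"satisfaction": 9, "responsible": 2},  # 2 is not a valid yes/no code
]

problems = out_of_range(entered)
```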

    Descriptive Statistics

    Five types of statistical analysis

    Descriptive: What are the characteristics of the respondents?

    Inferential: What are the characteristics of the population?

    Differences: Are two or more groups the same or different?

    Associative: Are two or more variables related in a systematic way?

    Predictive: Can we predict one variable if we know one or more other variables?

    Descriptive Statistics

    Summarization of a collection of data in a clear and understandable way

    The most basic form of statistics

    Lays the foundation for all statistical knowledge

    The tradeoff in descriptive statistics

    If you use fewer statistics to describe the distribution of a

    variable, you lose information but gain clarity.

    When should one use fewer statistics?

    When dropping the number of statistics would leave more information per remaining statistic.

    When the information you drop is unimportant to one's research question.

    Type of measurement                  Type of descriptive analysis
    Nominal, two categories              Frequency table; proportion (percentage)
    Nominal, more than two categories    Frequency table; category proportions (percentages); mode
    Ordinal                              Rank order; median
    Interval                             Arithmetic mean
    Ratio                                Means

    Data Tabulation

    Tabulation: The organized arrangement of data in a table format that is easy to read and understand.

    Tabulate the data to count the number of responses to each question.

    Simple Tabulation: The tabulating of results of only one variable; it informs you how often each response was given.

    Frequency Distribution: A distribution of data that summarizes the number of times a certain value of a variable occurs and is expressed in terms of percentages.

    Frequency Tables

    The arrangement of statistical data in a row-and-column format that exhibits the count of responses or observations for each category assigned to a variable.

    How many users of a certain brand can be called loyal?

    What percentage of the market are heavy users and light users?

    How many consumers are aware of a new product?

    What brand is the "Top of Mind" of the market?

    More on relative frequency distributions

    Rules for relative frequency distributions:

    Make sure each observation is in one and only one category.

    Use categories of equal width.

    Choose an appealing number of categories.

    Provide labels

    Double-check your graph.

    Definitions:

    A histogram is a relative frequency distribution of a quantitative variable.

    A bar graph is a relative frequency distribution of a qualitative variable.

    How did you find your last job?

    Source                     Count    Percent
    Networking                 643      55.2 %
    Print ad                   213      18.3 %
    Online recruitment site    179      15.4 %
    Placement firm             112      9.6 %
    Temporary agency           18       1.5 %

    [Figure: WebSurveyor bar chart of the counts above; horizontal axis from 0 to 700.]
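    The percentages in the chart are relative frequencies: each count divided by the total number of responses. A minimal Python sketch using the counts from the chart (the function name is hypothetical):

```python
# Counts taken from the "How did you find your last job?" chart.
counts = {
    "Networking": 643,
    "Print ad": 213,
    "Online recruitment site": 179,
    "Placement firm": 112,
    "Temporary agency": 18,
}


def relative_frequencies(table):
    """Convert raw counts to percentages of the total, rounded to one decimal."""
    total = sum(table.values())
    return {category: round(100 * n / total, 1) for category, n in table.items()}


percentages = relative_frequencies(counts)
```

    With 1,165 responses in total, the results reproduce the percentages shown on the chart.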

    How many times per week do you use mouthwash?

    1__ 2__ 3__ 4__ 5__ 6__ 7__

    Responses: 1 1 2 2 2 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5 5 5 6 6 6 7 7

    Times per week    Frequency
    1                 2
    2                 3
    3                 5
    4                 7
    5                 5
    6                 3
    7                 2

    [Figure: histogram of the frequencies above; horizontal axis 1 to 7, vertical axis 0 to 7.]
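    Tallying raw responses into a frequency table like this one is a one-liner with Python's standard `collections.Counter`. The variable names below are illustrative; the data are the 27 mouthwash responses listed above.

```python
from collections import Counter

# The 27 responses to "How many times per week do you use mouthwash?"
responses = [1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4,
             5, 5, 5, 5, 5, 6, 6, 6, 7, 7]

frequency = Counter(responses)

# Simple tabulation: one (value, count) row per response value, in order.
table = [(value, frequency[value]) for value in sorted(frequency)]

# A crude text histogram, one "x" per response.
for value, count in table:
    print(f"{value}  {'x' * count}")
```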

    Normal Distribution

    Curve is basically bell shaped

    [Figure: bell-shaped normal curve; points a and b marked on the horizontal axis.]

    Normal Distributions

    Curve is basically bell shaped, extending from −∞ to +∞.

    Symmetric, with scores more concentrated in the middle (i.e. around the mean) than in the tails.

    Mean, median, and mode coincide.

    They differ in how spread out they are.

    The area under each curve is 1.

    The height of a normal distribution can be specified mathematically in terms of two parameters: the mean (μ) and the standard deviation (σ).
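    The slide does not spell out the formula for the height. The standard normal density is f(x) = exp(−(x − μ)² / (2σ²)) / (σ√(2π)), and the sketch below uses it to check numerically that the curve is symmetric and that the area stays at 1 however spread out the curve is. The function names and the integration settings are choices made for this example.

```python
import math


def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def area_under_curve(mu, sigma, steps=20000):
    """Midpoint-rule approximation of the area from mu - 8*sigma to mu + 8*sigma."""
    lo, hi = mu - 8 * sigma, mu + 8 * sigma
    width = (hi - lo) / steps
    heights = (normal_pdf(lo + (i + 0.5) * width, mu, sigma) for i in range(steps))
    return sum(heights) * width


narrow = area_under_curve(mu=0.0, sigma=1.0)  # tall, narrow curve
wide = area_under_curve(mu=0.0, sigma=3.0)    # low, wide curve
```

    Both areas come out at approximately 1: a larger σ lowers and widens the curve without changing the total area.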

    Kurtosis: how peaked a distribution is. A value of zero indicates a normal distribution, positive numbers indicate a more peaked distribution, and negative numbers indicate a flatter distribution.

    [Figure: a peaked distribution and a flat distribution.]
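    To make the sign convention concrete, the sketch below computes excess kurtosis from the second and fourth moments about the mean, so a normal distribution scores zero. This is the plain moment-based definition; statistical packages often report a bias-adjusted version whose values differ slightly. Both data sets are invented for illustration.

```python
def excess_kurtosis(values):
    """Fourth moment over squared second moment, minus 3 (zero for a normal curve)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3


# Hypothetical data: most answers at 4 with two extremes (peaked) ...
peaked = [4] * 20 + [1, 7]
# ... versus answers spread evenly over 1 to 7 (flat).
flat = [1, 2, 3, 4, 5, 6, 7] * 3

peaked_score = excess_kurtosis(peaked)  # positive
flat_score = excess_kurtosis(flat)      # negative
```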

    Thanks, Scott!

    Summary statistics

    Central tendency

    Dispersion or variability: a quantitative measure of the degree to which scores in a distribution are spread out or are clustered together.

    Descriptive Analysis: Measures of Central Tendency

    Mode: the number that occurs most often in a string (nominal data)

    Median: half of the responses fall above this point, half fall below this point (ordinal data)

    Mean: the average (interval/ratio data)

    Mode

    The most frequent category, e.g. users 25%, non-users 75%.

    Advantages:

    Meaning is obvious.

    The only measure of central tendency that can be used with nominal data.

    Disadvantages:

    Many distributions have more than one mode, i.e. are "multimodal".

    Greatly subject to sample fluctuations.

    Therefore not recommended to be used as the only measure of central tendency.

    Median

    The middle observation of the data.

    Number of times per week consumers use mouthwash:

    1 1 2 2 2 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5 5 5 6 6 6 7 7

    [Figure: frequency distribution of mouthwash use per week, from light user to heavy user, with the mode and median marked.]
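    For the mouthwash data, all three measures of central tendency can be computed with Python's standard `statistics` module. This is a small sketch; the variable names are illustrative.

```python
import statistics

# Number of times per week consumers use mouthwash (27 respondents).
uses_per_week = [1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4,
                 5, 5, 5, 5, 5, 6, 6, 6, 7, 7]

mode_value = statistics.mode(uses_per_week)      # most frequent answer
median_value = statistics.median(uses_per_week)  # middle (14th) observation
mean_value = statistics.mean(uses_per_week)      # sum of scores / number of scores
```

    Because this distribution is symmetric, the mode, median, and mean all equal 4.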

    The Mean (average value)

    Sum of all the scores divided by the number of scores.

    A good measure of central tendency for roughly symmetric distributions.

    Can be misleading in skewed distributions, since it can be greatly influenced by extreme scores; in that case other statistics, such as the median, may be more informative.

    Formula: $\mu = \sum X / N$ (population)

    $\bar{X} = \sum x_i / n$ (sample)

    where $\mu$ / $\bar{X}$ is the population/sample mean and $N$ / $n$ is the number of scores.

    Normal Distributions with different Mean

    [Figure: normal curves with different means along the horizontal axis.]

    Measures of Dispersion or Variability

    Minimum, Maximum, and Range (highest value minus the lowest value)

    Variance

    Standard Deviation (a measure of distance from the mean)

    Distribution of Final Course Grades in MGMT 3220Y

    Grade        F    D    C    B    A
    Frequency    3    10   20   23   12

    [Figure: bar chart of frequency (0 to 25) by grade, with the range and the points 1 SD below and 1 SD above the mean marked.]

    Variance

    The difference between an observed value and the mean is called the deviation from the mean.

    The variance is the mean squared deviation from the mean, i.e. you take the difference between each value and the mean, square each result, and then take the average.

    Because it is squared, it can never be negative.

    $\sigma^2 = \sum (\bar{X} - x_i)^2 / n$

    Measures of Dispersion

    Suppose we are testing the new flavor of a fruit punch.

    Scale: Dislike 1 2 3 4 5 Like

    Data (respondents 1 to 6): 3, 5, 3, 5, 3, 5

    $\bar{X} = 4$, $\sigma^2 = 1$, $S = 1$

    $\sigma^2 = \sum (\bar{X} - x_i)^2 / n$, $S = \sqrt{\sum (\bar{X} - x_i)^2 / n}$

    Measures of Dispersion

    Scale: Dislike 1 2 3 4 5 Like

    Data (respondents 1 to 6): 5, 4, 5, 5, 5, 4

    $\bar{X} = 4.67$, $\sigma^2 = 0.22$, $S = 0.47$ (dividing by $n - 1$ instead of $n$ gives 0.27 and 0.52)

    $\sigma^2 = \sum (\bar{X} - x_i)^2 / n$, $S = \sqrt{\sum (\bar{X} - x_i)^2 / n}$

    Measures of Dispersion

    Scale: Dislike 1 2 3 4 5 Like

    Data (respondents 1 to 6): 1, 5, 1, 5, 1, 5

    $\bar{X} = 3$, $\sigma^2 = 4$, $S = 2$

    $\sigma^2 = \sum (\bar{X} - x_i)^2 / n$, $S = \sqrt{\sum (\bar{X} - x_i)^2 / n}$
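    The three fruit-punch examples can be checked with a few lines of Python. This sketch divides by n, as in the formula on the slides; the function names and dictionary labels are made up for the example.

```python
import math


def variance(values):
    """Mean squared deviation from the mean, dividing by n."""
    n = len(values)
    mean = sum(values) / n
    return sum((mean - v) ** 2 for v in values) / n


def std_dev(values):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(variance(values))


# Ratings on the 1 (dislike) to 5 (like) scale from the three examples.
ratings = {
    "threes and fives": [3, 5, 3, 5, 3, 5],
    "fours and fives": [5, 4, 5, 5, 5, 4],
    "ones and fives": [1, 5, 1, 5, 1, 5],
}

# (mean, variance, standard deviation) for each set of ratings.
summary = {
    name: (sum(v) / len(v), variance(v), std_dev(v))
    for name, v in ratings.items()
}
```

    The more the ratings disagree, the larger the variance and the standard deviation, even when the means are close.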

    Normal Distributions with different SD

    [Figure: three normal curves with different standard deviations.]

    Cross Tabulation

    A statistical technique that involves tabulating the results of two or more variables simultaneously.

    Informs you how often each response was given.

    Shows relationships among and between variables.

    The frequency distribution for each subgroup is compared to the frequency distribution for the total sample.

    Variables must be nominally scaled.

    Cross-tabulation

    Helps answer questions about whether two or more variables of interest are linked:

    Is the type of mouthwash user (heavy or light) related to gender?

    Is the preference for a certain flavor (cherry or lemon) related to the geographic region (north, south, east, west)?

    Is income level associated with gender?

    Cross-tabulation determines association, not causality.

    Dependent and Independent Variables

    The variable being studied is called the dependent variable or response variable.

    A variable that influences the dependent variable is called an independent variable.

    Cross-tabulation

    Cross-tabulation of two or more variables is possible if the variables are discrete:

    The frequency of one variable is subdivided by the other variable's categories.

    Generally a cross-tabulation table has:

    Row percentages

    Column percentages

    Total percentages

    Which one is better? It depends on which variable is considered independent.

    Contingency Table

    A contingency table shows the joint distribution of two discrete variables.

    This distribution represents the probability of observing a case in each cell.

    Probability is calculated as:

    $P = \dfrac{\text{Observed cases}}{\text{Total cases}}$

    Cross tabulation

    GROUPINC * Gender Crosstabulation

    GROUPINC (income)                       Gender (col. 1)   Gender (col. 2)   Total
    Income group 1   Count                  10                9                 19
                     % within GROUPINC      52.6%             47.4%             100.0%
                     % within Gender        55.6%             18.8%             28.8%
                     % of Total             15.2%             13.6%             28.8%
    Income group 2   Count                  5                 25                30
                     % within GROUPINC      16.7%             83.3%             100.0%
                     % within Gender        27.8%             52.1%             45.5%
                     % of Total             7.6%              37.9%             45.5%
    Income group 3   Count                  3                 14                17
                     % within GROUPINC      17.6%             82.4%             100.0%
                     % within Gender        16.7%             29.2%             25.8%
                     % of Total             4.5%              21.2%             25.8%
    Total            Count                  18                48                66
                     % within GROUPINC      27.3%             72.7%             100.0%
                     % within Gender        100.0%            100.0%            100.0%
                     % of Total             27.3%             72.7%             100.0%
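    The row, column, and total percentages in a cross-tabulation all come from the same cell counts, divided by different totals. Here is a minimal Python sketch using the counts from the GROUPINC by Gender table; the function names are hypothetical.

```python
# Cell counts: rows are the three GROUPINC levels, columns the two gender categories.
counts = [
    [10, 9],
    [5, 25],
    [3, 14],
]


def percent(part, whole):
    """Percentage of the whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


def crosstab_percentages(table):
    """Return row, column, and total percentages for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return {
        "row": [[percent(c, row_totals[i]) for c in row] for i, row in enumerate(table)],
        "column": [[percent(c, col_totals[j]) for j, c in enumerate(row)] for row in table],
        "total": [[percent(c, grand_total) for c in row] for row in table],
    }


result = crosstab_percentages(counts)
```

    Row percentages correspond to "% within GROUPINC", column percentages to "% within Gender", and total percentages to "% of Total", which is also the cell probability P = observed cases / total cases.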

    Chi-square Test for Independence

    The Chi-square test for independence determines whether two variables are associated or not.

    H0: Two variables are independent

    H1: Two variables are not independent

    Chi-square test results are unstable if the expected count in a cell is lower than 5.
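    As an illustration, the chi-square statistic can be computed by hand for the GROUPINC by Gender counts from the cross tabulation. The sketch below is not how any particular statistics package does it: it uses the textbook formula, the sum of (observed − expected)² / expected, and compares the result with 5.991, the standard table critical value for a 5% significance level and 2 degrees of freedom. The choice of 5% is an assumption made for the example.

```python
def chi_square_independence(table):
    """Return (statistic, degrees of freedom, expected counts) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected count under H0 (independence): row total * column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof, expected


observed = [[10, 9], [5, 25], [3, 14]]
statistic, dof, expected = chi_square_independence(observed)

CRITICAL_05_DF2 = 5.991  # table value for alpha = 0.05 and 2 degrees of freedom
reject_h0 = statistic > CRITICAL_05_DF2

# Rule-of-thumb check: flag the table if any expected count is below 5.
smallest_expected = min(min(row) for row in expected)
```

    Here the statistic is about 8.66 with 2 degrees of freedom, so H0 would be rejected at the 5% level. One expected count is slightly below 5, however, so the result should be read with some caution.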