BY: MISS LUCIA KULATAU & CHRISTOPHER Data Analysis & Reporting DAR611S FM


Data Analysis & Reporting DAR611S – FM

Unit 1: Introduction to Data Analysis

Objectives:

After completion of this unit you will be able to:

Define data

Understand the types of data and their differences

Know what raw data is

What is Data Analysis and Reporting?

It is a process of organising, analysing and interpreting data into meaningful information. This process is driven by getting data from the right source or data set, which is to be used to make a decision or reach a conclusion.

What is data? Data is the plural form of the Latin word datum.

Data means given or fact.

- For instance, if I ask you your name, your answer to my question is a fact.

- If I go on to ask you your age, your place of birth, your religion, the political party you belong to, your height, the number of children you have, etc., the answers you provide to my questions are all facts concerning yourself. Hence, they are data.

Data is raw information which still needs to be processed by statistical tools to make sense.

Difference between Data and Information

Data: raw, still requires processing.

Processing: done with statistical tools.

Information: what the data are presented as after processing.

Data means nothing until it is processed.

Types of Data

Data comes in several forms

Knowing the type of data helps to decide how best to collect it.

It also determines the appropriate ways to display it.

Data are either qualitative or quantitative.

Qualitative versus Quantitative

Previously we posed several questions, the answers to which provided us with 'data'.

The questions were name, age, place of birth, religion, political party of affiliation, height, and number of children. The answers to those questions can be summarised into two categories:

(i). those that are numerical (i.e. involve numbers); and

(ii). those that are non-numeric (i.e. do not involve numbers).

Quantitative Data

Quantitative data are so called in the literal sense that some quantity has to be mentioned.

Example - Questions referred to earlier that yield quantitative data are age, height and number of children. One has to mention some quantity in answering each of those questions, even if that quantity happens to be zero, e.g. in mentioning the number of children one has.

Qualitative

Qualitative (or categorical) data are non-numeric; all they do is spell out some quality or category of the object being referred to.

Example: The rest of the questions yield only qualitative (or categorical) data, e.g. name, place of birth, religion and one's political affiliation. No quantity is entailed in answering any of those questions. Just one's 'quality' or 'category' is required.

Discrete versus Continuous Data

Height and age can take on decimals and fractions, while the answer to the last question (viz. number of children) can only be a whole number, i.e. an integer. While somebody can be 22 ½ years old and 1.78 metres tall, one can never have 3.5 children, or anything like that. It can only be 0, 1, 2, 3, etc. children.

Quantitative data that can assume only whole numbers are said to be discrete.

Quantitative data that can take on decimals and fractions are said to be continuous.

In order to have discrete or continuous data, the data themselves have to be quantitative in the first place.
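The two classifications above can be sketched in a few lines of Python. This is only an illustration: the respondent's answers are made up, and using the Python type of each answer (text, whole number, decimal) as a stand-in for the type of data is a simplifying assumption, not a general rule.

```python
def classify_answer(value):
    """Classify one answer as qualitative, or as discrete/continuous quantitative."""
    if isinstance(value, bool):
        # Yes/no answers are categories, even though Python stores them as numbers.
        return "qualitative"
    if isinstance(value, int):
        return "quantitative, discrete"    # whole numbers only
    if isinstance(value, float):
        return "quantitative, continuous"  # decimals and fractions allowed
    return "qualitative"                   # names, places, categories


# Hypothetical answers to the questions used in this unit.
respondent = {
    "name": "Maria",
    "place of birth": "Windhoek",
    "age": 22.5,
    "height in metres": 1.78,
    "number of children": 0,
}

classification = {q: classify_answer(a) for q, a in respondent.items()}
for question, kind in classification.items():
    print(f"{question}: {kind}")
```

Note that the check for quantitative data comes first: only a numeric answer is then split into discrete or continuous.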

Primary versus Secondary Data

Primary data are data we collect ourselves and use for the very purpose we collected them for.

Secondary data, on the other hand, are data that were collected by somebody else and that we just adapt to our current needs.

NB: The distinction between the two lies in two conditions:

Who collected the data?

How are the data used?

Example of Primary Data

I may conduct a survey in Katutura to study the 'Living conditions of the formerly disadvantaged groups of Namibians'. The report I write on that will be based on primary data, since it is I myself who collected the data and I used the data for the very purpose I collected them for.

Note, however, that saying 'I' collected the data myself does not necessarily mean that I went personally to the various households to collect the data. I may just have hired some research assistants to do the job for me, but since I am the one behind the survey I am held responsible for whatever the outcome of the exercise.

Hence, I will be said to have collected the data although I may not have set foot in Katutura. Therefore the data are primary, since both conditions are satisfied.

Example of Secondary Data

Now, suppose after writing the report I decide to write also on 'Patterns of income and expenditure among the formerly disadvantaged groups in Namibia'.

In my earlier survey I may have collected some data on income and expenditure, which I may decide to adapt for the present need. In this case the data will be considered secondary since the use they have been put to is different from the original intended purpose.

Indeed the data may not even suffice for the new objective since they were not collected for that purpose.

Types of Data

Raw Data

Raw data (sometimes called source data or atomic data) is data that has not been processed for use.

A distinction is sometimes made between data and information to the effect that information is the end product of data processing.

Example of Raw Data

Although raw data has the potential to become "information," it requires selective extraction, organization, and sometimes analysis and formatting for presentation. For example, a point-of-sale terminal (POS terminal) in a busy supermarket collects huge volumes of raw data each day, but that data doesn't yield much information until it is processed.

Once processed, the data may indicate the particular items that each customer buys, when they buy them, and at what price. Such information can be further subjected to predictive technology analysis to help the owner plan future marketing campaigns. As a result of processing, raw data sometimes ends up in a database, which enables the data to become accessible for further processing and analysis in a number of different ways.
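The supermarket example can be sketched in Python as follows. The transactions, item names and prices are invented for illustration; a real point-of-sale system would record far more fields. The sketch only shows the step from raw records to a small piece of information (units sold and revenue per item).

```python
from collections import defaultdict

# Hypothetical raw POS records: (item, quantity, unit price).
raw_transactions = [
    ("bread", 2, 12.50),
    ("milk", 1, 18.00),
    ("bread", 1, 12.50),
]


def summarise_sales(transactions):
    """Turn raw transaction records into units sold and revenue per item."""
    units = defaultdict(int)
    revenue = defaultdict(float)
    for item, quantity, unit_price in transactions:
        units[item] += quantity
        revenue[item] += quantity * unit_price
    return dict(units), dict(revenue)


units_sold, revenue_per_item = summarise_sales(raw_transactions)
print(units_sold)
print(revenue_per_item)
```

The raw list on its own says little; the summary is what a manager could act on.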

Tools to interpret raw Data

There are numerous statistical techniques and tools that are available for use.

The type of data one is dealing with has a lot to do with the technique or tool chosen to analyse it.

There are two broad families of techniques: parametric and nonparametric tests.

Here is a summary of the major points you have learnt:

Data are facts and statistics collected together for reference or analysis purposes.

Data can be either numeric (quantitative) or non-numeric (qualitative).

Data that has not been processed for use is called raw data.

Quantitative data are available in two types: discrete or continuous.

Data can be readily available (secondary), or you can collect them yourself for the first time and use them for the purpose you collected them for (primary).

Homework

The following variables relate to the Hosea Kutako International Airport. Management is collecting the information with a view to upgrading services at the airport. Which variables yield discrete data and which ones yield continuous data?

Average mid-day temperature on the runway.

Weight carried by a landing aeroplane.

Number of passengers going through the departure lounge daily.

Number of planes landing in a day.

Length of delay time for international flights.

Length of queues at check-in points on a normal day.

Number of passengers left behind because of overbooking.

The number of VIPs using the airport in a month.

The average number of crew per departing flight.

Number of complaints about missing luggage per arriving flight.

Lesson of the Day

Opportunities don't happen, you create them. ~Chris Grosser

Unit 2: Parametric and Non-parametric

Objectives

Upon completion of this unit you will be able to:

distinguish between parametric and non-parametric analysis

relate to the different nonparametric tests

apply the different nonparametric tests

Normal Distribution

Features of a normal distribution

Its distribution has a bump in the middle, with tails going down and out to the left and right (bell shaped)

Its shape is Symmetric

Its mean is equal to its median

The total area under the curve is 1 (or 100%)

Every normal distribution has the same shape as the standard normal distribution, for which μ = 0 and σ = 1

It plots all of its values in a symmetrical fashion, and most of the values are situated around the mean.

Values are equally likely to plot either above or below the mean
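These features can be checked numerically with Python's standard `statistics.NormalDist` class. The distribution of heights below (mean 170, standard deviation 10) is a made-up example, not data from this course.

```python
from statistics import NormalDist

# Hypothetical normal distribution of heights in centimetres.
heights = NormalDist(mu=170, sigma=10)
standard = NormalDist()  # the standard normal: mu = 0, sigma = 1

# Symmetry: half of the area lies below the mean, so mean = median.
area_below_mean = heights.cdf(heights.mean)

# Same shape: after converting to a z-score, the standard normal
# gives the same probability as the original distribution.
z = (185 - heights.mean) / heights.stdev
p_original = heights.cdf(185)
p_standard = standard.cdf(z)

# Most values lie around the mean: area within one standard deviation.
within_one_sd = heights.cdf(180) - heights.cdf(160)

print(area_below_mean, z, p_original, p_standard, within_one_sd)
```

Roughly 68% of the area falls within one standard deviation of the mean, which is what "most of the values are situated around the mean" amounts to in numbers.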

Violation of Normality

Use of a histogram: a histogram is a visual summary of the distribution of values.

Skewness

Kurtosis

Ways in which data can deviate from normal:

Lack of symmetry (skewness): the most frequent scores (the tall bars on the graph) are clustered at one end of the scale. Data can be positively or negatively skewed.

Pointiness or peakedness (kurtosis)

When there are outliers

Limits of detection

When the outcome is an ordinal variable or a rank
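Skewness and kurtosis can be put into numbers. The sketch below uses the simple moment-based definitions (third and fourth central moments divided by powers of the variance); statistical packages often apply small-sample corrections, so their output may differ slightly. The two data sets are invented.

```python
def central_moment(data, k):
    """k-th central moment: the average of (x - mean) ** k."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** k for x in data) / len(data)


def skewness(data):
    """0 for symmetric data, positive for a long right tail, negative for a long left tail."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    """Peakedness relative to a normal distribution, which scores 0."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]  # one large value pulls the tail to the right

print(skewness(symmetric), skewness(right_skewed))
print(excess_kurtosis(symmetric))
```

A skewness far from 0, or an excess kurtosis far from 0, is a hint that the normality assumption may not hold.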

What are parametric tests?

Parametric tests are the traditional tests in hypothesis testing.

They depend on the specification of a probability distribution.

Parametric tests are said to depend on distributional assumptions

A sample statistic is obtained to estimate the population parameter.

When the assumptions are not met in the sample data, the statistic may not be a good estimate of the parameter.

Assumptions

Observations are independent.

The sample data have a normal distribution.

Scores in different groups have homogeneous variances

Examples of parametric tests

T test

F test

Z test

ANOVA

What are nonparametric tests

Nonparametric tests do not depend on a specified probability distribution.

They do not rely on the restrictive assumptions of parametric tests.

They do not assume that data come from a normal distribution

Nonparametric methods are normally used when:

the distribution is not known,

the data are ordinal or nominal rather than interval or ratio scaled, or

the sample size is small.

Types of Nonparametric Statistical Tests

We can categorise nonparametric statistical tests into two groups:

Single sample tests and

Two or more sample tests

Single sample tests

Many of these tests require that the data are ordinal and can be ranked.

1. Wilcoxon Signed Ranks Test

Used to determine whether or not the median value of a sample equals a specified value.

Example:

Let θ represent the population median and μ the specified (hypothesized) value.

Wilcoxon Signed Ranks Test cont…

Now let us construct our null and alternative hypotheses.

H0: θ = μ

HA: θ ≠ μ or θ > μ or θ < μ

I listed three alternative hypotheses; hopefully you all know enough about statistical testing to know that we specify only one of the three.

Wilcoxon Signed Ranks Test cont…

Assumptions:

The sample has been randomly selected from the population it represents.

The original scores are in interval/ratio format.

The underlying population distribution is symmetric.

Note: The Wilcoxon signed ranks test ranks difference scores, which are the differences between the actual score and the hypothesized median.
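The ranking of difference scores described in the note can be sketched as below. The sample and the hypothesized median of 14 are made up, and the function only computes the two rank sums; turning them into a decision requires a table of critical values or a statistics package, which is not shown here.

```python
def average_ranks(values):
    """Map each distinct value to its rank (1 = smallest); ties share the average rank."""
    first, last = {}, {}
    for position, value in enumerate(sorted(values), start=1):
        first.setdefault(value, position)
        last[value] = position
    return {value: (first[value] + last[value]) / 2 for value in first}


def signed_rank_sums(sample, hypothesized_median):
    """Return (W+, W-): rank sums of the positive and negative difference scores."""
    # Scores equal to the hypothesized median carry no sign and are dropped.
    diffs = [x - hypothesized_median for x in sample if x != hypothesized_median]
    rank_of = average_ranks([abs(d) for d in diffs])
    w_plus = sum(rank_of[abs(d)] for d in diffs if d > 0)
    w_minus = sum(rank_of[abs(d)] for d in diffs if d < 0)
    return w_plus, w_minus


# Hypothetical scores, tested against a hypothesized median of 14.
w_plus, w_minus = signed_rank_sums([12, 15, 9, 20, 17], hypothesized_median=14)
print(w_plus, w_minus)
```

If the hypothesized median is correct, the two sums should be of similar size; a large imbalance is evidence against H0.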

2. Kruskal-Wallis Test

We use Kruskal-Wallis test when we are comparing three or more samples.

The null hypothesis: all populations have identical distribution functions

The alternative hypothesis: at least two of the samples differ with respect to location (median).

Using the Kruskal-Wallis Test, we can decide whether the population distributions are identical without assuming them to follow the normal distribution
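As a rough sketch, the Kruskal-Wallis statistic H can be computed by ranking all observations together and comparing the rank sums of the groups. The formula used below is the usual textbook one without a correction for ties, and the three groups are invented; it is an illustration, not a replacement for a statistics package.

```python
def average_ranks(values):
    """Map each distinct value to its rank (1 = smallest); ties share the average rank."""
    first, last = {}, {}
    for position, value in enumerate(sorted(values), start=1):
        first.setdefault(value, position)
        last[value] = position
    return {value: (first[value] + last[value]) / 2 for value in first}


def kruskal_wallis_h(groups):
    """H = 12 / (N (N + 1)) * sum(R_i ** 2 / n_i) - 3 (N + 1), with no tie correction."""
    pooled = [x for group in groups for x in group]
    n_total = len(pooled)
    rank_of = average_ranks(pooled)
    weighted = sum(sum(rank_of[x] for x in group) ** 2 / len(group) for group in groups)
    return 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)


# Three hypothetical samples that do not overlap at all.
h = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(h)
```

A large H means the groups occupy very different rank positions; H is usually compared with a chi-square critical value with (number of groups minus 1) degrees of freedom.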

3. Kolmogorov-Smirnov goodness of fit test for a single sample (KS Test)

We use this test to determine if a distribution of sample observations conforms to a specified probability distribution.

The distribution may be a theoretical one (like the uniform or normal) or may be one derived from previous empirical observation.

H0: F(X) = F0(X)

HA: F(X) ≠ F0(X)

F0(X) is the specified distribution.

Assumptions:

The variable should be continuous.

Data must be at least ordinal, as a cumulative frequency distribution must be constructed.
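The comparison of cumulative distributions can be sketched as follows. The statistic D is the largest vertical gap between the sample's cumulative frequency distribution and the specified one; here the specified distribution is assumed to be uniform on 0 to 1, and the three observations are made up.

```python
def ks_statistic(sample, specified_cdf):
    """Largest gap between the empirical CDF of the sample and the specified CDF F0."""
    ordered = sorted(sample)
    n = len(ordered)
    largest_gap = 0.0
    for i, x in enumerate(ordered, start=1):
        f0 = specified_cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x; check both sides.
        largest_gap = max(largest_gap, i / n - f0, f0 - (i - 1) / n)
    return largest_gap


def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1] (the assumed F0 in this example)."""
    return min(1.0, max(0.0, x))


d = ks_statistic([0.7, 0.1, 0.4], uniform_cdf)
print(d)
```

A small D means the sample conforms closely to the specified distribution; whether D is large enough to reject H0 depends on the sample size and is read from a table or computed by software.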

4. Chi-square goodness of fit test for a single sample

A test which is used to determine if observed cell frequencies differ from expected frequencies.

It is used with categorical data, e.g. this test can be used to answer questions such as whether or not a die is fair.

H0: Observed cell frequencies are equal to expected cell frequencies for all cells

HA: Observed frequency is not equal to expected frequency for at least one cell

The null hypothesis is false if even one cell frequency is not equal to its expected frequency.

4. Chi-square cont…

Assumptions

Data is categorical/nominal

Random sample of n independent observations

Expected frequency of each cell is at least 5
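The fair-die question can be worked through in a short Python sketch. The 60 observed rolls are invented; the critical value 11.07 is the standard chi-square table value for 5 degrees of freedom at the 5% level.

```python
def chi_square_statistic(observed, expected):
    """Sum over all cells of (observed - expected) ** 2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts of faces 1 to 6 in 60 rolls of a die.
observed = [8, 12, 9, 11, 10, 10]
expected = [sum(observed) / 6] * 6  # a fair die: 10 per face, so every cell is at least 5

statistic = chi_square_statistic(observed, expected)

# Table value for 6 - 1 = 5 degrees of freedom at the 5% significance level.
CRITICAL_VALUE_DF5_5PCT = 11.07
reject_h0 = statistic > CRITICAL_VALUE_DF5_5PCT

print(statistic, reject_h0)
```

Here the statistic is far below the critical value, so these rolls give no reason to doubt that the die is fair.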

5. Binomial sign test for a single sample

We use this test when the data can be categorized into 2 groups. The test determines if the proportion of observations in one group equals a specific value.

H0: π1 = µ

HA: π1 ≠ µ

where π1 refers to the proportion of observations in group 1 and µ refers to the specified value. For example, if the exercise were to determine whether or not a coin is fair, then µ = 0.5.

Assumptions:

Each of n independent observations is randomly selected from a population

Each observation can be classified into 1 of 2 mutually exclusive groups
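For the fair-coin example, an exact two-sided p-value can be sketched with the binomial formula. The result of 8 heads in 10 tosses is made up, and the two-sided rule used here (add up every outcome that is no more likely than the observed one) is one common convention among several.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k observations in group 1 out of n, when the proportion is p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def two_sided_p_value(k, n, p):
    """Total probability of all outcomes that are no more likely than the observed k."""
    observed = binomial_pmf(k, n, p)
    probabilities = [binomial_pmf(i, n, p) for i in range(n + 1)]
    # The small tolerance guards against floating-point rounding on equally likely outcomes.
    return sum(q for q in probabilities if q <= observed * (1 + 1e-9))


# Hypothetical experiment: 8 heads in 10 tosses, testing H0: proportion of heads = 0.5.
p_value = two_sided_p_value(k=8, n=10, p=0.5)
print(p_value)
```

A p-value above the chosen significance level (for example 0.05) means the data are consistent with the specified proportion.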

6. Single sample runs test

Can be used to determine if the distribution of a series of binary events in a population is random.

Most of the other tests have considered whether or not the sample data are consistent with a particular distribution. While in single sample runs test our aim is to determine if there is a bias in the distribution of the events.

H0: The events in the underlying population represented by the sample are distributed randomly.

HA: The events are distributed non-randomly.

In order to conduct this test we need to know the number of runs in the series of observations. A run is a sequence within a series in which one of the 2 alternatives occurs on consecutive trials.
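Counting runs is easy to sketch with `itertools.groupby` from the Python standard library, which groups consecutive equal items. The series of coin tosses is made up.

```python
from itertools import groupby


def count_runs(series):
    """Number of runs: maximal stretches of the same outcome on consecutive trials."""
    return sum(1 for _ in groupby(series))


# Hypothetical series of coin tosses: HH | TTT | H | T | HH
tosses = "HHTTTHTHH"
number_of_runs = count_runs(tosses)
print(number_of_runs)
```

Very few runs (long stretches of the same outcome) or very many runs (constant alternation) both suggest that the events are not distributed randomly.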

7. Spearman's rank correlation (rho)

Two or more sample tests

1. Mann-Whitney U test

We use this test to determine if two independent samples represent two populations with different median values.

H0: θ1 = θ2 (the two medians are equal)

HA: θ1 ≠ θ2
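The Mann-Whitney U statistic can be sketched by comparing every value in one sample with every value in the other. The two samples are invented, and the sketch stops at the statistic; the decision itself needs a table of critical values or software.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b): U_a counts the pairs in which the value from sample A is larger."""
    u_a = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u_a += 1
            elif a == b:
                u_a += 0.5  # a tie counts as half for each sample
    u_b = len(sample_a) * len(sample_b) - u_a
    return u_a, u_b


# Two hypothetical independent samples.
u_a, u_b = mann_whitney_u([3, 5, 8], [4, 6, 7, 9])
print(u_a, u_b, min(u_a, u_b))
```

The smaller of the two values is usually reported as U. If the medians are equal, both values should be close to half the number of pairs.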

2. Kolmogorov-Smirnov test for 2 independent samples

This test is similar to the one sample Kolmogorov-Smirnov test, except that now we are testing to determine if two independent samples represent 2 different populations

3. Wilcoxon matched-pairs signed-ranks test

It is the most useful test to see whether the members of a pair differ in size.

Before the Wilcoxon matched-pairs signed-ranks test, let me introduce you to a new term: "matched-pairs experimental design".

This is an experimental design in which each participant is matched with another participant who shares similar or identical traits, such as age or ethnicity.

The two samples would be dependent because the decisions are made by the same individual.

Wilcoxon matched-pairs signed-ranks test cont...

The Wilcoxon matched-pairs signed-ranks test can therefore be used to determine if two dependent samples represent two different populations.

H0: θD = 0

HA: θD ≠ 0

The null hypothesis is that the median of the difference scores equals zero.

Assumptions:

The sample of n subjects has been randomly selected from the population it represents.

Data is ordinal.

The distribution of the difference scores in the populations represented by the two samples is symmetric about the median of the population of difference scores.

4. Friedman test

We use the Friedman test for testing the difference between several related samples.

Summary: Parametric vs. non-parametric tests

                          Parametric                  Non-parametric
Assumed distribution      Normal                      Any
Assumed variance          Homogeneous                 Any
Typical data              Ratio or interval           Ordinal or nominal
Data set relationships    Independent                 Any
Usual central measure     Mean                        Median
Benefits                  Can draw more conclusions   Simplicity; less affected by outliers

Tests

Choosing                           Choosing a parametric test             Choosing a non-parametric test
Correlation test                   Pearson                                Spearman
Independent measures, 2 groups     Independent-measures t-test            Mann-Whitney test
Independent measures, >2 groups    One-way, independent-measures ANOVA    Kruskal-Wallis test
Repeated measures, 2 conditions    Matched-pair t-test                    Wilcoxon test
Repeated measures, >2 conditions   One-way, repeated-measures ANOVA       Friedman's test