By: Miss Lucia Kulatau & Christopher
Data Analysis & Reporting DAR611S – FM
Unit 1: Introduction to Data Analysis
Objectives:
After completion of this unit you will be able to:
Define data
Understand the types of data and their differences
Know what raw data is
What is Data Analysis and Reporting
It is a process of organising, analysing and interpreting data into meaningful information. This process is driven by getting data from the right source or data set, which is to be used to make a decision or reach a conclusion.
What is data? Data is the plural form of the Latin word datum.
Data means given or fact.
For instance, if I ask you your name, your answer to my question is a fact.
If I go on to ask you your age, your place of birth, your religion, the political party you belong to, your height, the number of children you have, etc., the answers you provide to my questions are all facts concerning yourself. Hence, they are data.
Data is raw information which still needs to be processed by statistical tools to make sense.
Difference between Data and Information
Data: raw, still requires processing.
Processing: done with statistical tools.
Information: what data are presented as once they have been processed.
Data means nothing until it is processed.
Types of Data
Data comes in several forms
Knowing the type of data helps to decide how best to collect it.
It also determines the appropriate ways to display it.
Data are either qualitative or quantitative.
Qualitative versus Quantitative
Previously we posed several questions, the answers to which provided us with 'data'.
The questions were name, age, place of birth, religion, political party of affiliation, height, and number of children. The answers to those questions can be summarised into two categories:
(i). those that are numerical (i.e. involve numbers); and
(ii). those that are non-numeric (i.e. do not involve numbers).
Quantitative Data
Quantitative - in the literal sense that some quantity has to be mentioned.
Example - Questions referred to earlier that yield quantitative data are age, height and number of children. One has to mention some quantity in answering each of those questions, even if that quantity happens to be zero, e.g. in mentioning the number of children one has.
Qualitative
Qualitative (or categorical) in the sense that they are non-numeric and all they do is to spell out some quality or category of the object being referred to.
Example: The rest of the questions yield only qualitative (or categorical) data, e.g. name, place of birth, religion and one's political affiliation. No quantity is entailed in answering any of those questions. Just one's 'quality' or 'category' is required.
Discrete versus Continuous Data
Height and age can take on decimals and fractions, while the answer to the last question (viz. number of children) can only be a whole number, i.e. an integer. While somebody can be 22 ½ years old and 1.78 metres tall, one can never have 3.5 children, or anything like that. It can only be 0, 1, 2, 3, etc. children.
Quantitative data that can assume only whole numbers are said to be discrete.
Quantitative data that can take on decimals and fractions are said to be continuous.
In order to have discrete or continuous data, the data themselves have to be quantitative in the first place.
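The two distinctions above can be sketched in a few lines of Python. This is only an illustration: the respondent and the answers are made up, and the rule that a value stored as an `int` is discrete while a value stored as a `float` is continuous is an assumption of this sketch, not a statistical rule (an age recorded as the whole number 22 is still a continuous variable).

```python
from numbers import Real

def classify(value):
    """Classify one answer using the definitions above (storage-type heuristic)."""
    # bool is a subclass of int in Python, so treat it as non-numeric here.
    if isinstance(value, bool) or not isinstance(value, Real):
        return "qualitative"
    if isinstance(value, int):
        return "quantitative, discrete"
    return "quantitative, continuous"

# Hypothetical respondent; every value is invented for illustration.
answers = {
    "name": "Maria",
    "place_of_birth": "Windhoek",
    "age": 22.5,
    "height_m": 1.78,
    "number_of_children": 0,
}

for question, value in answers.items():
    print(f"{question}: {classify(value)}")
```

Note that zero children is still classified as quantitative and discrete, in line with the remark that a quantity has to be mentioned even when it happens to be zero.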
Primary versus Secondary Data
Primary data are data we collect ourselves and use for the very purpose we collected them for.
Secondary data, on the other hand, are data that were collected by somebody else, or for a different purpose, and that we adapt to our current needs.
NB: The distinction between the two lies in two conditions:
Who collected the data?
How are the data used?
Example of Primary Data
I may conduct a survey in Katutura to study the 'Living conditions of the formerly disadvantaged groups of Namibians'. The report I write on that will be based on primary data, since it is I myself who collected the data and I used the data for the very purpose I collected them for.
Note, however, that saying 'I' collected the data myself does not necessarily mean that I went personally to the various households to collect the data. I may have hired research assistants to do the job for me, but since I am the one behind the survey, I am held responsible for the outcome of the exercise.
Hence, I will be said to have collected the data although I may never have set foot in Katutura. The data are therefore primary, since both conditions are satisfied.
Example of Secondary Data
Now, suppose that after writing the report I decide to write also on 'Patterns of income and expenditure among the formerly disadvantaged groups in Namibia'.
In my earlier survey I may have collected some data on income and expenditure, which I may decide to adapt for the present need. In this case the data will be considered secondary since the use they have been put to is different from the original intended purpose.
Indeed the data may not even suffice for the new objective since they were not collected for that purpose.
Raw Data
Raw data (sometimes called source data or atomic data) is data that has not been processed for use.
A distinction is sometimes made between data and information to the effect that information is the end product of data processing.
Example of Raw Data
Although raw data has the potential to become "information," it requires selective extraction, organization, and sometimes analysis and formatting for presentation. For example, a point-of-sale terminal (POS terminal) in a busy supermarket collects huge volumes of raw data each day, but that data doesn't yield much information until it is processed.
Once processed, the data may indicate the particular items that each customer buys, when they buy them, and at what price. Such information can be further subjected to predictive technology analysis to help the owner plan future marketing campaigns. As a result of processing, raw data sometimes ends up in a database, which enables the data to become accessible for further processing and analysis in a number of different ways.
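As a rough sketch of the step from raw data to information, the Python snippet below summarises a handful of point-of-sale records. The records, customer IDs, items and prices are all hypothetical, and a real POS system would of course store far more fields.

```python
from collections import defaultdict

# Hypothetical raw POS records: (customer_id, item, hour_of_day, price).
raw_sales = [
    ("C1", "bread", 9, 12.50),
    ("C2", "milk", 9, 18.00),
    ("C1", "milk", 17, 18.00),
    ("C3", "bread", 17, 12.50),
    ("C2", "bread", 18, 12.50),
]

units_per_item = defaultdict(int)        # how many of each item were sold
revenue_per_item = defaultdict(float)    # money taken per item
sales_per_hour = defaultdict(int)        # when customers buy
items_per_customer = defaultdict(list)   # what each customer buys

for customer, item, hour, price in raw_sales:
    units_per_item[item] += 1
    revenue_per_item[item] += price
    sales_per_hour[hour] += 1
    items_per_customer[customer].append(item)

best_seller = max(units_per_item, key=units_per_item.get)
print("Best seller:", best_seller)
print("Revenue per item:", dict(revenue_per_item))
print("Sales per hour:", dict(sales_per_hour))
```

The individual records say little on their own; the summaries are what an owner could actually act on.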
Tools to interpret raw Data
There are numerous statistical techniques and tools that are available for use.
The type of data one is dealing with has a lot to do with the technique or tool chosen to analyse it.
There are two broad families of techniques: parametric and nonparametric tests.
Here is a summary of the major points you have learnt:
Data are facts and statistics collected together for reference or analysis purposes.
Data can be either numeric (quantitative) or non-numeric (qualitative).
Data that has not been processed for use is called raw data.
Quantitative data are available in two types: discrete or continuous .
Data can be readily available (secondary) or you can collect them yourself for the first time and use (primary).
Homework
The following variables relate to the Hosea Kutako International Airport. Management is collecting the information with a view to upgrading services at the airport. Which variables yield discrete data and which ones yield continuous data?
Average mid-day temperature on the runway.
Weight carried by a landing aeroplane.
Number of passengers going through the departure lounge daily.
Number of planes landing in a day.
Length of delay time for international flights.
Length of queues at check-in points on a normal day.
Number of passengers left behind because of overbooking.
The number of VIPs using the airport in a month.
The average number of crew per departing flight.
Number of complaints about missing luggage per arriving flight.
Unit 2: Parametric and Non-parametric
Objectives
Upon completion of this unit you will be able to:
distinguish between parametric and non-parametric analysis
relate to the different nonparametric tests
apply the different nonparametric tests
Normal Distribution
Features of a normal distribution
Its distribution has a bump in the middle, with tails going down and out to the left and right (bell shaped)
Its shape is symmetric
Its mean is equal to its median
The total area under the curve is 1 (or 100%)
A normal distribution has the same shape as the standard normal distribution, for which μ = 0 and σ = 1
Its values are distributed symmetrically, and most of them are situated around the mean
Values are equally likely to plot either above or below the mean
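These features can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module; the distribution of heights (mean 170, standard deviation 10) is a hypothetical example chosen only to show standardisation with z = (x − μ) / σ.

```python
from statistics import NormalDist

standard = NormalDist(mu=0, sigma=1)

# Symmetry: half of the area lies below the mean.
area_below_mean = standard.cdf(0)

# Mean equals median: the 50th percentile sits at the mean.
median = standard.inv_cdf(0.5)

# Most values are situated around the mean: area within one sigma of it.
within_one_sd = standard.cdf(1) - standard.cdf(-1)

# Same shape as the standard normal: a hypothetical height distribution,
# standardised with z = (x - mu) / sigma, gives the same probability.
heights = NormalDist(mu=170, sigma=10)
z = (185 - heights.mean) / heights.stdev
p_heights = heights.cdf(185)
p_standard = standard.cdf(z)

print(f"area below mean = {area_below_mean}")
print(f"median = {median}")
print(f"area within one sigma = {within_one_sd:.4f}")
print(f"z = {z}, P(height <= 185) = {p_heights:.4f}, P(Z <= z) = {p_standard:.4f}")
```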
Violation of Normality
Use of a histogram: a histogram is a visual summary of the distribution of values and can be used to check for normality.
Ways in which data can deviate from normal:
Lack of symmetry (skewness): the most frequent scores (the tall bars on the graph) are clustered at one end of the scale. A distribution can be positively or negatively skewed.
Pointiness or peakedness (kurtosis).
Normality can also be violated when there are outliers, when there are limits of detection, and when the outcome is an ordinal variable or a rank.
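Skewness and kurtosis can also be put into numbers. The sketch below uses the moment-based (population) formulas, which are one common definition among several; the three small data sets are invented for illustration. A symmetric sample has skewness 0, and a normal distribution has excess kurtosis 0.

```python
def central_moment(data, k):
    """k-th central moment: the mean of (x - mean) ** k."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** k for x in data) / len(data)

def skewness(data):
    """Moment-based skewness: positive means a longer right tail."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

# Hypothetical samples.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 2, 3, 10]   # scores clustered at the low end
left_skewed = [-x for x in right_skewed]

for name, data in [("symmetric", symmetric),
                   ("right skewed", right_skewed),
                   ("left skewed", left_skewed)]:
    print(f"{name}: skewness = {skewness(data):.3f}, "
          f"excess kurtosis = {excess_kurtosis(data):.3f}")
```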
What are parametric tests?
Parametric tests are the traditional tests in hypothesis testing.
They depend on the specification of a probability distribution.
Parametric tests are said to depend on distributional assumptions
A sample statistic is obtained to estimate the population parameter.
When the assumptions are not met in the sample data, the statistic may not be a good estimate of the parameter.
Assumptions
Observations are independent.
The sample data have a normal distribution.
Scores in different groups have homogeneous variances
Examples of Parametric test
T test
F test
Z test
ANOVA
What are nonparametric tests
Nonparametric tests do not depend on a specified probability distribution.
They do not rely on the restrictive assumptions of parametric tests.
They do not assume that the data come from a normal distribution.
Nonparametric methods are normally used when:
the distribution is not known,
the data are nominal or ordinal scaled, or
the sample size is small.
Types of Nonparametric Statistical Tests
We can categorise nonparametric statistical tests into two groups:
Single sample tests and
Two or more sample tests
Single sample tests
Many of these tests require that the data are ordinal and can be ranked.
1. Wilcoxon Signed Ranks Test
Used to determine whether or not the median value of a sample equals a specified value.
Example:
let θ represent the median value and
μ the specified (hypothesized) value
Wilcoxon Signed Ranks Test cont…
Now let us construct our null and alternative hypotheses .
H0: θ = μ
HA: θ ≠ μ or θ > μ or θ < μ
I have listed three alternative hypotheses; hopefully you all know enough about statistical testing to know that we only specify one of the three.
Wilcoxon Signed Ranks Test cont…
Assumptions:
The sample has been randomly selected from the population it represents.
The original scores are in interval/ratio format.
The underlying population distribution is symmetric.
Note: The Wilcoxon signed ranks test ranks difference scores, which are the differences between the actual score and the hypothesized median.
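The ranking of difference scores described in the note can be sketched as follows. The eight scores and the hypothesized median of 50 are hypothetical, and the function names are this sketch's own. The code only computes the two rank sums; deciding significance still requires a table of critical values or statistical software.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1) upward, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1      # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def signed_rank_sums(sample, hypothesised_median):
    """Return (sum of ranks of positive differences, sum for negative ones)."""
    # Scores equal to the hypothesised median carry no sign and are dropped.
    diffs = [x - hypothesised_median for x in sample if x != hypothesised_median]
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return w_plus, w_minus

scores = [52, 47, 55, 58, 44, 61, 49, 53]    # hypothetical sample
w_plus, w_minus = signed_rank_sums(scores, 50)
test_statistic = min(w_plus, w_minus)
print(w_plus, w_minus, test_statistic)       # 25.5 10.5 10.5
```

If the median really were 50, the two rank sums would be expected to be similar; a very small value of the smaller sum is evidence against H0.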
2. Kruskal-Wallis Test
We use Kruskal-Wallis test when we are comparing three or more samples.
The null hypothesis: all populations have identical distribution functions
The alternative hypothesis: at least two of the samples differ with respect to location (median).
Using the Kruskal-Wallis Test, we can decide whether the population distributions are identical without assuming them to follow the normal distribution
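A minimal sketch of the Kruskal-Wallis statistic, H = 12 / (N(N+1)) · Σ(R_j² / n_j) − 3(N+1), where R_j is the rank sum of group j and N is the total number of observations. The three groups are hypothetical, and the sketch assumes there are no tied values (the usual tie correction is left out).

```python
def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H for groups with no tied values (no tie correction)."""
    pooled = sorted(x for group in groups for x in group)
    if len(set(pooled)) != len(pooled):
        raise ValueError("this sketch assumes no tied values")
    rank = {x: i + 1 for i, x in enumerate(pooled)}   # rank in the pooled data
    n_total = len(pooled)
    sum_term = sum(sum(rank[x] for x in group) ** 2 / len(group)
                   for group in groups)
    return 12 / (n_total * (n_total + 1)) * sum_term - 3 * (n_total + 1)

# Hypothetical scores for three independent samples.
group_1 = [12, 15, 11]
group_2 = [18, 20, 17]
group_3 = [25, 22, 27]

h = kruskal_wallis_h(group_1, group_2, group_3)
print(f"H = {h:.2f}")
```

For larger samples H is usually compared with a chi-square distribution with (number of groups − 1) degrees of freedom; with groups as small as these, that approximation is rough and exact tables are preferable.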
3. Kolmogorov-Smirnov goodness of fit test for a single sample (KS Test)
We use this test to determine if a distribution of sample observations conforms to a specified probability distribution.
The distribution may be a theoretical one (like the uniform or normal) or may be one derived from previous empirical observation.
H0: F(X) = F0(X)
HA: F(X) ≠ F0(X)
where F0(X) is the specified distribution.
Assumptions:
The variable should be continuous.
The data must be at least ordinal, as a cumulative frequency distribution must be constructed.
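The KS test statistic is the largest vertical gap between the cumulative frequency distribution of the sample and the specified distribution F0. The sketch below computes that gap for a hypothetical sample tested against a uniform distribution on [0, 1]; the function names are this sketch's own, and judging significance would still need KS critical values or statistical software.

```python
def ks_statistic(sample, cdf):
    """Largest gap between the sample's cumulative distribution and cdf (F0)."""
    ordered = sorted(sample)
    n = len(ordered)
    gaps = []
    for i, x in enumerate(ordered, start=1):
        f0 = cdf(x)
        gaps.append(i / n - f0)          # sample distribution just after x
        gaps.append(f0 - (i - 1) / n)    # sample distribution just before x
    return max(gaps)

def uniform_cdf(x):
    """F0 for the uniform distribution on [0, 1]."""
    return min(max(x, 0.0), 1.0)

# Hypothetical observations, assumed to lie between 0 and 1.
sample = [0.62, 0.05, 0.88, 0.33, 0.71, 0.21, 0.95, 0.48]
d = ks_statistic(sample, uniform_cdf)
print(f"D = {d:.2f}")
```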
4. Chi-square goodness of fit test for a single sample
A test which is used to determine if observed cell frequencies differ from expected frequencies.
It is used with categorical data; e.g. this test can be used to answer questions such as whether or not a die is fair.
H0: Observed cell frequencies are equal to expected cell frequencies for all cells
HA: Observed frequency is not equal to expected frequency for at least one cell
If even one cell frequency differs significantly from its expected frequency, the null hypothesis is rejected.
4. Chi-square cont…
Assumptions
Data is categorical/nominal
Random sample of n independent observations
Expected frequency of each cell is at least 5
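The fair-die question can be worked through as a short sketch. The observed counts for 60 rolls are hypothetical; the statistic is χ² = Σ (observed − expected)² / expected, compared here with 11.070, the 5% critical value of the chi-square distribution with 6 − 1 = 5 degrees of freedom.

```python
# Hypothetical counts of faces 1..6 in 60 rolls of a die.
observed = [8, 9, 12, 11, 6, 14]
n_rolls = sum(observed)

# Under H0 (fair die) every face is expected equally often.
expected = [n_rolls / 6] * 6

# Check the assumption that every expected frequency is at least 5.
assumption_ok = all(e >= 5 for e in expected)

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

CRITICAL_5PCT_DF5 = 11.070   # chi-square critical value, alpha = 0.05, df = 5
reject_h0 = chi_square > CRITICAL_5PCT_DF5

print(f"chi-square = {chi_square:.1f}, reject H0: {reject_h0}")
```

With these counts the statistic is well below the critical value, so there is no evidence at the 5% level that the die is unfair.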
5. Binomial sign test for a single sample
We use this test when the data can be categorized into 2 groups. The test determines if the proportion of observations in one group equals a specific value.
H0: π1 = µ
HA: π1 ≠ µ
where π1 refers to the proportion of observations in group 1 and µ refers to the specified value. For example, if the exercise were to determine whether or not a coin was fair, then µ = 0.5.
Assumptions:
Each of n independent observations is randomly selected from a population
Each observation can be classified into 1 of 2 mutually exclusive groups
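For the coin example, an exact two-sided p-value can be computed directly from the binomial distribution. In this sketch the 20 tosses and 15 heads are hypothetical, `p0` plays the role of the specified value µ, and the two-sided p-value is taken as the total probability of all outcomes that are no more likely than the one observed (one common convention among several).

```python
from math import comb

def binomial_pmf(k, n, p0):
    """Probability of exactly k observations in group 1 out of n under H0."""
    return comb(n, k) * p0 ** k * (1 - p0) ** (n - k)

def binomial_two_sided_p(successes, n, p0):
    """Sum the probabilities of all outcomes no more likely than the observed one."""
    observed = binomial_pmf(successes, n, p0)
    return sum(
        binomial_pmf(k, n, p0)
        for k in range(n + 1)
        # small tolerance so equally likely outcomes are not lost to rounding
        if binomial_pmf(k, n, p0) <= observed * (1 + 1e-9)
    )

n_tosses, n_heads = 20, 15          # hypothetical experiment
p_value = binomial_two_sided_p(n_heads, n_tosses, 0.5)
print(f"p-value = {p_value:.4f}")   # about 0.0414
```

At the 5% level this p-value would lead to rejecting H0, i.e. to doubting that the coin is fair.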
6. Single sample runs test
Can be used to determine if the distribution of a series of binary events in a population is random.
Most of the other tests have considered whether or not the sample data are consistent with a particular distribution. In the single sample runs test, our aim is instead to determine if there is a bias in the distribution of the events.
H0: The events in the underlying population represented by the sample are distributed randomly.
HA: The events are distributed non-randomly.
In order to conduct this test we need to know the number of runs in the series of observations. A run is a sequence within a series in which one of the 2 alternatives occurs on consecutive trials.
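Counting runs is easy to sketch in code. The coin-toss series below are hypothetical; the function only counts the runs, and the count would then be compared with runs-test critical values.

```python
def count_runs(series):
    """Number of runs: maximal stretches of identical consecutive outcomes."""
    if not series:
        return 0
    # A new run starts wherever an outcome differs from the one before it.
    return 1 + sum(1 for previous, current in zip(series, series[1:])
                   if previous != current)

print(count_runs("HHTHTTTHHT"))   # 6 runs: HH, T, H, TTT, HH, T
print(count_runs("HHHHHTTTTT"))   # 2 runs: too few, outcomes are clustered
print(count_runs("HTHTHTHTHT"))   # 10 runs: too many, outcomes alternate
```

Both very few and very many runs suggest that the events are not randomly distributed.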
Two or more sample tests
1. Mann-Whitney U test
We use this test to determine if two independent samples represent two populations with different median values.
H0: θ1 = θ2 (the two medians are equal)
HA: θ1 ≠ θ2
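A minimal sketch of the U statistic, using the pair-counting definition: for every pair made of one observation from each sample, count 1 if the observation from the first sample is larger and 0.5 if the two are tied. The two samples are hypothetical, and the result would still have to be compared with Mann-Whitney critical values.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) by counting pairs in which each sample 'wins'."""
    u_a = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in sample_a
        for b in sample_b
    )
    # Every pair is counted once in total, so the two U values sum to n_a * n_b.
    u_b = len(sample_a) * len(sample_b) - u_a
    return u_a, u_b

# Hypothetical scores from two independent groups.
group_a = [12, 15, 11, 19]
group_b = [18, 20, 17, 25, 22]

u_a, u_b = mann_whitney_u(group_a, group_b)
u = min(u_a, u_b)
print(u_a, u_b, u)   # 2.0 18.0 2.0
```

If the two medians were equal, the two U values would be expected to be similar; a very small value of the smaller U is evidence against H0.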
2. Kolmogorov-Smirnov test for 2 independent samples
This test is similar to the one sample Kolmogorov-Smirnov test, except that now we are testing to determine if two independent samples represent 2 different populations
3. Wilcoxon matched-pairs signed-ranks test
It is the most useful test to see whether the members of a pair differ in size.
Before the Wilcoxon matched-pairs signed-ranks test, let me introduce you to a new term: “matched-pairs experimental design”.
This is an experimental design in which each participant is matched with another participant who shares similar or identical traits, such as age or ethnicity.
The two samples are dependent because the paired observations come from the same individual or from matched individuals.
Wilcoxon matched-pairs signed-ranks test cont...
The Wilcoxon matched-pairs signed-ranks test can therefore be used to determine if two dependent samples represent two different populations.
H0: θD = 0
HA: θD ≠ 0
The null hypothesis is that the median of the difference scores equals zero.
Assumptions:
The sample of n subjects has been randomly selected from the population it represents.
The data are ordinal.
The distribution of the difference scores in the populations represented by the two samples is symmetric about the median of the population of difference scores.
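The matched-pairs version works on the difference scores within each pair. The sketch below uses hypothetical before and after scores for seven subjects; the function name is this sketch's own, and the resulting statistic would be compared with Wilcoxon critical values for the number of non-zero differences.

```python
def matched_pairs_rank_sums(before, after):
    """Rank sums of the positive and negative difference scores (after - before)."""
    # Pairs with no difference carry no sign and are dropped.
    diffs = [a - b for b, a in zip(before, after) if a != b]
    ordered = sorted(abs(d) for d in diffs)
    n = len(ordered)
    # Rank of each absolute difference: average of its first and last
    # 1-based positions in the sorted list, which handles ties.
    rank_of = {
        v: (ordered.index(v) + 1 + n - ordered[::-1].index(v)) / 2
        for v in set(ordered)
    }
    w_plus = sum(rank_of[abs(d)] for d in diffs if d > 0)
    w_minus = sum(rank_of[abs(d)] for d in diffs if d < 0)
    return w_plus, w_minus

# Hypothetical scores for the same seven subjects under two conditions.
before = [72, 65, 80, 58, 90, 77, 68]
after = [75, 64, 86, 58, 88, 84, 73]

w_plus, w_minus = matched_pairs_rank_sums(before, after)
test_statistic = min(w_plus, w_minus)
print(w_plus, w_minus, test_statistic)   # 18.0 3.0 3.0
```

Under H0 (median difference of zero) the positive and negative rank sums should be roughly equal.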
4. Friedman test
We use the Friedman test for testing the difference between several related samples.
Summary: Parametric vs. non-parametric tests
Characteristic | Parametric | Non-parametric
Assumed distribution | Normal | Any
Assumed variance | Homogeneous | Any
Typical data | Ratio or Interval | Ordinal or Nominal
Data set relationships | Independent | Any
Usual central measure | Mean | Median
Benefits | Can draw more conclusions | Simplicity; less affected by outliers
Tests
Choosing | Choosing a parametric test | Choosing a non-parametric test
Correlation test | Pearson | Spearman
Independent measures, 2 groups | Independent-measures t-test | Mann-Whitney test
Independent measures, >2 groups | One-way, independent-measures ANOVA | Kruskal-Wallis test
Repeated measures, 2 conditions | Matched-pair t-test | Wilcoxon test
Repeated measures, >2 conditions | One-way, repeated measures ANOVA | Friedman's test