Post on 24-May-2018
MAT0144
hmz/june2016 1
ERRATA: CHAPTER 1 INTRODUCTION TO STATISTICS
Page 5: (please cancel “Types of data”)
Variables
Variables can be classified as either qualitative or quantitative. Quantitative variables are further
classified as either discrete or continuous. The following chart summarises the classifications.
Variables
  Quantitative (numerical)
    Discrete: Can be counted. Can assume only fixed values with no intermediate values. Example: number of children, shoe size.
    Continuous: Can be measured. Can assume any value in an interval. Example: height, duration, temperature, expenses.
  Qualitative (non-numerical): Can be placed into distinct categories, according to some characteristic or attribute. Often referred to as categorical variables. Example: gender, race, colour.
Page 7: Levels of Measurement
In addition to being classified as qualitative or quantitative, variables can be classified by how they
are categorized, counted or measured. This type of classification uses measurement scales of which
there are four types: nominal, ordinal, interval and ratio. The types of variables and their relation to
the levels of measurement are summarised in the following chart.
Variables
  Qualitative: Nominal, Ordinal
  Quantitative (Discrete, Continuous): Interval, Ratio
Page 8:
The levels of measurement of variables are briefly explained in the following table.
LEVEL DESCRIPTION
Nominal
  Data are qualitative.
  No inherent order between categories (we cannot say that one particular category is better than another).
  The lowest of the four ways to characterise data.
  Deals with names, categories or labels.
  Example: blood group, gender, yes or no response to a survey, favourite breakfast food, race.
Ordinal
  Data are qualitative.
  Data can be ordered.
  There are no meaningful differences between the data ranks (the difference between two ranks of an ordinal scale cannot be assumed to be the same as the difference between two other ranks).
  The next level after nominal.
  Example: ranks (1st, 2nd, 3rd), ranks (Good, Better, Best), Likert scale (Strongly Agree/ Agree/ Neutral/ Disagree/ Strongly Disagree), examination grades, size of T-shirt.
Interval
  Data are quantitative.
  Data can be ranked.
  No true 0 (“0” does not mean absence of the quantity being measured).
  The next level after ordinal.
  Example: temperature measured in degrees Celsius.
  Temperature on this scale does not have a true 0 point even though one of the scaled values happens to carry the name "zero". 0℃ does not represent the complete absence of temperature (the absence of any molecular kinetic energy).
  Other examples: IQ, date when measured from an arbitrary epoch (AD, BC), direction measured in degrees from true or magnetic north.
Ratio
  Data are quantitative.
  0 is meaningful (“0” indicates absence of the quantity being measured).
  The highest level of measurement.
  Example: amount of money.
  Money is measured on a ratio scale because, in addition to having the properties of an interval scale, it has a true 0 point: if you have 0 money, this implies the absence of money.
  Since money has a true 0 point, it makes sense to say that someone with RM50 has twice as much money as someone with RM25.
  Other examples: weight, height, time taken.
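The difference between interval and ratio scales can be checked with a little arithmetic. The following is a minimal sketch, not part of the original notes: the two temperatures (20℃ and 10℃) and the helper celsius_to_kelvin are illustrative assumptions; the RM50 and RM25 amounts come from the table above.

```python
# Ratio scale: money has a true zero, so ratios are meaningful.
rm_a, rm_b = 50, 25
money_ratio = rm_a / rm_b  # RM50 really is twice RM25

# Interval scale: 0 degrees Celsius is an arbitrary zero.
# The temperatures below are made-up values for illustration.
c_hot, c_cold = 20.0, 10.0
naive_ratio = c_hot / c_cold  # equals 2.0, but 20 C is NOT "twice as hot" as 10 C


def celsius_to_kelvin(c):
    """Convert Celsius to kelvin, a scale whose zero is absolute zero."""
    return c + 273.15


# On the kelvin scale (a ratio scale) the ratio is only about 1.035.
kelvin_ratio = celsius_to_kelvin(c_hot) / celsius_to_kelvin(c_cold)

# Differences, however, are meaningful on an interval scale:
# a 10-degree gap is the same gap on either scale.
diff_c = c_hot - c_cold
diff_k = celsius_to_kelvin(c_hot) - celsius_to_kelvin(c_cold)

print(money_ratio, naive_ratio, round(kelvin_ratio, 4), diff_c, round(diff_k, 2))
```

The naive Celsius ratio changes as soon as the zero point is moved, which is why statements such as "twice as much" are reserved for ratio data.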
Example 4
State whether the following are nominal, ordinal, interval or ratio data.
(a) A Statistics test taken by a student is classified as either easy, difficult or very difficult, and these
alternatives are coded 1, 2 and 3.
(b) The IQ scores of 300 MENSA members in Malaysia recorded upon signing up.
(c) The platelet count of dengue patients at a hospital recorded within three days of admission.
(d) The make (brand) of cars reviewed by a newspaper columnist in a year.
(e) A list of temperatures in degrees Kelvin for the month of May compiled by a meteorologist.
(f) The most expensive cars for the year 2015 listed by a car magazine.
Page 9:
Data Collection and Sampling Techniques
Sampling is the process of selecting a number of subjects for a study in such a way that the subjects
represent the larger group from which they were selected. The reason for conducting a sample survey is
to estimate the value of some attribute of a population. The true value of a population attribute is called a
population parameter. A sample statistic which is obtained from sample data is used as an estimate of a
population parameter.
The quality of a sample statistic (i.e., accuracy, precision, representativeness) is strongly affected by the
way that sample elements are chosen, that is, by the sampling method.
As a group, sampling methods fall into one of two categories.
Probability samples: Each subject in the population has a known (non-zero) chance of being chosen for the sample.
Non-probability samples: The probability that each subject in the population will be chosen is unknown, and/or it cannot be determined that each subject in the population has a non-zero chance of being chosen.
In this syllabus, we will only discuss probability sampling methods. The key benefit of probability
sampling methods is that, because every subject has a known chance of selection, the sample is likely to be
representative of the population and the sampling error can be estimated. This supports valid statistical
conclusions.
The main types of probability sampling methods are simple random sampling, systematic sampling,
stratified sampling, and cluster sampling. These types of probability sampling methods are explained in
the table below.
METHOD DESCRIPTION
Simple Random Sampling
  A basic type of sampling which can be a component of other more complicated sampling methods.
  Every subject in the population has an equal and known chance of being selected.
  Subjects are selected by random numbers (random numbers can be generated by using MS Excel).
  Since it is free of classification error, it requires minimum advance knowledge of the population. Its simplicity also makes it relatively easy to interpret data collected in this manner.
  Best suits situations where not much information is available about the population and data collection can be efficiently conducted on randomly distributed items, or where the cost of sampling is small enough to make efficiency less important than simplicity.
Systematic Sampling
  Also called an Nth name selection technique. After the required sample size has been calculated, every Nth record is selected from a list of population subjects.
  As long as the list does not contain any hidden order, this sampling method is as good as the simple random sampling method.
  Its only advantage over the random sampling technique is simplicity.
  Frequently used to select a specified number of records from a computer file.
Stratified Sampling
  A commonly used probability method that is often superior to simple random sampling because it can reduce sampling error.
  A stratum is a subset of the population that shares at least one common characteristic. Examples of strata might be males and females, juniors and seniors in a university, or managers and non-managers, or based on geography (north, south, east, west).
  The relevant strata and their actual representation in the population are identified. Simple random sampling is then used to select a sufficient number of subjects from each stratum. ("Sufficient" refers to a sample size large enough for us to be reasonably confident that the sample from each stratum represents that stratum.)
  Stratified sampling is often used when one or more of the strata in the population have a low incidence relative to the other strata.
Cluster Sampling
  The population is divided into groups, called clusters.
  A number of clusters to be included in the sample is selected using a probability sampling method (usually simple random sampling).
  Each subject of the population can be assigned to one, and only one, cluster.
  Only subjects within sampled clusters are surveyed.
  There are two types of cluster sampling method:
    One-stage sampling: All of the subjects within selected clusters are included in the sample.
    Two-stage sampling: A subset of subjects within selected clusters is randomly selected for inclusion in the sample.
  The main disadvantage of cluster sampling is that it generally provides less precision than either simple random sampling or stratified sampling.
  Cluster sampling should only be used when it is economically justified, that is, when the reduced cost outweighs the loss in precision.
  One version of cluster sampling is area cluster sampling or geographical cluster sampling.
The difference between stratified sampling and cluster sampling methods:
  Stratified Sampling: the sample includes elements from each stratum.
  Cluster Sampling: the sample includes elements only from sampled clusters.
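The four probability sampling methods described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 100 numbered subjects, the "north"/"south" strata, the "school_0" to "school_9" clusters and all function names are hypothetical, and Python's random module stands in for the MS Excel random numbers mentioned in the notes.

```python
import random


def simple_random_sample(population, n, rng):
    """Every subject has an equal chance of being selected."""
    return rng.sample(population, n)


def systematic_sample(population, n, rng):
    """Select every k-th record after a random start, where k = N // n."""
    k = len(population) // n
    start = rng.randrange(k)
    return [population[start + i * k] for i in range(n)]


def stratified_sample(strata, n_per_stratum, rng):
    """Take a simple random sample from EVERY stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}


def one_stage_cluster_sample(clusters, n_clusters, rng):
    """Randomly pick whole clusters and include all of their subjects."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}


rng = random.Random(0)            # fixed seed so the sketch is repeatable
population = list(range(1, 101))  # hypothetical subject IDs 1..100

# Hypothetical strata (two regions) and clusters (ten schools of ten subjects).
strata = {"north": population[:50], "south": population[50:]}
clusters = {f"school_{i}": population[i * 10:(i + 1) * 10] for i in range(10)}

srs = simple_random_sample(population, 10, rng)
sys_sample = systematic_sample(population, 10, rng)
strat = stratified_sample(strata, 5, rng)
clus = one_stage_cluster_sample(clusters, 3, rng)

print("Simple random:", srs)
print("Systematic:   ", sys_sample)
print("Stratified:   ", strat)
print("Cluster:      ", clus)
```

Note how the stratified sample always contains subjects from both strata, whereas the cluster sample contains subjects from only the three selected schools.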
Page 10: (please note additional example)
Example 5
Classify each sample as random, systematic, stratified or cluster.
(a) Every tenth car owner using the valet service of a shopping mall in Kuantan is asked to rate the service.
(b) Employees of three oil and gas companies are selected using random numbers to determine annual
salaries.
(c) In a large school district, teachers from nine schools are interviewed to determine if they believe the
newly implemented school-based assessment system has been more effective than the old system.
(d) Students in a university are divided into six groups according to their gender and according to
whether they drive or take public transport to campus. Then 10 students are selected from each group
and interviewed to determine how long they take to come to class every day.
(e) Every 100th cupcake baked is checked to determine its trans-fat content.
Solution
Example 6
An auto analyst is conducting a satisfaction survey, sampling from a list of 10,000 consumers. The list
includes 2,500 Proton buyers, 2,500 Perodua buyers, 2,500 Honda buyers, and 2,500 Toyota buyers. The
analyst selects a sample of 400 car buyers, by randomly sampling 100 buyers of each brand.
Is this an example of simple random sampling? Explain.
Page 17: (please use this data set instead)
Example 4
The ages of owners in a new residential area are shown below. Construct a frequency distribution for the
data using seven equal classes. Compute the relative frequency and cumulative relative frequency of each
class.
41 54 47 40 39 35 50 37 49 42 70 32 44 52 39 50 40 30 34 69 39 45 33 42 44 63 60 27 42 34 50 42 52 38 36 45 35 43 48 46 31 27 55 63 46 33 60 62 45 56 45 34 53 50 50
Solution
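As a way of checking a hand-built table, the tallying can be sketched in Python. This is a minimal sketch, not the official solution: it assumes whole-number class limits starting at the smallest age, and a class width chosen as the smallest whole number for which seven classes cover every value. The course text may prescribe a different starting point or width.

```python
ages = [41, 54, 47, 40, 39, 35, 50, 37, 49, 42, 70, 32, 44, 52, 39, 50, 40,
        30, 34, 69, 39, 45, 33, 42, 44, 63, 60, 27, 42, 34, 50, 42, 52, 38,
        36, 45, 35, 43, 48, 46, 31, 27, 55, 63, 46, 33, 60, 62, 45, 56, 45,
        34, 53, 50, 50]

num_classes = 7
low, high = min(ages), max(ages)

# Assumed rule: smallest whole-number width such that seven classes,
# starting at the minimum, cover every whole-number age.
width = (high - low) // num_classes + 1

rows = []
cumulative = 0
for i in range(num_classes):
    lower = low + i * width
    upper = lower + width - 1  # class limits for whole-number data
    freq = sum(lower <= a <= upper for a in ages)
    cumulative += freq
    rel = freq / len(ages)             # relative frequency
    cum_rel = cumulative / len(ages)   # cumulative relative frequency
    rows.append((lower, upper, freq, rel, cum_rel))

print("Class   f   rel.f  cum.rel.f")
for lower, upper, freq, rel, cum_rel in rows:
    print(f"{lower}-{upper}  {freq:>2}  {rel:.3f}  {cum_rel:.3f}")
```

The relative frequencies should sum to 1, and the last cumulative relative frequency should equal 1; both are quick checks on any hand-computed table.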
Page 22: (please use these questions instead)
Exercise 1.2
3. The expenses (in RM) per visit of patients to a cardiologist’s clinic are tabled below. Construct
a frequency distribution using seven classes. Hence, draw a histogram, a frequency polygon
and an ogive for the data.
130 190 140 80 100 120 220 220 110 100
210 130 100 90 210 120 200 120 180 120
190 210 120 200 130 180 260 270 100 160
190 240 80 120 90 190 200 210 190 180
115 210 110 225 190 130
(a) From the histogram, give the class containing the cardiologist’s fee that most patients
pay.
(b) From the ogive, determine the middle value of the cardiologist’s fee.
4. A study was conducted to find out if part-time jobs affect the academic performance of
secondary school and university students in Malaysia. The pie chart below gives a breakdown
of part-time jobs that Malaysian students do.
(a) Are there any part-time jobs that involve more than 25% of the students?
(b) Which two part-time jobs appear to have the closest percentages of student involvement?
[Pie chart of part-time jobs. Categories: Construction, Small Business, Cashier, Food Stall, Workshop, Factory, Sales, Cinema]
Page 36
Finding quartiles for a data set with an odd number of values:
Firstly, arrange the data in ascending order:
3, 5, 7, 8, 12, 13, 14, 18, 21
Lower half: 3, 5, 7, 8. Upper half: 13, 14, 18, 21. (The median is excluded from both halves.)
Median = $Q_2 = 12$
$Q_1$ = median of lower half = $\frac{5+7}{2} = 6$
$Q_3$ = median of upper half = $\frac{14+18}{2} = 16$
Finding quartiles for a data set with an even number of values:
Firstly, arrange the data in ascending order:
3, 5, 7, 8, 12, 13, 14, 17, 18, 21
Lower half: 3, 5, 7, 8, 12. Upper half: 13, 14, 17, 18, 21.
Median = $Q_2 = \frac{12+13}{2} = 12.5$
$Q_1$ = median of lower half = 7
$Q_3$ = median of upper half = 17
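The median-of-halves procedure above can be sketched in Python as follows. The function name quartiles is an illustrative choice, and the sketch follows the convention used in these notes (the median is left out of both halves when the number of values is odd); other textbooks and software packages use different quartile conventions and may give different answers.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    For an odd number of values, the middle value is excluded
    from both the lower half and the upper half.
    """
    values = sorted(data)          # arrange the data in ascending order
    n = len(values)
    half = n // 2
    lower = values[:half]
    upper = values[half + 1:] if n % 2 else values[half:]
    return median(lower), median(values), median(upper)


odd_set = [3, 5, 7, 8, 12, 13, 14, 18, 21]
even_set = [3, 5, 7, 8, 12, 13, 14, 17, 18, 21]

print(quartiles(odd_set))   # Q1 = 6, Q2 = 12, Q3 = 16
print(quartiles(even_set))  # Q1 = 7, Q2 = 12.5, Q3 = 17
```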
Page 38 (please change example 12 (b))
Example 12
(a) Find $P_{33}$ for Example 10(a)
(b) Find $P_{60}$ for Example 10(c)
Solution
Page 49 (Exercise 1.3 No 5 please use this question instead)
The data below represent the scores of a Placement Test for a group of pre-university students:
SCORE FREQUENCY ($f$)
196.5 – 217.5 5
217.5 – 238.5 17
238.5 – 259.5 22
259.5 – 280.5 48
280.5 – 301.5 22
301.5 – 322.5 6
(a) Find the mean, median and standard deviation.
(b) Compute Pearson’s coefficient of skewness and hence comment on the skewness of the
distribution.
(c) Construct a percentile graph (use a graph paper). Then, find:
i) the number of students who scored 270 or higher.
ii) the percentage of students who scored between 250 and 300 marks.
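For checking answers to part (a) and (b), the grouped-data calculations can be sketched in Python. This sketch rests on assumptions not stated in this errata: class midpoints represent each class, the standard deviation uses the sample divisor n − 1, the median is found by linear interpolation within the median class, and Pearson's coefficient is taken as 3(mean − median)/s. If the course text defines any of these differently, follow the course text.

```python
import math

# Class boundaries and frequencies from the table above.
boundaries = [(196.5, 217.5), (217.5, 238.5), (238.5, 259.5),
              (259.5, 280.5), (280.5, 301.5), (301.5, 322.5)]
freqs = [5, 17, 22, 48, 22, 6]

n = sum(freqs)
midpoints = [(lo + hi) / 2 for lo, hi in boundaries]

# Grouped mean: sum of f * x divided by n, with x the class midpoint.
mean = sum(f * x for f, x in zip(freqs, midpoints)) / n

# Grouped sample standard deviation (ASSUMED divisor n - 1).
variance = sum(f * (x - mean) ** 2 for f, x in zip(freqs, midpoints)) / (n - 1)
std_dev = math.sqrt(variance)

# Grouped median: interpolate inside the class containing the (n/2)-th value.
cum_before = 0
for (lo, hi), f in zip(boundaries, freqs):
    if cum_before + f >= n / 2:
        grouped_median = lo + (n / 2 - cum_before) * (hi - lo) / f
        break
    cum_before += f

# Pearson's coefficient of skewness (ASSUMED form: 3(mean - median) / s).
skewness = 3 * (mean - grouped_median) / std_dev

print(f"n = {n}")
print(f"mean = {mean:.3f}")
print(f"median = {grouped_median:.3f}")
print(f"standard deviation = {std_dev:.3f}")
print(f"Pearson's coefficient of skewness = {skewness:.3f}")
```

A negative coefficient indicates that the mean lies below the median, which corresponds to a distribution skewed to the left.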