Lecture notes 03 1 Probability and Statistics Lecture notes 03

Upload: howard-burke

Post on 12-Jan-2016


Page 1: Lecture notes 031 Probability and Statistics Lecture notes 03

Lecture notes 03 1

Probability and Statistics

Lecture notes 03


Lesson Overview

• Types of Data
  – Qualitative (Categorical)
  – Quantitative (Numerical): Discrete vs. Continuous

• Levels of Measurement:
  – Nominal, Ordinal, Interval, Ratio

• Data Summary and Presentation
  – The Stem-and-leaf Diagram
  – The Frequency Distribution Tables
  – Histogram
  – The Box Plot
  – Time Sequence Plots


Types of Data

Data can be classified as either numeric or nonnumeric. Specific terms are used as follows:

• Qualitative data are nonnumeric. {Poor, Fair, Good, Better, Best}, colors (ignoring any physical causes), and types of material {straw, sticks, bricks} are examples of qualitative data.
  – Qualitative data are often termed categorical data.

Some books use the terms individual and variable to reference the objects and characteristics described by a set of data. They also stress the importance of exact definitions of these variables, including what units they are recorded in. The reason the data were collected is also important.


Types of Data

• Quantitative data are numeric.
  – Quantitative data are further classified as either discrete or continuous.

• Discrete data are numeric data that have a finite (or at most countable) number of possible values. A classic example of discrete data is a finite subset of the counting numbers, {1,2,3,4,5}, perhaps corresponding to {Strongly Disagree, ..., Strongly Agree}.

• When data represent counts, they are discrete. An example might be how many students were absent on a given day. Counts are usually considered exact and integer. Consider, however: if three tardies make an absence, then aren't two tardies equal to 0.67 absences?


Quantitative data / Types of Data

– Continuous data have infinitely many possible values: 1.4, 1.41, 1.414, 1.4142, 1.41421, ... The real numbers are continuous with no gaps or interruptions.

– Physically measurable quantities of length, volume, time, mass, etc. are generally considered continuous. At the physical level (microscopically), especially for mass, this may not be true, but for normal life situations it is a valid assumption.

The structure and nature of data will greatly affect our choice of analysis method. By structure we are referring to the fact that, for example, the data might be pairs of measurements.


Levels of Measurement

• The experimental (scientific) method depends on physically measuring things.

• The concept of measurement has been developed in conjunction with the concepts of numbers and units of measurement.

• Statisticians categorize measurements according to levels.

• Each level corresponds to how this measurement can be treated mathematically.


Levels of Measurement (Measurement Scales) – Four common types

• Nominal: Nominal data have no order and thus only give names or labels to various categories.

• Ordinal: Ordinal data have order, but the interval between measurements is not meaningful.

• Interval: Interval data have meaningful intervals between measurements, but there is no true starting point (zero).

• Ratio: Ratio data have the highest level of measurement. Ratios between measurements as well as intervals are meaningful because there is a starting point (zero). (Gender is something you are born with, whereas sex is something you should get a license for.)


Levels of Measurement (Measurement Scales) – Four common types

• Nominal level data fall into categories that are mutually exclusive (non-overlapping), but there is no order or ranking. For example, professors are divided into departments by subject, but no subject is ranked as better than another.

• Ordinal level data fall into categories that can be ordered, but the differences between categories are not precise. For example: letter grades, movie quality (excellent, good, adequate, bad, terrible).

• Interval level data are ranked on a precise scale, but there is no meaningful zero. For example: IQ scores and temperature in Celsius or Fahrenheit. Neither has a meaningful zero.

• Ratio level data can be ranked, there are precise differences between the ranks, and there is a meaningful zero. For example: height, weight, salary, and age.


Types of Data / Levels of Measurement

• Example 1: Colors
To most people, the colors black, brown, red, orange, yellow, green, blue, violet, gray, and white are just names of colors.

• To an electronics student familiar with color-coded resistors, this data is in ascending order and thus represents at least ordinal data.

• To a physicist, the colors: red, orange, yellow, green, blue, and violet correspond to specific wavelengths of light and would be an example of ratio data.


Types of Data / Levels of Measurement

• Example 2: Temperatures
What level of measurement a temperature is depends on which temperature scale is used. Specific values:
  0°C = 32°F = 273.15 K = 491.67°R
  100°C = 212°F = 373.15 K = 671.67°R
  -17.8°C = 0°F = 255.4 K = 459.67°R
where C refers to Celsius; F refers to Fahrenheit; K refers to Kelvin; R refers to Rankine.

• Only Kelvin and Rankine have true zeroes (starting points), so ratios can be formed. Celsius and Fahrenheit are interval data; certainly order is important and intervals are meaningful. However, a 180° dashboard is not twice as hot as the 90° outside temperature (Fahrenheit assumed)!

• Although ordinal data should not be used for calculations, it is not uncommon to find averages formed from data collected which represented Strongly Disagree, ..., Strongly Agree! Also, averaging nominal data (zip codes, social security numbers) is rather meaningless!
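The dashboard comparison can be checked numerically. The following is a minimal sketch, not part of the original slides: the helper name `fahrenheit_to_kelvin` is an illustrative choice, and the conversion uses the standard relation K = (°F − 32) × 5/9 + 273.15. It compares the ratio of the two readings on the Fahrenheit (interval) scale with the ratio on the Kelvin (ratio) scale.

```python
def fahrenheit_to_kelvin(deg_f: float) -> float:
    """Convert a Fahrenheit reading to kelvins."""
    return (deg_f - 32.0) * 5.0 / 9.0 + 273.15


# Readings from the example: a 180 °F dashboard and 90 °F outside air.
dashboard_f, outside_f = 180.0, 90.0

# Ratio taken directly on the interval scale (not physically meaningful).
naive_ratio = dashboard_f / outside_f

# Ratio taken on a scale with a true zero.
kelvin_ratio = fahrenheit_to_kelvin(dashboard_f) / fahrenheit_to_kelvin(outside_f)

print(f"Fahrenheit ratio: {naive_ratio:.2f}")   # 2.00
print(f"Kelvin ratio:     {kelvin_ratio:.2f}")  # 1.16
```

On the Kelvin scale the dashboard is only about 16% hotter than the outside air, which is why the factor of two read off the Fahrenheit scale carries no physical meaning.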


Data Sources

• Published source

• Designed experiment

• Survey

• Observational study


DATA SUMMARY

Data (Variables)

• Quantitative (Numerical)
  – Discrete
  – Continuous

• Qualitative (Nonnumerical)
  – Nominal (Categorical)
  – Ordinal (Rank ordered categories)


DATA SUMMARY AND PRESENTATION

• The Stem-and-leaf Diagram

• The Frequency Tables:
  – Standard, Relative, and Cumulative

• Histograms

• The Box Plot

• Time Sequence Plots


Graphical Displays

• The distribution of a variable describes what values the variable takes and how often each value occurs.

• The frequency of any value of a variable is the number of times that value occurs in the data.

• The relative frequency of any value is the proportion (fraction or percent) of all observations that have that value.
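The definitions of frequency and relative frequency can be sketched in a few lines of Python. The observations below are hypothetical (they are not data from the lecture), and the variable names are illustrative.

```python
from collections import Counter

# Hypothetical observations of a categorical variable (not from the lecture).
observations = ["Good", "Fair", "Good", "Best", "Poor", "Good", "Fair", "Best"]

# Frequency: the number of times each value occurs in the data.
frequency = Counter(observations)

# Relative frequency: the proportion of all observations with that value.
n = len(observations)
relative_frequency = {value: count / n for value, count in frequency.items()}

for value, count in frequency.items():
    print(f"{value:<5} frequency={count}  relative frequency={relative_frequency[value]:.3f}")
```

The relative frequencies always sum to 1 (or 100%), which is what makes them suitable for comparing data sets of different sizes.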



Types of Variables

• Categorical variable: Places an individual into one of several categories.
  – Examples: Gender, race, political party, zip code

• Quantitative variable: Takes numerical values for which arithmetic operations make sense.
  – Examples: OYS score, number of votes, cost of textbooks


Graphs for categorical variables

• Pie charts require relative frequencies since they display percentages and not raw data. The relative frequency of each category corresponds to the percent of the pie that is occupied by that category.

• Bar graphs display data where the categories are on the horizontal axis and the frequencies (or relative frequencies) are on the vertical axis.


Graphs for quantitative variables

Histograms:

• The data are divided into classes of equal width and the number (or percentage) of observations in each class is counted.

• Data scale is on the horizontal axis.

• Frequency (or relative frequency) scale is on the vertical axis.

• Bars are drawn so that the base of each bar covers the class and the height of each bar equals the frequency (or relative frequency).


Stem-plots or Stem and Leaf Displays:

• Separate each observation into a stem (all but the final, rightmost digit of the (rounded) data) and a leaf (the final digit of the (rounded) data).

• Write the stems in a vertical column, smallest to largest from top to bottom.

• Write each leaf in the row to the right of its stem, in increasing order.


Histograms vs. Stem plots

• Both are used to describe the distribution of data.

• Stemplots display actual data values.

• Stemplots are used for small data sets (fewer than 100 values).

• Histograms can be constructed for larger data sets.


Common Distributional Shapes:

• A symmetric distribution is one where both sides about the center line are approximately mirror images of each other.

• A skewed distribution is one where one side of the center line contains more data than the other.
  – Skewed to the right: The right side of the histogram extends much farther than the left side.
  – Skewed to the left: The left side of the histogram extends much farther than the right side.


Common Distributional Shapes:

• A bimodal distribution has two humps where much of the data lies.  

• All classes occur with approximately the same frequency in a uniform distribution.

• An outlier in any graph of data is an individual observation that falls outside the overall pattern of the graph.


DATA SUMMARY AND PRESENTATION

THE STEM-AND-LEAF DIAGRAM

• A stem-and-leaf diagram is a good way to obtain an informative visual display of a data set x1, x2, ..., xn, where each number xi consists of at least two digits.

• To construct a stem-and-leaf diagram, we divide each number xi into two parts:
  – a stem, consisting of one or more of the leading digits, and
  – a leaf, consisting of the remaining digits.


• Write the stems in a vertical column, smallest to largest from top to bottom.

• Write each leaf in the row to the right of its stem, in increasing order.


THE STEM-AND-LEAF DIAGRAM

• EXAMPLE: Construct a stem-and-leaf display for the following data:

105 221 183 186 121 181 180 143

97 154 153 174 120 168 167 141

245 228 174 199 181 158 176 110

163 131 154 115 160 208 158 133

207 180 190 193 194 133 156 123

134 178 76 167 184 135 229 146

218 157 101 171 165 172 158 169

199 151 142 163 145 171 148 158

160 175 149 87 160 237 150 135

196 201 200 176 150 170 118 149


THE STEM-AND-LEAF DIAGRAM

SOLUTION

• We will select as stem values the numbers 7, 8, 9, 10, 11, …, 24.

• The resulting stem-and-leaf diagram is presented in the following figure.


THE STEM-AND-LEAF DIAGRAM

Stem | Leaf                    | Frequency
   7 | 6                       |  1
   8 | 7                       |  1
   9 | 7                       |  1
  10 | 5 1                     |  2
  11 | 5 8 0                   |  3
  12 | 1 0 3                   |  3
  13 | 4 1 3 5 3 5             |  6
  14 | 2 9 5 8 3 1 6 9         |  8
  15 | 4 7 1 3 4 0 8 8 6 8 0 8 | 12
  16 | 3 0 7 3 0 5 0 8 7 9     | 10
  17 | 8 5 4 4 1 6 2 1 0 6     | 10
  18 | 0 3 6 1 4 1 0           |  7
  19 | 9 6 0 9 3 4             |  6
  20 | 7 1 0 8                 |  4
  21 | 8                       |  1
  22 | 1 8 9                   |  3
  23 | 7                       |  1
  24 | 5                       |  1


THE STEM-AND-LEAF DIAGRAM

Stems are sorted in decreasing order; leaves are sorted in increasing order.

Stem | Leaf                    | Frequency
  24 | 5                       |  1
  23 | 7                       |  1
  22 | 1 8 9                   |  3
  21 | 8                       |  1
  20 | 0 1 7 8                 |  4
  19 | 0 3 4 6 9 9             |  6
  18 | 0 0 1 1 3 4 6           |  7
  17 | 0 1 1 2 4 4 5 6 6 8     | 10
  16 | 0 0 0 3 3 5 7 7 8 9     | 10
  15 | 0 0 1 3 4 4 6 7 8 8 8 8 | 12
  14 | 1 2 3 5 6 8 9 9         |  8
  13 | 1 3 3 4 5 5             |  6
  12 | 0 1 3                   |  3
  11 | 0 5 8                   |  3
  10 | 1 5                     |  2
   9 | 7                       |  1
   8 | 7                       |  1
   7 | 6                       |  1
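The construction can be reproduced with a short script. This is a minimal sketch, not the lecturer's own procedure: the function name `stem_and_leaf` is illustrative, the stem is taken as all but the last digit, and stems are listed smallest to largest as in the construction rules. The data are the 80 observations of the example, with the first value of the ninth row taken as 160, which is what the leaf counts in the displays imply.

```python
from collections import defaultdict

# The 80 observations of the example, row by row.
# Assumption: row 9 starts with 160, as implied by the stem-and-leaf display.
data = [
    105, 221, 183, 186, 121, 181, 180, 143,
    97, 154, 153, 174, 120, 168, 167, 141,
    245, 228, 174, 199, 181, 158, 176, 110,
    163, 131, 154, 115, 160, 208, 158, 133,
    207, 180, 190, 193, 194, 133, 156, 123,
    134, 178, 76, 167, 184, 135, 229, 146,
    218, 157, 101, 171, 165, 172, 158, 169,
    199, 151, 142, 163, 145, 171, 148, 158,
    160, 175, 149, 87, 160, 237, 150, 135,
    196, 201, 200, 176, 150, 170, 118, 149,
]


def stem_and_leaf(values):
    """Map each stem (all but the last digit) to its sorted leaves (last digit)."""
    groups = defaultdict(list)
    for value in values:
        stem, leaf = divmod(value, 10)
        groups[stem].append(leaf)
    # Stems in increasing order, leaves in increasing order within each stem.
    return {stem: sorted(leaves) for stem, leaves in sorted(groups.items())}


display = stem_and_leaf(data)
for stem, leaves in display.items():
    print(f"{stem:>2} | {' '.join(map(str, leaves))} ({len(leaves)})")
```

Reversing the iteration order of the stems gives the decreasing-stem layout; the leaves and frequencies are the same either way.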


THE STEM-AND-LEAF DIAGRAM

• Inspection of this display immediately reveals that most of the data lie between 110 and 200 and that a central value is somewhere between 150 and 160. Furthermore, the data are distributed approximately symmetrically about the central value.

• The stem-and-leaf diagram enables us to determine quickly some important features of the data that were not immediately obvious in the original table.


THE FREQUENCY DISTRIBUTION TABLES

• Frequency Tables
  – Frequency refers to the number of times each category occurs in the original data.

• A frequency table lists in one column the data categories or classes and in another column the corresponding frequencies.

A common way to summarize or present data is with a standard frequency table.


Frequency Tables

• Often, the category column will have continuous data and hence be presented via a range of values. In such a case, terms used to identify the class limits, class boundaries, class widths, and class marks must be well understood.

• Class limits are the smallest and largest numbers that can actually belong to each class. Each class has a lower class limit and an upper class limit.

• Class boundaries are the numbers which separate classes. They are equally spaced halfway between neighboring class limits.


Frequency Tables

• Class marks are the midpoints of the classes. It may be necessary to utilize class marks to find the mean and standard deviation, etc. of data summarized in a frequency table.

• Class width is the difference between two class boundaries (or corresponding class limits).
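A small worked example may help keep these four terms apart. The classes below are hypothetical (integer-valued data with limits 70–89, 90–109, 110–129), and the helper name `describe_class` is illustrative; boundaries are placed halfway across the gap between neighboring class limits, as described above.

```python
# Hypothetical classes for integer-valued data, given as (lower limit, upper limit).
class_limits = [(70, 89), (90, 109), (110, 129)]

# Gap between one class's upper limit and the next class's lower limit
# (1 for integer data); each boundary sits halfway across that gap.
gap = class_limits[1][0] - class_limits[0][1]


def describe_class(lower, upper, gap):
    """Return (lower boundary, upper boundary, class mark, class width)."""
    lower_boundary = lower - gap / 2
    upper_boundary = upper + gap / 2
    class_mark = (lower + upper) / 2          # midpoint of the class
    class_width = upper_boundary - lower_boundary
    return lower_boundary, upper_boundary, class_mark, class_width


for lower, upper in class_limits:
    lb, ub, mark, width = describe_class(lower, upper, gap)
    print(f"limits {lower}-{upper}: boundaries {lb}-{ub}, mark {mark}, width {width}")
```

For the first class this gives boundaries 69.5 and 89.5, class mark 79.5, and width 20, which equals the difference between consecutive lower limits (90 − 70).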


Frequency Tables

Following are guidelines for constructing frequency tables.

• The classes must be "mutually exclusive"—no element can belong to more than one class.

• Even if the frequency is zero, include each and every class.

• Make all classes the same width. (However, open ended classes may be inevitable.)

• Target between 5 and 20 classes, depending on the range and number of data points.

• Keep the limits as simple and as convenient as possible.


Frequency Tables

• Relative frequency tables contain the relative frequency instead of absolute frequency. Relative frequencies can be expressed either as percentages or their decimal fraction equivalents.

• Cumulative frequency tables contain frequencies which are cumulative for subsequent classes. In a cumulative frequency table, the words less than usually also appear in the left column.


Frequency Tables

The frequency distribution

• A frequency distribution is a more compact summary of data than a stem-and-leaf diagram.

• To construct a frequency distribution, we must divide the range of the data into intervals, which are usually called class intervals, cells, or bins.


• EXAMPLE: Construct the frequency distribution table for the following data:

105 221 183 186 121 181 180 143

97 154 153 174 120 168 167 141

245 228 174 199 181 158 176 110

163 131 154 115 160 208 158 133

207 180 190 193 194 133 156 123

134 178 76 167 184 135 229 146

218 157 101 171 165 172 158 169

199 151 142 163 145 171 148 158

160 175 149 87 160 237 150 135

196 201 200 176 150 170 118 149


THE FREQUENCY DISTRIBUTION TABLES

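• SOLUTION

• Class relative frequency: $f_i = n_i / n$, where $n_i$ is the frequency of class $i$ and $n$ is the total number of observations.

• Cumulative frequency: $F_i = F_{i-1} + f_i$, with $F_0 = 0$.

Class Interval   Frequency   Relative Frequency (%)   Cumulative Relative Frequency (%)
70 to 90              2              2.50                       2.50
90 to 110             3              3.75                       6.25
110 to 130            6              7.50                      13.75
130 to 150           14             17.50                      31.25
150 to 170           22             27.50                      58.75
170 to 190           17             21.25                      80.00
190 to 210           10             12.50                      92.50
210 to 230            4              5.00                      97.50
230 to 250            2              2.50                     100.00
Total                80            100.00                        -

Each class includes its lower limit and excludes its upper limit; for example, the value 150 is counted in the class 150 to 170.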
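The table can be recomputed directly from the data. This is a minimal sketch under two stated assumptions: each class includes its lower limit and excludes its upper limit, and the first value of the ninth data row is taken as 160 to agree with the stem-and-leaf display (it falls in the class 150 to 170 either way). Variable names are illustrative.

```python
# The 80 observations of the example (row 9 assumed to start with 160).
data = [
    105, 221, 183, 186, 121, 181, 180, 143,
    97, 154, 153, 174, 120, 168, 167, 141,
    245, 228, 174, 199, 181, 158, 176, 110,
    163, 131, 154, 115, 160, 208, 158, 133,
    207, 180, 190, 193, 194, 133, 156, 123,
    134, 178, 76, 167, 184, 135, 229, 146,
    218, 157, 101, 171, 165, 172, 158, 169,
    199, 151, 142, 163, 145, 171, 148, 158,
    160, 175, 149, 87, 160, 237, 150, 135,
    196, 201, 200, 176, 150, 170, 118, 149,
]

# Class edges 70, 90, ..., 250 give nine classes of width 20.
edges = list(range(70, 251, 20))
classes = list(zip(edges, edges[1:]))

# n_i: number of observations with lower <= x < upper.
counts = [sum(lower <= x < upper for x in data) for lower, upper in classes]

n = len(data)
rows = []
cumulative = 0.0  # F_0 = 0
for (lower, upper), count in zip(classes, counts):
    relative = 100 * count / n   # f_i, expressed in percent
    cumulative += relative       # F_i = F_{i-1} + f_i
    rows.append((lower, upper, count, relative, cumulative))
    print(f"{lower:>3} to {upper:<3} {count:>3} {relative:>6.2f} {cumulative:>7.2f}")
```

The last cumulative value must come out at 100%, which is a quick check that no observation was dropped or counted twice.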


Frequency Distribution Tables

Another example, showing the distribution of students by grade:

Grade              Frequency
 9 (freshmen)         29
10 (sophomores)       27
11 (juniors)          28
12 (seniors)          27