The Nature of Statistics: The art of learning about and understanding our world through data.

Upload: reynold-bryan

Post on 11-Jan-2016


The Nature of Statistics:

The art of learning about and understanding our world through data.

Essentials: The Nature of Statistics
(a.k.a.: The bare minimum I should take along from this topic.)

• Definitions and relationships as presented on the Anatomy of the Basics: Statistical Terms and Relationships sheet

• Identification of variables and their characteristics

• Careful review of data and their presentation

• Providing a context for the data

• Why use percentages rather than numeric counts when making comparisons

69.2 80 35,000

• What do you know about these numbers?

• What do they mean to you?

• What is missing?

Okay, so What is Statistics? (or is that What ARE Statistics?)

Statistics is the study of how to collect, organize, analyze, interpret and report numerical information in order to make decisions.

Statistics are the numeric data we use to better understand our world. They may take the form of frequencies, means, percentages, variances, etc.

What is a Study?

3 Types:

• Observational – observe and measure; can identify association, not causation.

• Experimentation – impose treatment and observe characteristics; can help establish causation.

• Simulation – using computers to simulate situations that are not practical to do in real time.

Basic Terminology

• DATA: Numbers with a context, i.e. numbers with meaning.
– Examples: not 48.2, but 48.2 kg; not 5.23, but 5.23 inches.

• VARIABLE: A characteristic or property of an individual population unit that varies from one person or thing to another.
– Examples: age, square footage, and assessed value represent three variables associated with homes in Oneonta.
– Variables have values. Example: The variable hair color has the values of brown, blonde, red, etc.

• UNIT (Element): Any individual member of the defined population.
– Examples: Each bottle of soda in a production run is a unit; each penny in a roll of pennies is a unit; each person enrolled in a class is a unit.

Data: One variable (here unidentified, i.e. no context), multiple values

[Figure: “Raw” data (N=160) shown beside the same data “organized” (N=160). Each entry is one unit; the data contain 73 “different” numbers.]

Time Period Otsego Lake was Frozen (days): Raw Data vs. Grouped Data

[Figure: histogram of the grouped data, titled “Otsego Lake: Days Frozen, 1849-50 to 2009-10”. Horizontal axis: Days, in classes 0-24, 25-49, 50-74, 75-99, 100-124, 125-149. Vertical axis: Frequency, 0 to 70.]
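Grouping raw values into classes, as the histogram does, can be sketched in a few lines of Python. The values in raw_days are HYPOTHETICAL and chosen only for illustration; they are not the Otsego Lake measurements. The class width of 25 days matches the classes on the slide, and the names raw_days and class_label are made up for this sketch.

```python
from collections import Counter

# HYPOTHETICAL raw values (days frozen); NOT the Otsego Lake data.
raw_days = [12, 48, 63, 77, 81, 88, 92, 95, 101, 104, 110, 126]

CLASS_WIDTH = 25  # classes 0-24, 25-49, ..., 125-149, as on the slide


def class_label(days: int) -> str:
    """Return the label of the class a value falls in, e.g. 88 -> '75-99'."""
    lower = (days // CLASS_WIDTH) * CLASS_WIDTH
    return f"{lower}-{lower + CLASS_WIDTH - 1}"


# Count how many raw values land in each class.
frequency = Counter(class_label(d) for d in raw_days)

# Print the grouped data in class order, including empty classes.
for lower in range(0, 150, CLASS_WIDTH):
    label = f"{lower}-{lower + CLASS_WIDTH - 1}"
    print(f"{label:>8}  {frequency[label]}")
```

The raw list answers "what was each value?"; the grouped table answers "where do most values fall?", which is the question a histogram is built to show.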

Data: Two Variables: year and days; multiple values

[Figure: Time Period Otsego Lake was Frozen: Mean Days/Decade]

So is the Greenhouse Effect at work here?

To be studied through further statistical analysis, such as the use of ANOVA…

Anatomy of the Basics: Statistical Terms and Relationships

Statistics is the study of how to collect, organize, analyze, interpret and report numerical information.

Descriptive Statistics: methods for organizing and summarizing information. E.g. Number of students in this class by major, baseball standings, housing sales by month.

Inferential Statistics: methods for drawing conclusions and measuring the reliability of those conclusions using sample results. E.g. Political views of all 4-year college students.

Population vs. Sample

Population: all individuals, items, or objects whose characteristics are being studied.

Parameter: numerical characteristic of a population.

Census: data collected from ALL members of the population.

Sample: a portion of the population selected for study.

Statistic: numerical characteristic of a sample.

Variable: a characteristic or property of an individual unit. Variables have values.

Qualitative: a variable that cannot be measured numerically. E.g. Gender, eye color.

Quantitative: a variable that can be measured numerically. E.g. Income, height, number of siblings one has.

Discrete: a variable whose values are countable. It can only assume certain values, with no intermediate values. E.g. Number of auto accidents in Oneonta in 1998.

Continuous: a variable that can assume any numerical value over an interval or intervals. E.g. Time.

Scaling of Variables (Measurement Levels)

No Arithmetic Operations: individual observations can only be categorized.

Nominal: grouping individual observations into qualitative categories or classes. E.g. Grouping individuals by whether they are left-handed or right-handed.

Ordinal: individual observations are assigned a number or “ranking.” There is a sense of “more than,” but you cannot say “how much” more than. E.g. Military ranks.

Arithmetic Operations: individual observations have meaningful numeric values.

Ratio: variables have a true zero point. Can say how much more and how many times more. E.g. Weight, height.

Interval: variables have no true zero point. Can say how much more, but not how many times more. E.g. Temperature (°F or °C), IQ scores.

Population Basic Terminology

• POPULATION:
– Complete collection of all elements or units (usually people, objects, transactions, or events) that we are interested in studying.

– In terms of data, a population is the collection of all outcomes, responses, measurements, or counts that are of interest.

– CENSUS: A complete enumeration (or accounting) of the population, i.e. collecting data from every element (or unit) in the population.

– PARAMETER: A numeric value associated with a population (e.g. the average height of ALL students in this class, given that the class has been defined as a population).

Sample Basic Terminology

• SAMPLE: A subset of a population from which information is collected.
– Example: 25 cans of corn (sample) randomly obtained from a full day’s production (population).

• STATISTIC: A numeric value associated with a sample.
– Example: the average height of 10 individuals randomly selected from the class (defined population).

• INFERENCE: An estimate, prediction, or some other generalization about a population based on information contained in a sample.
– Example: Based upon a randomly selected sample of 25 flights at JFK International Airport (the sample; individual flights are units) taken from all flights on Dec. 24, 2009 (defined population), we can state with a degree of confidence that the mean delay for the population of the day’s flights was 35 minutes (sample statistic in context being inferred to the population).

In Summary

To include ALL units, you are looking at:
• POPULATION
• CENSUS
• PARAMETERS

To work with a subset of all units, you are looking at:
• SAMPLE
• STATISTICS
• INFERENCES to a population

Parameter Population

Statistic Sample

Example: Identifying Data Sets

In a recent survey, 1708 adults in the United States were asked if they think global warming is a problem that requires immediate government action. Nine hundred thirty-nine of the adults said yes.

Describe the data set. Identify:
• The population:
• The sample:
• A variable being studied:
• Values of the variable:

Source: Adapted from Pew Research Center; Larson/Farber 4th ed.
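The survey counts above become a sample statistic once they are expressed as a proportion. A minimal sketch in Python, using only the two counts given in the example (the variable names are made up for illustration):

```python
# Counts taken from the survey example above.
sample_size = 1708  # adults surveyed (the sample)
said_yes = 939      # adults who answered "yes"

# The sample proportion is a statistic: it describes the sample only.
proportion_yes = said_yes / sample_size
percent_yes = round(100 * proportion_yes, 1)

print(f"{said_yes} of {sample_size} adults = {percent_yes}% said yes")
```

The percentage describes the 1708 adults who were asked. Saying anything about all adults in the United States is a separate, inferential step.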

Examples: Populations & Samples

• Smoking: Identify the population and the sample.
– In a survey, 250 college students at Union College were asked if they smoked cigarettes regularly. Thirty-five of the students said yes.

• Student Income: Decide whether the numerical value describes a population parameter or a sample statistic.
– A survey of 450 Cornell University students reported that their average weekly income from part-time employment was $325.

• For both of the above studies:
– What are the units of the population/sample?
– Identify a variable being studied.
– Identify values of the variable.

Descriptive Statistics:

• DESCRIPTIVE STATISTICS: Organize and summarize information using numerical and graphical methods.
– Examples:
  • Summarizing the age of cars driven by students in a frequency table.
  • Graphing the ages of students.
  • Identifying the mean speed of cars driving in a 30 mph zone.

• A descriptive statement describes some aspect of the data. (Select a statistical measure and put it into sentence format.)
– Examples:
  • Thirty-eight percent of the orange trees suffered damage due to the cold temperatures.
  • The average weight for the 23 cars studied was 2,738 lb.
  • The mean number of days Otsego Lake was frozen per winter was 88.69 days.

Descriptive Statistics at Work: SUNY Oneonta Car Registrations

Numeric tables, pictures (graphs & charts), and text are three methods used to present data. During 2006 there were 1,346 cars registered at SUNY Oneonta. Car registrations contain many variables, such as car type, car color, year of car, and license plate number. Noted below are ways descriptive statistics are used to convey information about the selected variables: a frequency table of Registrant Type (i.e. who registered the car); a graphic presentation of Vehicle Age; and text (a written descriptive statement) presenting the mean Vehicle Age of the registered cars.

Frequency Table: Graphic presentation (here a Histogram):

Mean & Median: The Mean age of cars driven by students was 7.45 years (vs. 6.19 yrs. for employees). The Median age of registered vehicles for students was 7.0 years (5.0 years for employees).

Registrant Type

Valid         Frequency   Percent   Valid Percent   Cumulative Percent
Commuter            512      38.0            38.0                 38.0
Faculty             223      16.6            16.6                 54.6
Management           13       1.0             1.0                 55.6
Other                58       4.3             4.3                 59.9
Resident            287      21.3            21.3                 81.2
Staff               253      18.8            18.8                100.0
Total              1346     100.0           100.0
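The Percent and Cumulative Percent columns follow directly from the Frequency column. The sketch below recomputes them from the six frequencies in the table; the variable names are made up for illustration, and rounding to one decimal place is assumed because that is how the table is printed.

```python
# Frequencies from the Registrant Type table (SUNY Oneonta, 2006).
frequencies = {
    "Commuter": 512,
    "Faculty": 223,
    "Management": 13,
    "Other": 58,
    "Resident": 287,
    "Staff": 253,
}

total = sum(frequencies.values())  # all registered cars

rows = []
running = 0  # running count, used for the cumulative column
for registrant_type, count in frequencies.items():
    running += count
    percent = round(100 * count / total, 1)
    cumulative = round(100 * running / total, 1)
    rows.append((registrant_type, count, percent, cumulative))

for name, count, pct, cum in rows:
    print(f"{name:<12}{count:>6}{pct:>8.1f}{cum:>8.1f}")
print(f"{'Total':<12}{total:>6}")
```

Note that the cumulative column is computed from the running count, not by adding up the already rounded percentages, which avoids accumulating rounding error.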

Inferential Statistics:

• INFERENTIAL STATISTICS uses sample data to make estimates, decisions, predictions, or other generalizations about the population.
– The aim of inferential statistics is to make an inference about a population, based on a sample (as opposed to a population census), AND to provide a measure of precision for the method used to make the inference.

• An inferential statement uses data from a sample and applies it to a population.

Examples of Inferential Statistics:

• A Gallup Poll found that 57% of dating teens had been out with somebody of another race or ethnic group (+/- 4.5%; 95% CI).
– Interpretation: We are 95% confident that between 52.5% and 61.5% (57% +/- 4.5%) of dating teens have been out with someone of a different race/ethnicity.

• A Gallup Poll found that 40% of Americans would quit their job if they won the lottery (+/- 4%; 95% CI).
– Interpretation: We are 95% confident that the true population proportion of Americans who would quit their job if they were to win a lottery lies between 36% and 44%.
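The interval in each interpretation is simply the sample percentage minus and plus the reported margin. A minimal sketch, assuming the margins are in percentage points as the slide's arithmetic implies (the function name interval is made up for illustration):

```python
def interval(estimate_pct: float, margin_pct: float) -> tuple[float, float]:
    """Return (lower, upper) bounds: estimate minus/plus the margin."""
    return (estimate_pct - margin_pct, estimate_pct + margin_pct)


# The two poll results quoted above.
dating_low, dating_high = interval(57, 4.5)
lottery_low, lottery_high = interval(40, 4)

print(f"Dating teens: {dating_low}% to {dating_high}%")
print(f"Quit job after lottery win: {lottery_low}% to {lottery_high}%")
```

This only reproduces the arithmetic. How the margin itself is obtained from the sample size and confidence level is a later topic.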

Example: Descriptive and Inferential Statistics

Decide which part of the study represents the descriptive branch of statistics. What conclusions might be drawn from the study using inferential statistics?

A large sample of men, aged 48, was studied for 18 years. For unmarried men, approximately 70% were alive at age 65. For married men, 90% were alive at age 65.

Source: The Journal of Family Issues; Larson/Farber 4th ed.

Two Types of Data

• Qualitative Data can be separated into different categories (values) that are distinguished by some nonnumeric characteristic. Qualitative data are also referred to as categorical or attribute data.
– Examples include gender, eye color, and car brands.
– Note that the values of this type of variable are differentiated by words rather than numeric values. Example: Eye Color values include blue, brown, hazel, etc.

Characteristics of Data

Before conducting any data analysis, the characteristics of the variable under study must be identified. This will result in utilizing appropriate tables, graphs and statistical analysis.

• Quantitative Data are “number-based” and represent counts or measurements. This type of data may be subdivided into two categories:

• Discrete Data - result when the number of possible values is either finite or countably infinite.
– Examples: Siblings, Cars, and Coins in a jar (think of whole number counts here, even if you cannot count them all).

• Continuous Data - result from infinitely many possible values corresponding to some continuous scale that covers a range of values without gaps, interruptions, or jumps. Continuous data can assume any value, including fractional parts.
– Examples: Height, Weight, Time.

N.B.: Qualitative data cannot be classified as discrete or continuous.

Example: Classifying Data by Type

The base prices of several vehicles are shown in the table. Which data are qualitative data and which are quantitative data? (Source: Ford Motor Company)

Source: Larson/Farber 4th ed.

4 Levels of Measurement

The level of measurement determines which statistical calculations are meaningful. The four levels of measurement, from lowest to highest, are: nominal, ordinal, interval, and ratio.

Levels of Measurement (cont.)

• Nominal – characterized by data that consist of names, labels, or categories only. The data cannot be arranged in an ordering scheme. Qualitative data.
– Examples: Gender, Yes/No, Political Party affiliation, names of students.

• Ordinal – characterized by data that can be arranged in some order, but the differences between data values either cannot be determined or are meaningless. These variables may be either qualitative (categorical) data or quantitative (numerical) data.
– Examples: Military Rank, Position in a race, Attitude scales.

Levels of Measurement (cont.)

• Interval – like the ordinal level, with the additional property that the difference between any two data values is meaningful. However, there is no natural zero starting point. Quantitative data.
– Examples: Temperature (°F or °C); longitude; Calendar Years.

• Ratio – is the interval level modified to include the natural zero starting point. At this level, differences and ratios are both meaningful. Quantitative data.
– Examples: Height, Weight, Time, Age.
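One way to see why ratios are not meaningful at the interval level is to change units and check whether the ratio survives. The sketch below uses made-up example values (20 °C vs. 10 °C, and 80 kg vs. 40 kg); the conversion factor for pounds is approximate, and all names are chosen for illustration.

```python
def c_to_f(celsius: float) -> float:
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


KG_TO_LB = 2.20462  # approximate pounds per kilogram

# Interval level: temperature (illustrative values, no true zero).
warm_c, cool_c = 20.0, 10.0
ratio_c = warm_c / cool_c                  # 2.0 in Celsius
ratio_f = c_to_f(warm_c) / c_to_f(cool_c)  # 68 / 50 = 1.36 in Fahrenheit
diff_c = warm_c - cool_c                   # 10 Celsius degrees
diff_f = c_to_f(warm_c) - c_to_f(cool_c)   # 18 Fahrenheit degrees

# Ratio level: weight (illustrative values, true zero).
heavy_kg, light_kg = 80.0, 40.0
ratio_kg = heavy_kg / light_kg
ratio_lb = (heavy_kg * KG_TO_LB) / (light_kg * KG_TO_LB)

print(f"Temperature ratio: {ratio_c} in C, {ratio_f:.2f} in F")
print(f"Temperature difference: {diff_c} C degrees = {diff_f} F degrees")
print(f"Weight ratio: {ratio_kg} in kg, {ratio_lb} in lb")
```

"Twice as warm" depends on the unit chosen, so it is not a meaningful statement, while the 10-degree difference is the same gap in either unit. "Twice as heavy" holds in kilograms and in pounds because weight has a true zero.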

Summary of Levels of Measurement

Level of measurement   Put data in categories   Arrange data in order   Subtract data values   Determine if one data value is a multiple of another
Nominal                Yes                      No                      No                     No
Ordinal                Yes                      Yes                     No                     No
Interval               Yes                      Yes                     Yes                    No
Ratio                  Yes                      Yes                     Yes                    Yes
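The summary has a staircase pattern: each level permits everything the level below it permits, plus one more operation. A small sketch of that pattern as a lookup, where the short operation names and the function is_meaningful are made up for illustration:

```python
# Levels from lowest to highest, and operations in the order each becomes meaningful.
LEVELS = ("nominal", "ordinal", "interval", "ratio")
OPERATIONS = ("categorize", "order", "subtract", "multiple")


def is_meaningful(level: str, operation: str) -> bool:
    """True if the operation is meaningful at the given level of measurement."""
    return OPERATIONS.index(operation) <= LEVELS.index(level)


# Rebuild the Yes/No summary row by row.
for level in LEVELS:
    answers = ["Yes" if is_meaningful(level, op) else "No" for op in OPERATIONS]
    print(f"{level:<10}" + "  ".join(f"{a:<3}" for a in answers))
```

For example, is_meaningful("interval", "subtract") is True while is_meaningful("interval", "multiple") is False.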

Example: Classifying Data by Level

Two data sets are shown. Which data set consists of data at the nominal level? Which data set consists of data at the ordinal level? (Source: Nielsen Media Research)

Source: Larson/Farber 4th ed.

Example: Classifying Data by Level

Two data sets are shown. Which data set consists of data at the interval level? Which data set consists of data at the ratio level? (Source: Major League Baseball)

Source: Larson/Farber 4th ed.


Misuse of Statistics

Ah yes… the old “torture the data long enough and they will confess to anything” routine...

• Precise Numbers: Tonight’s paid attendance was 56,423.

• Guesstimates: It was estimated that one million spectators lined the road to L’Alpe d’Huez for the 16th stage of the 2004 Tour de France race.

• Distorted Percentages: “New and improved with 50% more ...” – 50% might not be a meaningful amount.

• Partial Pictures: Ford truck ads.

• Loaded Questions: Line item veto.

• Misleading Graphs: Visual distortions of data.

• Pictographs: The crescive cow.

• Pollster Pressure: Public bathrooms.

• Small/Bad Samples: 67% suspended.

• Self-Selected Surveys: CNN phone-in surveys.

Pictograph: “This year my business profits doubled!”

Visual Presentations of Data – Beware

Source: http://findarticles.com

Data Considerations

• Anecdotal Evidence – basing our conclusions on a few individual cases. E.g. we remember the airplane crash that kills several hundred people and fail to notice that data for all flights show that flying is much safer than driving.

• Lurking Variables – almost all relationships between two variables are influenced by other variables lurking in the background.

Airline Flights: Alaska Airlines vs. America West

Which would you choose to fly?

                  On Time        Delayed
Alaska Airlines   3274 (86.7%)   501 (13.3%)
America West      6438 (89.1%)   787 (10.9%)

Alaska Airlines vs. America West: A Closer Look

                     Alaska Air           America West
Departure Location   On Time   Delayed    On Time   Delayed
Los Angeles              497        62        694       117
Phoenix                  221        12       4840       415
San Diego                212        20        383        65
San Francisco            503       102        320       129
Seattle                 1841       305        201        61
TOTAL                   3274       501       6438       787

We now know that America West has a better overall “On Time” record, but Alaska Airlines has a better “On Time” record at every airport. How can that be?

                     Alaska Air                   America West
Departure Location   On Time        Delayed       On Time        Delayed
Los Angeles           497 (88.9%)    62 (11.1%)    694 (85.6%)   117 (14.4%)
Phoenix               221 (94.8%)    12 (5.2%)    4840 (92.1%)   415 (7.9%)
San Diego             212 (91.4%)    20 (8.6%)     383 (85.5%)    65 (14.5%)
San Francisco         503 (83.1%)   102 (16.9%)    320 (71.3%)   129 (28.7%)
Seattle              1841 (85.8%)   305 (14.2%)    201 (76.7%)    61 (23.3%)
TOTAL                3274 (86.7%)   501 (13.3%)   6438 (89.1%)   787 (10.9%)
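The reversal can be checked by recomputing the percentages from the counts and then looking at where each airline flies. In the sketch below the counts are the ones in the tables; the function and variable names are made up for illustration. The lurking variable is the departure airport: roughly 73% of America West's flights leave from Phoenix, where delays are rare for both airlines, while roughly 57% of Alaska's leave from Seattle, where delays are more common.

```python
# (on_time, delayed) counts per departure airport, from the tables above.
flights = {
    "Alaska Air": {
        "Los Angeles": (497, 62),
        "Phoenix": (221, 12),
        "San Diego": (212, 20),
        "San Francisco": (503, 102),
        "Seattle": (1841, 305),
    },
    "America West": {
        "Los Angeles": (694, 117),
        "Phoenix": (4840, 415),
        "San Diego": (383, 65),
        "San Francisco": (320, 129),
        "Seattle": (201, 61),
    },
}


def on_time_pct(on_time: int, delayed: int) -> float:
    """Percent of flights on time, rounded to one decimal place."""
    return round(100 * on_time / (on_time + delayed), 1)


def overall_pct(airline: str) -> float:
    """On-time percent over all airports combined."""
    on_time = sum(o for o, _ in flights[airline].values())
    delayed = sum(d for _, d in flights[airline].values())
    return on_time_pct(on_time, delayed)


def airport_share(airline: str, airport: str) -> float:
    """Percent of the airline's flights that depart from the given airport."""
    total = sum(o + d for o, d in flights[airline].values())
    return round(100 * sum(flights[airline][airport]) / total, 1)


for airport in flights["Alaska Air"]:
    alaska = on_time_pct(*flights["Alaska Air"][airport])
    west = on_time_pct(*flights["America West"][airport])
    print(f"{airport:<14} Alaska {alaska}%  America West {west}%")

for airline in flights:
    print(f"{airline:<14} overall {overall_pct(airline)}%")

print("America West share from Phoenix:", airport_share("America West", "Phoenix"))
print("Alaska Air share from Seattle:", airport_share("Alaska Air", "Seattle"))
```

This is also a reminder of why the comparison has to be made with percentages within each airport rather than with raw counts or a single combined total.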

End of Slides