Post on 18-Dec-2015

Statistics [0,I/2] The Essential Mathematics

Two Forms of Statistics

•Descriptive Statistics

•What is physically happening within the data?

•Inferential Statistics

•What can I glean from a sample that is pertinent to the population?

Descriptive Statistics

•Measures of Center

•mean, median, mode

•Measures of Spread

•variance, standard deviation, range, IQR, outliers

•Measures of Shape

•kurtosis, skewness

Exploratory Analysis

•The measures of center, spread, and shape listed above are the tools of exploratory analysis

Measures of Center

•The value we expect to see when an observation is drawn at random in a given situation

•Traditionally, we see that as the mean, but that can also be the median or the mode in certain contexts

Situation

•You are interested in the body mass of full-grown adults of one gender.

•If you were to find one person from that gender at random, what would you expect that person to weigh?

Mean

•Four types of means

•Arithmetic mean (typical interpretation)

•Geometric mean

•Harmonic mean (most conservative)

•Quadratic mean (pooling operation)
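A minimal sketch of the four means, using Python's standard `statistics` module for the first three and computing the quadratic mean (root mean square) by hand. The sample values are hypothetical and chosen only for illustration.

```python
import math
import statistics

# Hypothetical positive observations (geometric and harmonic means need positive data).
data = [1.0, 2.0, 4.0]

arithmetic = statistics.mean(data)            # add them up, divide by the count
geometric = statistics.geometric_mean(data)   # n-th root of the product
harmonic = statistics.harmonic_mean(data)     # reciprocal of the mean reciprocal
quadratic = math.sqrt(statistics.mean([x * x for x in data]))  # root mean square

for name, value in [("harmonic", harmonic), ("geometric", geometric),
                    ("arithmetic", arithmetic), ("quadratic", quadratic)]:
    print(f"{name:>10}: {value:.4f}")
```

For positive data that are not all equal, the four always fall in this order (harmonic smallest, quadratic largest), which is one way to read "most conservative" for the harmonic mean.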

Arithmetic Mean

•Unbiased estimator for the population mean

•When should I be concerned with the mean?

•Data should be symmetric

•I am as likely to see a relatively large value as a relatively small one

•Typically, the first thing to look at

Arithmetic Mean

•Add them up, divide by the number of them

Symmetric without a picture?

•Line the data up in order, from maximum to minimum

•Find the one in the middle (the median)

•Subtract the minimum from the middle and subtract the middle from the maximum

•Are those two values roughly equal? If so, the data are plausibly symmetric

•Skewness (we’ll see that later)
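The check described above can be sketched in a few lines of Python. The function name `symmetry_gap` and both data sets are hypothetical; this is a rough screen that looks at only three points, not a formal test.

```python
import statistics

def symmetry_gap(values):
    """Return (middle - minimum, maximum - middle) for a list of numbers."""
    ordered = sorted(values)
    middle = statistics.median(ordered)
    return middle - ordered[0], ordered[-1] - middle

# Hypothetical data sets for illustration.
symmetric = [2, 4, 6, 8, 10]
lopsided = [2, 3, 4, 5, 30]

print(symmetry_gap(symmetric))  # the two gaps match: consistent with symmetry
print(symmetry_gap(lopsided))   # the upper gap is much larger: a long right tail
```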

Situation

•You are interested in the economic conditions of a country (say the United States).

•If you were to select a household at random from the United States, how much money do you expect that household makes?

Median

•The exact middle observation of a set of data

•This equals the mean when a set is symmetric

•When a set is asymmetric, the two differ

•Not responsive to extreme values (outliers)

•The stoic of statistics

Median or Mean?

•Find the mean and the median

•How close are they?

•If they are “close”, use the mean

•If they are not close, typically use the median (this indicates skew)
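A hedged sketch of this decision rule follows. The slides do not define "close", so the `tolerance` parameter (the mean-median gap as a fraction of the standard deviation) is an assumption, as are the function name and the income figures.

```python
import statistics

def choose_center(values, tolerance=0.1):
    """Pick the mean when mean and median are "close", otherwise the median.

    `tolerance` is a hypothetical cutoff: the allowed gap between mean and
    median, expressed as a fraction of the sample standard deviation.
    """
    mean = statistics.mean(values)
    median = statistics.median(values)
    spread = statistics.stdev(values)
    if spread == 0 or abs(mean - median) <= tolerance * spread:
        return "mean", mean
    return "median", median

# Hypothetical household incomes (in thousands) with one very large value.
incomes = [32, 41, 45, 52, 58, 63, 70, 950]
print(choose_center(incomes))          # the large value drags the mean upward
print(choose_center([1, 2, 3, 4, 5]))  # symmetric data: mean and median agree
```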

Situation

•You are an artificial intelligence programmer and are interested in how to assign algorithms for random occurrences in a football game that result in scores.

•What is the expected score that happens on that play?

Weird Scenario...

• Football has a few ways of scoring, but we know what the set is going to be composed of:

• Touchdown (typical): 7

• Touchdown (2 pt. conversion): 8

• Touchdown (failed conversion): 6

• Field Goal: 3

• Safety: 2

• The “average score” on a play in football is probably somewhere between 4.5 and 5

• We should, however, expect the score to be either 3 or 7

Mode

•The mode is the most common observation in a dataset

•Sparingly used, but can be important

•If observations recur, why is that happening?
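To make the football scenario concrete, here is a small sketch comparing the mean with the mode. The list of scoring plays is hypothetical (it is not real game data); only the point values come from the slide above.

```python
import statistics

# Hypothetical scoring plays, using the point values listed earlier.
scores = [7, 7, 7, 7, 3, 3, 3, 3, 6, 2]

average = statistics.mean(scores)     # a value no single play can produce
modes = statistics.multimode(scores)  # the values that actually recur

print("mean :", average)
print("modes:", modes)
```

With these made-up plays the mean lands between 4.5 and 5, yet the only sensible predictions for a single play are the two modes, 7 and 3.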

Questions:

•Which of the three makes sense based on my understanding of what should happen?

• Should this data be inherently symmetric?

• Should this data be pulled one way or the other?

• Should this data be predisposed to particular values?

• Answer these questions before you see it!

Measures of Spread

•What is the variation found within my data?

•Many different ways of looking at this (based on your choice of mean or median):

•Standard deviation/variance for mean

•Range/IQR for median

Variance

•Built from “residual error”: how far each observation falls from the mean

•Find the mean

•Take each observation and subtract the mean from it

•Square each value

•Add them up

•Divide by n-1
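The steps above translate almost line for line into Python. This is a minimal sketch with hypothetical data; the function name `sample_variance` is an illustration, and the result is compared against the standard library's `statistics.variance`.

```python
import statistics

def sample_variance(values):
    mean = sum(values) / len(values)         # find the mean
    residuals = [x - mean for x in values]   # subtract the mean from each observation
    squared = [r * r for r in residuals]     # square each value
    return sum(squared) / (len(values) - 1)  # add them up, divide by n-1

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

print(sample_variance(data))
print(statistics.variance(data))  # library version of the same calculation
```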

Variance

• If a set is “tight” to its mean, its variance will be low (a sharply peaked shape will be called leptokurtic later)

• If a set is “broad” to its mean, its variance will be higher (a flatter shape will be called platykurtic later)

• Remember: the larger a residual, the greater the impact of squaring it

• 5² = 25; 10² = 100, a factor of 4 when the residual doubled

Why square it?

• If we didn’t, the residuals would always add up to 0, rendering the statistic meaningless!

• Why?

• Squaring lets us see spread by making negative values positive and by giving more weight to more distant observations
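One way to answer the "Why?" is to add up the raw residuals. The mean is the balance point of the data, so the positive and negative residuals cancel. The sketch below uses hypothetical data.

```python
data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations
mean = sum(data) / len(data)

residuals = [x - mean for x in data]
raw_total = sum(residuals)                       # positives and negatives cancel
squared_total = sum(r * r for r in residuals)    # squaring removes the signs

print("sum of residuals        :", raw_total)
print("sum of squared residuals:", squared_total)
```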

Why n-1?

•Degrees of freedom

•Makes us more conservative

•Dividing by a larger number reduces the value; dividing by a smaller number (n-1 rather than n) assumes a wider spread

•We don’t have the whole population, so we lean conservative
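The effect of the divisor can be seen by running both versions on the same hypothetical data: `statistics.pvariance` divides by n, while `statistics.variance` divides by n-1.

```python
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

divide_by_n = statistics.pvariance(sample)         # treats the data as the whole population
divide_by_n_minus_1 = statistics.variance(sample)  # treats the data as a sample

print("divide by n  :", divide_by_n)
print("divide by n-1:", divide_by_n_minus_1)
```

The n-1 version is always the larger of the two, and the gap shrinks as the sample grows.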

Standard Deviation

• Undoes the squaring procedure (take the square root of the variance)

• Gives us the “average” distance between an observation and the mean

• If variance is high, standard deviation will be high; if low, standard deviation will be low

• Great metric for “how far” questions, as it lets us normalize observations (distance from the mean measured in standard deviations)
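A short sketch of the "how far" idea: take the square root of the variance, then express an observation's distance from the mean in standard deviations. The data and the helper name `distance_in_sds` are hypothetical.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

mean = statistics.mean(data)
variance = statistics.variance(data)
std_dev = math.sqrt(variance)  # same result as statistics.stdev(data)

def distance_in_sds(x):
    """How far x lies from the mean, measured in standard deviations."""
    return (x - mean) / std_dev

print("standard deviation:", round(std_dev, 3))
print("9 is", round(distance_in_sds(9), 2), "standard deviations above the mean")
```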

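Range and IQR

• In the case of the median, percentile observations are the focus

•Minimum, maximum

•25th and 75th percentiles

•Range = maximum - minimum

• IQR = 75th percentile - 25th percentile

• IQR defines outliers (a common rule flags points more than 1.5 × IQR beyond the 25th or 75th percentile)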
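A minimal sketch of range, IQR, and outlier flagging with the standard library. The data are hypothetical, and the 1.5 × IQR fences are the common convention rather than something the slides specify. Quartile conventions differ between software packages, so the exact cut points may vary slightly.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]  # hypothetical observations

data_range = max(data) - min(data)

# quantiles(..., n=4) returns the three cut points: 25%, 50%, 75%.
q1, _median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Assumed convention: anything beyond 1.5 * IQR from the quartiles is an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("range:", data_range)
print("IQR  :", iqr)
print("outliers:", outliers)
```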

Skewness

•Is a distribution symmetric or biased?

•The sign of skewness typically follows the relationship between the mean and the median

•Mean > median --> positive skew

•Mean < median --> negative skew
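The mean-versus-median rule can be checked against a moment-based skewness. The formula below (third central moment divided by the second central moment to the power 1.5) is one common definition, assumed here since the slides do not give one; the data are hypothetical.

```python
import statistics

def skewness(values):
    """Moment-based skewness: third central moment over the second to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]  # hypothetical data with a long right tail

print("mean    :", statistics.mean(right_tailed))
print("median  :", statistics.median(right_tailed))
print("skewness:", round(skewness(right_tailed), 3))  # positive, matching mean > median
```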

Reasons for left skew

•A test or task was too easy

•Ever taken an exam where nearly everyone got a great grade, but someone struggled? That’s left skew...

Reasons for right skew

•A variable naturally has a left bound (a lower limit, such as zero)

•Time-based data

•Economics

Right tail transform

•Right tail skews are typically transformed using logarithms or square roots

•Why?
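One possible answer to the "Why?": a logarithm compresses large values far more than small ones, which pulls in a long right tail. The sketch below uses hypothetical data that double at each step, so the effect is especially clean; the skewness formula is the moment-based one, an assumption on my part.

```python
import math

def skewness(values):
    """Moment-based skewness: third central moment over the second to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

raw = [1, 2, 4, 8, 16, 32, 64, 128, 256]  # hypothetical right-skewed data
logged = [math.log(x) for x in raw]       # evenly spaced after the transform

print("skewness before log:", round(skewness(raw), 3))
print("skewness after log :", round(skewness(logged), 3))
```

Real data will rarely become perfectly symmetric, but the direction of the change is the point.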

Kurtosis

•Is data predisposed to a particular central occurrence?

•Can’t be less than 1 (-2 in excess kurtosis, which subtracts 3)

•Positive values of excess kurtosis reflect high peaks (predisposition)

•Negative values of excess kurtosis reflect flatter peaks
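A small sketch of kurtosis as the fourth central moment divided by the squared second central moment, with excess kurtosis obtained by subtracting 3. The definition is the standard moment-based one; both data sets are hypothetical and chosen so the arithmetic is easy to follow.

```python
def kurtosis(values):
    """Moment-based kurtosis: fourth central moment over the squared second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2

two_point = [-1, 1, -1, 1]           # all mass on two values: the minimum-kurtosis case
peaked = [0, 0, 0, 0, 0, 0, -3, 3]   # mostly at the center, with rare extremes

for name, values in [("two_point", two_point), ("peaked", peaked)]:
    k = kurtosis(values)
    print(f"{name}: kurtosis = {k}, excess = {k - 3}")
```

The two-point set hits the lower bound of 1 (excess -2), while the set concentrated at its center has positive excess kurtosis.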

Assignment

• You will be provided a dataset that comes from a questionnaire about ecological values (New Ecological Paradigm).

• You will be shown all of the values mentioned in this slide set and a bar graph of the responses

• Determine the appropriate measure of central tendency.

• Determine whether or not you feel there are effects such as biasing or predisposition occurring.

• Remember: gut instincts...do not do any tests!

NEP

•For your reference:

•High values on odd questions favor human endeavors

•High values on even questions favor the environment
