I. Introduction to Data and Statistics

A. Basic terms and concepts

Data set

- variable

- observation

- data value

[Figure: example data set for the Central Gulf States (TX, LA, MS, AL), with variables including age > 65, age < 19, and rent ($); the individual cell values are not recoverable]
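The three terms above can be illustrated with a small sketch. The variable names (`pct_over_65`, `pct_under_19`, `rent_usd`) and every number below are hypothetical placeholders chosen for illustration; they are not the values from the original figure.

```python
# Hypothetical values for illustration only; they are NOT the numbers from the slide.
data_set = {
    "TX": {"pct_over_65": 10.1, "pct_under_19": 31.0, "rent_usd": 520},
    "LA": {"pct_over_65": 11.2, "pct_under_19": 30.5, "rent_usd": 450},
    "MS": {"pct_over_65": 12.0, "pct_under_19": 30.8, "rent_usd": 410},
    "AL": {"pct_over_65": 12.9, "pct_under_19": 28.7, "rent_usd": 430},
}

observations = list(data_set)            # one observation per state
variables = list(data_set["TX"])         # characteristics recorded for every observation
data_value = data_set["LA"]["rent_usd"]  # one cell: a single variable for a single observation

print(observations)
print(variables)
print(data_value)
```

In this layout each state is an observation, each named characteristic is a variable, and any single cell is a data value.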

B. Primary and Secondary data

1. Primary data

- original data

- collected for a specific purpose

- sample design and procedures

- time and $

2. Secondary data

- archival data

- agency or organization

- organized in a set format

- time and $

- data quality can be an issue

- sample design

C. Individual and spatially aggregated data

[Figure: four individual units (State 1, State 2, State 3, State 4) spatially aggregated into a single Region]
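A minimal sketch of spatial aggregation, assuming made-up counts for the four states (the figure supplies no values). It shows that once the states are combined into a region, the differences between them are no longer visible.

```python
# Hypothetical state-level counts (the source figure gives none).
state_population = {"State 1": 120, "State 2": 80, "State 3": 200, "State 4": 100}

# Spatial aggregation: four individual values collapse into one regional value.
region_population = sum(state_population.values())

# Variation among the states exists in the individual data...
state_range = max(state_population.values()) - min(state_population.values())

# ...but cannot be recovered from the single aggregated number.
print(region_population, state_range)
```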

D. Discrete and Continuous data

1. Discrete

2. Continuous

E. Qualitative and Quantitative data

1. Qualitative (categorical)

Ex: land cover, sex, political party, race

2. Quantitative

Ex: population, precipitation, grades

II. Scales of Measurement

A. Nominal

B. Ordinal

C. Interval

D. Ratio

- comparisons must use the same scale of measurement

A. Nominal

- Mutually exclusive

- Exhaustive

Ex:

Name: George = 1, Wanda = 2, Bob = 3

Land Cover: Forested = 45, urban = 39, etc...

Climate regimes: polar = 1, temperate = 2, tropical = 3

Sex: Male = 1, Female = 2

B. Ordinal

- ranked data

- arbitrary

- comparisons

- not a set interval between rankings

Ex:

Places rated (cities, beaches…)

Level of satisfaction (poor, ok, good)

C. Interval

- separated by absolute differences

- does not have an absolute zero

Ex:

- temperature

- elevation

D. Ratio

- separated by absolute differences

- absolute zero

Ex:

- precipitation

- tree growth

- income
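The practical difference between interval and ratio data can be checked with a little arithmetic. The sketch below uses temperature (interval) and precipitation (ratio) from the examples above; the specific readings are hypothetical. Without an absolute zero, the ratio of two temperatures changes with the unit, while the ratio of two precipitation totals does not.

```python
import math


def celsius_to_fahrenheit(c: float) -> float:
    """Convert a Celsius reading to Fahrenheit."""
    return c * 9 / 5 + 32


# Interval scale: hypothetical temperature readings.
c_low, c_high = 10.0, 20.0
f_low, f_high = celsius_to_fahrenheit(c_low), celsius_to_fahrenheit(c_high)
ratio_celsius = c_high / c_low      # 2.0
ratio_fahrenheit = f_high / f_low   # 68 / 50, no longer 2.0

# Ratio scale: hypothetical precipitation totals, in millimetres and inches.
mm_low, mm_high = 10.0, 20.0
in_low, in_high = mm_low / 25.4, mm_high / 25.4
ratio_mm = mm_high / mm_low
ratio_inches = in_high / in_low

print(ratio_celsius, ratio_fahrenheit)  # the two ratios disagree
print(math.isclose(ratio_mm, ratio_inches))  # the two ratios agree
```

So "twice as much" is a meaningful statement for precipitation but not for temperature measured in Celsius or Fahrenheit.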

III. Graphing procedures (univariate)

A. Frequency histogram

B. Cumulative frequency histogram

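A. Frequency histogram

[Figure: frequency histogram with a frequency polygon overlaid; y-axis: frequency (# or %); x-axis: the variable (e.g. income, grades), running from low (-) to high (+)]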

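B. Cumulative frequency histogram

[Figure: cumulative frequency histogram with a cumulative frequency polygon overlaid; y-axis: cumulative frequency (# or %), rising to 100; x-axis: the variable, running from low (-) to high (+)]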
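The counts behind both histograms can be sketched as follows. The grades and the class width of 10 are hypothetical choices made for illustration; the cumulative frequencies are simply the running total of the class frequencies.

```python
from itertools import accumulate

# Hypothetical grades; the class width of 10 is an arbitrary choice.
grades = [55, 62, 67, 71, 74, 78, 78, 83, 88, 95]
bin_width = 10
lower_edges = list(range(50, 100, bin_width))  # classes 50-59, 60-69, ..., 90-99

# Frequency (#): how many grades fall into each class.
freq = [sum(lo <= g < lo + bin_width for g in grades) for lo in lower_edges]

# Frequency (%): the same counts as a share of all observations.
pct = [100 * f / len(grades) for f in freq]

# Cumulative frequency: running total of the class frequencies.
cum_freq = list(accumulate(freq))

for lo, f, p, c in zip(lower_edges, freq, pct, cum_freq):
    print(f"{lo}-{lo + bin_width - 1}: freq={f} pct={p:.0f} cumulative={c}")
```

The last cumulative value always equals the number of observations (or 100 when expressed as a percentage).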

IV. Descriptive Statistics (univariate)

- summary of data characteristics

- inferential; extend sample to a larger population

A. Measures of Central Tendency

B. Measures of Dispersion

C. Measures of Shape

A. Measures of Central Tendency

• attempt to define the most typical value of a larger data set

1. Mode

2. Median

3. Mean (average)

1. Mode

• value that occurs most frequently

• only measure of central tendency appropriate for nominal level data

• works better for grouped data, not raw values

• many data sets will not have two identical data values

2. Median

• the middle value from a set of ranked observations

• equal number of observations on either side

• appropriate when data is heavily skewed

• ordinal, interval, or ratio level data, not nominal

3. Mean (average): $\bar{X} = \sum x_i / n$

• most commonly used measure of central tendency

• interval or ratio level data

• sensitive to outliers

• most easily understood

• assumptions:

  • unimodal

  • symmetric distribution

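[Figure: normal distribution on an axis from 0 to 100; mode, median, and mean coincide at 50]

[Figure: a second distribution on the same 0 to 100 axis, with mode, median, and mean marked]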
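The three measures can be compared on one small sample using Python's standard `statistics` module. The values are hypothetical and include one deliberate outlier, to show why the mean is described as sensitive to outliers while the median is preferred for skewed data.

```python
import statistics

# Hypothetical sample with one large outlier (25).
values = [2, 3, 3, 4, 5, 7, 25]

mode = statistics.mode(values)      # most frequent value
median = statistics.median(values)  # middle value of the ranked observations
mean = statistics.mean(values)      # sum of the values divided by n

print(mode, median, mean)  # 3 4 7

# The mode is the only one of the three that works for nominal categories.
land_cover = ["forest", "urban", "forest", "water"]
print(statistics.mode(land_cover))  # forest
```

Here the single outlier pulls the mean above six of the seven observations, while the median stays near the bulk of the data.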

B. Measures of Dispersion

• provide information about the distribution of data

1. Range

2. Standard deviation

3. Coefficient of variation

1. Range: difference between the largest and smallest value

• simplest measure of dispersion

• easy to calculate

• can be misleading

  • ignores all other values

  • does not take into account clustering of data

2. Standard deviation

• the average deviation of each value from the mean

• based on the mean

• better indicator of the dispersion of the entire sample (in comparison to the range)

• scale dependent value

3. Coefficient of variation

• standard deviation / mean

• allows you to compare dispersion independent of scale

• should be used to make comparisons where there are differences in mean

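[Figure: distribution on an axis from 0 to 100, with values spanning 15 to 85 and mean $\bar{X} = 50$]

Range: 85 - 15 = 70

Std. dev.: built from the deviations $x_i - \bar{X}$, with $\bar{X} = 50$

C.V. = Std. dev. / mean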
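A short sketch of the three dispersion measures. The sample is hypothetical, chosen so that it matches the figure's range (15 to 85) and mean (50). The outline does not say whether the population or sample form of the standard deviation is intended; the population form (`statistics.pstdev`) is used here as an assumption.

```python
import statistics

# Hypothetical sample chosen to match the figure: smallest 15, largest 85, mean 50.
values = [15, 35, 45, 50, 55, 65, 85]

value_range = max(values) - min(values)  # 85 - 15 = 70
mean = statistics.mean(values)           # 50
std_dev = statistics.pstdev(values)      # population form; an assumption, see above
cv = std_dev / mean                      # coefficient of variation

# Rescaling the data (e.g. changing units) multiplies the standard deviation
# by the same factor but leaves the coefficient of variation unchanged.
scaled = [10 * v for v in values]
scaled_std_dev = statistics.pstdev(scaled)
scaled_cv = scaled_std_dev / statistics.mean(scaled)

print(value_range, mean)
print(round(std_dev, 2), round(scaled_std_dev, 2))
print(round(cv, 3), round(scaled_cv, 3))
```

This is why the standard deviation is called scale dependent, and why the coefficient of variation is the better choice when comparing data sets with different means.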

C. Measures of Shape

1. Skewness

2. Kurtosis

[Figure: kurtosis panels: leptokurtic, mesokurtic, platykurtic]

[Figure: skewness panels: (-) skew, symmetrical (bell shaped), (+) skew]
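The outline names the shape measures without giving formulas. As one common convention (an assumption here, not something stated in the source), skewness and excess kurtosis can be computed from central moments: skewness is $m_3 / m_2^{3/2}$ and excess kurtosis is $m_4 / m_2^2 - 3$, so a normal (mesokurtic) curve scores about 0, a leptokurtic one above 0, and a platykurtic one below 0. The helper names and sample values are hypothetical.

```python
def central_moment(values, k):
    """k-th central moment: the mean of (x - mean) ** k."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)


def skewness(values):
    """Moment-based skewness: negative = (-) skew, positive = (+) skew."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so that a normal curve scores about 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


symmetric = [1, 2, 3, 4, 5]          # flat and symmetric
right_tail = [1, 1, 2, 2, 3, 10]     # long tail toward (+)
left_tail = [-x for x in right_tail]  # mirror image: long tail toward (-)
peaked = [0, 5, 5, 5, 5, 5, 10]      # sharp peak with a few extreme values

print(skewness(symmetric), skewness(right_tail), skewness(left_tail))
print(excess_kurtosis(symmetric), excess_kurtosis(peaked))
```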

Mean Center

I.D.   Xi    Yi
A      2.8   1.5
B      1.6   3.8
C      3.5   3.3
D      4.4   2.0
E      4.3   1.1
F      5.2   2.4
G      4.9   3.5

[Figure: scatter plot of points A through G (X axis from 0 to 6, Y axis from 0 to 4), shown first without and then with the mean center marked]

Mean Center (3.81, 2.51)
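The mean center is the mean of the X coordinates paired with the mean of the Y coordinates. A minimal sketch using the seven points listed above:

```python
# Coordinates of points A through G, as listed in the table.
points = {
    "A": (2.8, 1.5),
    "B": (1.6, 3.8),
    "C": (3.5, 3.3),
    "D": (4.4, 2.0),
    "E": (4.3, 1.1),
    "F": (5.2, 2.4),
    "G": (4.9, 3.5),
}

n = len(points)
mean_x = sum(x for x, _ in points.values()) / n  # 26.7 / 7
mean_y = sum(y for _, y in points.values()) / n  # 17.6 / 7

print(f"Mean center: ({mean_x:.2f}, {mean_y:.2f})")  # Mean center: (3.81, 2.51)
```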

Weighted Mean Center

I.D.   Xi    Yi    f (w)
A      2.8   1.5   5
B      1.6   3.8   20
C      3.5   3.3   8
D      4.4   2.0   4
E      4.3   1.1   6
F      5.2   2.4   5
G      4.9   3.5   3

[Figure: scatter plot of points A through G labeled with their weights: A (5), B (20), C (8), D (4), E (6), F (5), G (3)]

I.D.   Xi    Yi    f (w)   wXi    wYi
A      2.8   1.5   5       14     7.5
B      1.6   3.8   20      32     76
C      3.5   3.3   8       28     26.4
D      4.4   2.0   4       17.6   8.0
E      4.3   1.1   6       25.8   6.6
F      5.2   2.4   5       26     12
G      4.9   3.5   3       14.7   10.5

[Figure: the same weighted scatter plot of points A through G, with the weighted mean center marked]

Weighted Mean Center (3.10, 2.88)
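The weighted mean center divides the sum of the weighted coordinates by the sum of the weights. A minimal sketch using the coordinates and weights listed above:

```python
# (Xi, Yi, weight) for points A through G, as listed in the table.
points = {
    "A": (2.8, 1.5, 5),
    "B": (1.6, 3.8, 20),
    "C": (3.5, 3.3, 8),
    "D": (4.4, 2.0, 4),
    "E": (4.3, 1.1, 6),
    "F": (5.2, 2.4, 5),
    "G": (4.9, 3.5, 3),
}

total_weight = sum(w for _, _, w in points.values())  # 51
sum_wx = sum(w * x for x, _, w in points.values())    # about 158.1
sum_wy = sum(w * y for _, y, w in points.values())    # about 147.0

weighted_x = sum_wx / total_weight
weighted_y = sum_wy / total_weight

print(f"Weighted mean center: ({weighted_x:.2f}, {weighted_y:.2f})")
# Weighted mean center: (3.10, 2.88)
```

Compared with the unweighted mean center (3.81, 2.51), the result shifts toward point B, which carries by far the largest weight (20).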

Correlation

- Bivariate relationship

1. Direction: negative or positive

2. Strength of relationship: perfect, strong, weak, none

Scattergrams

[Figure panels: scattergrams, each plotted on axes running from (-) to (+)]

- Positive (direct) correlation

- Negative (inverse) correlation

- Perfect correlation

- Strong correlation

- Weak correlation

- No correlation ??

Controlled Correlation

- Controlled correlation (clumping)

- Threshold

- Curvilinear
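Direction and strength can both be read from a single correlation coefficient. The outline does not name one; the sketch below assumes Pearson's r, whose sign gives the direction and whose absolute value (0 to 1) gives the strength. The function name and data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r: sign = direction, absolute value = strength of a linear relationship."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / math.sqrt(ss_x * ss_y)


xs = [1, 2, 3, 4, 5]
perfect_positive = pearson_r(xs, [2, 4, 6, 8, 10])  # y rises exactly with x
perfect_negative = pearson_r(xs, [10, 8, 6, 4, 2])  # y falls exactly with x

# A curvilinear pattern (y = x squared) can score zero even though
# y is completely determined by x.
curvilinear = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(perfect_positive, perfect_negative, curvilinear)
```

The last case is a reminder that a coefficient near zero rules out a linear relationship only, which is why looking at the scattergram first is worthwhile.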
