problem set #1 - university of denverurizen-geography.nsm.du.edu/~psutton/sutton_courses/geog... ·...

Problem Set #2Geog 2000: Introduction to Geographic Statistics

Instructor: Dr. Paul C. Sutton

Due Tuesday Week #5 in LabInstructions: There are three parts to this problem set: 1) Do it by Hand Exercises (#’s 1 – 10). Do these with a calculator or by hand. Draw most of your figures by hand. Write out your answers either digitally or by hand. Start your answer to each numbered question on a new page. Make it easy for me to grade. 2) Computer exercises (#’s 11-13). Use the computer to generate the graphics and paste them into a digital version of this assignment while leaving room for you to digitally or hand write your responses if you so wish. 3) ‘How to Lie with Statistics’ Essay(s) (#’s 14-18). For this section, prepare a typed paragraph answering each question. Do not send me digital versions of your answers. I want paper copies because that is the easiest way for me to spill coffee on your assignments when I am grading them. All of the questions on these four problem sets will resemble many of the questions on the three exams. Consequently, it behooves you to truly understand – ON YOUR OWN – how to do these problems. I am fine with you working with other students on these problem sets (to enhance your learning opportunities); however, my exams will punish those who free ride the comprehension of the material in these problem sets. (Don’t mess with me; don’t cheat yourself).

Pattern Recognition: Is this an old ‘cold war’ era satellite image or; a, ‘down on the farm’ photograph of Farmer John’s Best friend?

Do it By Hand Exercises (i.e. Don’t Use A Computer)

#1) The Standard Normal Distribution in all its glory

Make a detailed drawing of a Standard Normal (e.g. N(0,1)) probability density function (aka pdf), and answer the following questions:

A) What fraction of observations drawn from a N(0,1) distribution are larger than 1 standard deviation above the mean?

13.6 + 2.1 + 0.1 = 15.8%

B) What is the mathematical equation describing the N(0,1) pdf?

F(x) = (1 / (2*п)1/2 ) * (2.718281828)(-x2 / 2)

Note: It looks nicer in the figure above

C) What is the total area under the N(0, 1) curve? (e.g. ) where P(x) is the equation you identified in part ‘B’?

The total area under the standard normal curve from -∞ to +∞ is simply 1.

D) What fraction of observations drawn from a N(0,1) distribution would you expect to land anywhere from -1 to +1? (e.g. -1 <= X <= +1)

34.1 + 34.1 = 68.2%

E) What fraction of observations drawn from a N(0,1) distribution would you expect to land anywhere from -2 to +2? (e.g. -2 <= X <= +2); -3 to +3? (e.g. -3 <= X <= +3)

-2 to +2: 34.1 + 34.1 + 13.6 + 13.6 = 95.4%

-3 to + 3: 34.1 + 34.1 + 13.6 + 13.6 + 2.1 + 2.1 = 99.6%

F) Assume women’s heights are distributed N(5’ 7”, 3”). A particular woman has a height of 5’ 10”. What percentage of women are taller than this woman?

Same as Answer to ‘A’ above:

13.6 + 2.1 + 0.1 = 15.8%

G) Suppose there are 1,000 trout in Lake Goober. Assume the lengths of these trout are distributed N(10”, 2”). How many fish would you expect that lake to have that are anywhere from 11” to 13” long?

The expected value for the number of fishbetween 11” and 13” inches long is: 242

The total area under the standard normal curve from +0.5 to +1.5is 0.3085 – 0.0668 = 0.2417 or 24.2% 24.2% of 1,000 is 242 fish

#2) Measures of variability and Z-scores oh my!

Given the following observations of a random experiment calculate the following:

3, 4, 4, 6, 7, 8, 8, 8, 9, 10, 10, 12, 21, 28, 35, 52

A) The Mean 14.0625 B) The Median 8.5

C) The Standard Deviation 13.5 D) The Variance 182.2

E) The inter-quartile range 12 or 12.5 F) The Mode 8

H) Draw a histogram of these data and comment on the pattern it reveals.

0 10 20 30 40 50

100.0%99.5%97.5%90.0%75.0%50.0%25.0%10.0%2.5%0.5%0.0%

maximum

quartilemedianquartile

minimum

52.00052.00052.00040.10018.7508.5006.2503.7003.0003.0003.000

QuantilesMeanStd DevStd Err Meanupper 95% Meanlower 95% MeanN

14.062513.4979943.374498421.2550736.8699269

16

MomentsColumn 1

Distributions

This is clearly a skewed distribution with a longer tail toward high values

I) How is the standard deviation different than the Mean Absolute Deviation? Define both the standard deviation and the mean absolute deviation and calculate each of them for this set of numbers.

The Standard Deviation ‘S’ is given by:

In this dataset ‘s’ = 13.5The Mean Absolute Deviation is given by

In this dataset MAD = 9.96875

J) Calculate a z-score (e.g. a normalized value for each of the 16 numbers). Does it make sense to use z-scores for this particular distribution? Explain

X 3 4 4 6 7 8 8 8 9 10 10 12 21 28 35 52Z-

sc0re-

0.82-

0.75-

0.75 -0.6-

0.52-

0.45-

0.45-

0.45-

0.38 -0.3 -0.3-

0.15 0.514 1.033 1.551 2.811

If one is calculating z-scores under the assumption that this dataset is distributed normally it is not a good idea. This dataset is way too skewed to regard as normal.

K) Suggest a random experimental observation that might produce a set of numbers like this?

A lot of geographic data has skewed distributions like this. For example this could be the windspeeds at a particular location on March 9th for 16 different years. It could also be the number of dollars found in 16 randomly selected people’s pockets.

#3) Measurement Scales: “These Tennis Balls are MUCH Fuzzier”.

Define the four measurement scales most commonly encountered in geographic measurement (listed below as ‘A’, ‘B’, ‘C’, and ‘D’). Provide an example of the observations of a ‘random process’ for each of these:

A) Nominal ‘Apples and Oranges’. The values of this scale have no ‘numeric’ meaning in the way that we usually think about number. Land cover classes, Religion, Language, Sex, etc.B) Ordinal‘Ranking’ Sears ‘Good’ , ‘Better’, ‘Best’ characterization of their appliances. Likert scale survey responses such as ‘Strongly Agree’, ‘Agree’, ‘Neutral’ etc. The measurments can be ranked in an order from highest to lowest but the difference between them is not comparable (e.g. ‘Better’ is not necessarily as far from ‘Good’ as ‘Best’ is from ‘Better’). In essence, the ‘distance’ between numbers is not meaningful.C) IntervalWith both interval and ratio data the distance between numbers is meaningful. For example; 90˚F is 10 degrees hotter than 80˚ F and that 10˚ difference is the same as the 10˚ difference between 30˚ degrees and 20˚ degrees. However, for interval data the zero is not meaningful. Degrees Fahrenheit being the most commonly used example of interval data. 80˚ F is NOT twice as ‘hot’ as 40˚ F nor is 20˚ F twice as hot as 10˚F . If you measure in degrees Kelvin – Temperature is an absolute measure of the energy in matter and temperature ratios can be used reasonably. D) RatioWith ratio scale measurements the zero is meaningful and making ratios of measurements is appropriate. Length, Weight, Counts (e.g. populations) are all ratio measures.

E) Classify each of the following according to their measurement scale:

1) Student course Evaluation Scores here at DU. (i.e. Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree)

Likert Scale responses are definitely ‘ORDINAL’; however, many statisticians perform analyses on survey data using statistical methods that assume the responses are interval. This is considered problematic by some but is a common practice. For example, if survey responses ‘Strongly Agree’….”Strongly Disagree’ are coded as 5 …. 1 (Which they almost universally ARE comparing means (or averages) of these number is not truly kosher statistically if they are ordinal data. Nonetheless – Most Universities look at averages of these kinds of numbers when evaluating teaching effectiveness.

2) Sex (e.g. Male or Female)

Sex is a pretty heavy duty topic but classifying people as male or female is relatively simple (not always – but most of the time). This kind of classification is definitely Nominal.

3) Number of text messages sent from randomly selected cell phones

This is clearly a count. 10 messages is twice as many as 5 and zero is zero. This kind of measurement is clearly Ratio.

4) SAT scores of the graduating high school students of the year 2000

This is the toughest of these four to classify its measurement scale. It is impossible to even get a zero on an SAT test. The lowest possible score is 200. One question is: Is a 300 as much better than a 200 as a 700 is better than a 600. Another question is: Is an 800 twice as good as a 400? (answer: probably not – in fact – an 800 is probably more than twice as good as a 400). SAT scores are definitely ordinal – and they may be interval - but they are certainly not ratio.

#4) Tumbling Dice

Use the example of the rolling of two six sided dice to define the ideas of:

A) A Random Experiment

From the book: A random experiment is the process of observing the outcome of a chance event. Rolling two die and summing their face up values is not only a random experiment but it produces a value or ‘score’ which is a random variable which is the numerical outcome of a random experiment.

B) Elementary Outcomes

The elementary outcomes of this random experiment are all the possible combinations of the two die:1,1 1,2 1,3 1,4 1,5 1,6 2,1, 2,2 2,3 2,4 2,5 2,6 3,1 3,2 3,3, 3,4 3,5 3,6 4,1 4,2 4,3 4,4 4,5 4,6 5,1 5,2 5,3 5,4 5,5 5,6 6,1 6,2 6,3 6,4 6,5 6,6Basically there are 36 ways for two dice to ‘result’ in a ‘score’.Note However that if you define the ‘score’ as the ‘sum’ of the two die there are only eleven possible outcomes (e.g. 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12) and they do not all have the same probability.

C) Sample Space

From the book: The sample space is the set or collection of all the elementary outcomes. In this case it has been spelled out clearly above.

D) Draw a probability mass function for the random experiment of rolling two dice and summing their face up numbers

Score Prob Dec Eqv2 1 in 36 0.027773 2 in 36 0.05554 3 in 36 0.083335 4 in 36 0.11116 5 in 36 0.138887 6 in 36 0.16668 5 in 36 0.138889 4 in 36 0.111110 3 in 36 0.0833311 2 in 36 0.055512 1 in 36 0.02777

E) What is the the probability of at least one of the die showing a ‘6’ on its face in a single roll of two dice?

The easy but tedious way to answer this question is to go to the sample space and count the number of the 36 ‘events’ that have a ‘6’ in them. Based on that the odds (aka probability) of having at least one ‘6’ in a roll of two die is 11/36 (0.30555). From the book we might see it as one of the probability rules: P(E or F) = P(E) + P(F) – P(E & F).In this case P(‘6’ on die ‘a’ OR ‘6’ on die ‘b’) = 1/6 +1/6 – 1/36 = 11/36. Get it?

#5) Sometimes extreme events are very important & the Normal Distribution is not.

Consider the following numbers observed by a hypothetical hydrologist for a stream he has been observing for 3 years. The numbers represent the maximum flow of water (in cubic meters per minute), at a given point at the lower end of the stream right after a rainstorm.

1.0, 1.0, 1.5. 1.6, 1.8, 2.0, 2.1, 2.1, 2.1, 2.5, 2.5, 2.6, 2.7, 2.7, 2.7, 3.0, 3.3, 3.5, 3.5, 3.6, 3.6, 3.7, 3.7, 3.8, 3.9, 4.1, 4.3, 4.9, 5.5, 6.3, 7.1, 7.5, 8.8, 10.3, 16.5, 59.6

A) Draw a histogram of these numbers (use JMP if you wish). Do these observations look normally distributed? How might the Normal distribution not be an appropriate model for these observations?

0 10 20 30 40 50 60

100.0%99.5%97.5%90.0%75.0%50.0%25.0%10.0%2.5%0.5%0.0%

maximum

quartilemedianquartile

minimum

59.60059.60059.6009.2504.7503.5002.2001.5701.0001.0001.000

QuantilesMeanStd DevStd Err Meanupper 95% Meanlower 95% MeanN

5.59444449.72901241.62150218.88626872.3026202

36

MomentsColumn 1

Distributions

Clearly, this distribution is very skewed. Does not look normal at all.

C) Suppose that sediment transport (e.g. kilograms of soil, gravel etc.) that are moved by these storms varies directly as a function of the flow of water raised to the 3rd power (e.g. a cubic function (i.e. Flow = 1 , sediment transport = 1; flow = 2, sediment transport = 8; flow = 3, sediment transport = 27). Calculate the predicted sediment transport for each of the storms and calculate what percentage of the sediment transport that took place over these three years was accomplished by the single largest storm. Paste a table in here (you’ll probably want to use excel or JMP)

Stream Flow 1 1 1.5 1.6 1.8 2 2.1 2.1 2.1 2.5 2.5 2.6Sed Flow (x3) 1 1 3.375 4.096 5.832 8 9.261 9.261 9.261 15.625 15.625 17.576

Stream Flow 2.7 2.7 2.7 3 3.3 3.5 3.5 3.6 3.6 3.7 3.7 3.8Sed Flow (x3) 19.68 19.68 19.68 27 35.94 42.875 42.88 46.66 46.66 50.653 50.653 54.872

Stream Flow 3.9 4.1 4.3 4.9 5.5 6.3 7.1 7.5 8.8 10.3 16.5 59.6Sed Flow (x3) 59.32 68.92 79.51 117.6 166.4 250.05 357.9 421.9 681.5 1092.7 4492.1 211708.7

Total Sediment moved: 220,053.8 Biggest Stream Flow: 211,708

211,708/220,053 = .962 96.2% From One Storm

D) Assume these are reasonable numbers (as observations) and the stated relationship between stream flow and sediment transport is also correct. How important are extreme events? How well do you think the Normal distribution will serve as a model for these kind of hydrologic phenomena?

In this example a single event accounted for 96.2% of the sediment flow. This event also accounted for almost 30% of total water flow (29.59% actually). Clearly modeling this with a normal distribution is not such a good idea. Beware of this because many many many geographic phenomena are not well characterized by the normal distribution.

#6) Shooting Craps - The odds are not that easy to calculate….

Google the rules for shooting craps. The gist is this (I think; – roll a ‘7’ or an ‘11’ on the first roll – you win. If you roll ‘snake-eyes’ (a one and a one) you lose. I think you also lose if you roll a 12). If you roll something other than a 2,7,11, or 12 than you keep rolling until you match what you rolled or something else happens – I don’t really know the rules.

A) In any case – summarize the rules of ‘craps’ right here. (I hope to learn something from this).

Thanks to Wikipedia – Again: Craps is a game played by 1 or more players. Players take turns rolling two dice. The player rolling the dice is called the "shooter." The game is played in rounds, with the first roll of a new round called the "come-out roll."On the come-out roll, if the total of the two dice is 7, 11, 2, 3 or 12, the round ends immediately and the shooter must roll another come-out roll. A result of 2, 3 or 12 is called 'craps' while a result of 7 or 11 is called a 'win' or a 'natural.' When any other number (4, 5, 6, 8, 9, or 10) is rolled on the come-out roll, this number becomes what is called the point. If a point is established, then the shooter will re-roll the dice continuously until either a 7 is rolled or the point is rolled again. If the shooter rolls the point again, the round ends and the game starts over with the same shooter rolling another come-out roll. If the shooter rolls a 7 instead of the point, this is called a 'seven-out,' the round ends, and the dice pass to the next player to the left, who becomes the new shooter.

B) What is the probability of ‘winning’ on your first roll? (e.g. a ‘7’ or an ‘11’)

Rolling a ‘7’ is mutually exclusive from rolling an ‘11’. Consequently we can simply add their respective probabilities. P(‘7’) = 1/6; P(‘11’) = 1/18. Thus the probability of winning on the ‘come-out’ roll is simply: 1/6 + 1/18 = 2/9 (.2222..)

C) What is the probability of ‘losing’ on your first roll?

Again, the probability of rolling a 2,3, or 12 are all mutually exclusive so we can simply add their respective probabilities. P(‘2’) = 1/36; P(‘3’) = 2/36; P(‘12’) = 1/36. Thus the probability of losing on ‘come-out’ roll is: 1/36 + 2/36 + 1/36 = 1/9 (.1111..)

D) Why is it difficult to calculate the ‘overall’ odds of winning at craps?

While it is easy to calculate the odds (probability) of winning or losing on the ‘come-out’ roll (which is in fact 2/9 + 1/9 = 1/3 of the time the first roll is the last roll of a round); it is not as easy to calculate the odds when a ‘point’ is rolled. Suppose your ‘come-out’ roll is a ‘5’. Theoretically you could indefinitely roll ‘not 5’s’ and ‘not 7’s’ forever. It’s kind of like baseball that way . Solving this kind of problem is still classical probability and the numbers are calculable (real mathematicians would regard these problems as trivial); nonetheless they are beyond the scope of this instructor and this class .

http://en.wikipedia.org/wiki/Dice

#7) A Tale of two distributions

Use the normal and the binomial distribution to explain the difference between a discrete random variable and a continuous random variable. What are the parameters of these two distributions? Why does one have a probability distribution function whereas the other has a probability mass function? Describe the relationship between these two distributions? (Larry Gonick’s ‘Cartoon Guide to Statistics’ Should REALLY help ).

The Normal Distribution is continuous and has two parameters the mean (μ) and variance (σ2 ). The mean describes the central tendency and the variance describes the spread or variability about that central tendency. We often designate this as N(μ, σ2 ). The pdf of a normal distribution is a symmetric bell shaped curve. It is a pdf (probability density function) because the normal is continuous. Random variables measured from something that is distributed normally can take on an infinite number of possible values. The Binomial on the other hand is a discrete distribution. Consequently it has a probability mass function (pmf) which essentially describes the probability of each of the finite number of possible outcomes of a Binomially distributed random variable. The number of possible outcomes (or sample space ) is N plus 1; where ‘N’ is one of the parameters of the binomial distribution. ‘N’ can be thought of as the ‘# of coin flips’ or ‘# of Bernoulli trials’. The other parameter of the Binomial is ‘p’ or the ‘probability of success’ or, in the case of a coin, the probability of ‘heads’. Thus a binomial random variable is generally designated as Bin(N, p). As ‘N’ approaches very large numbers and ‘p’ is 0.5 the pdf of a binomial starts to look very similar to a normal distribution. The Normal and Binomial are two very important and fundamental distributions that are distinct in that they are continuous and discrete respectively; however, they are fundamentally related to one another.

2,000 samples from Binomial (100, .5) 2,000 samples from N(50, 5)

The above two figures were derived from random generations from a Binomial and a Normal. As you can see there is a striking similarity (and that’s with N=100).

#8) NASCAR, fishing derbies, and Random Variables

Suppose there is a big fishing derby for NASCAR fans in Kentucky. Lake Goober is stocked with trout whose lengths are distributed N(10, 3). Assume the IQ of the fishing derby participants is distributed N(80, 10). Let’s say we create a random variable that is simply the sum of the IQ of the fisherman and the length of the first fish he or she catches. How will that random variable be distributed?

First, let’s assume that the IQ of the fisherman and the length of the first fish he or she catches are independent (i.e. smarter folks don’t tend to catch bigger fish). If you have carefully read pages 68-72 of The Cartoon Guide you would know that when we add two independent random variables we can simply add their means and add their variances; so, the answer is simply:

N(80, 10) + N(10, 3) = N(90, 13)

#9) Testing for rare diseases and the False Positive Paradox (not a true paradox by the way)

Suppose a rare disease infects one out of every 3,000 people in a population. And suppose that there is a good, but not perfect, test for this disease. If a person has the disease the test comes back positive 99 % of the time. On the other hand, the test also produces some false positives. About 1 % of the uninfected patients also test positive. Assume you just tested positive.

A) What are your chances of having this disease?

Let’s say one million people get tested. Presumably you can expect there to be (1/3000)*1,000,000 = 333.33 people who actually have the disease. 99% of these would test positive: 0.99 * 333.33 = 330. Thus out of a million 330 with the disease would test positive. Now, how many without the disease would test positive? Well, there would be 1,000,000 – 333.33 = 999,666.67 ‘people’ will not have the disease. 1% of these will test positive. This would be: 0.01 * 999,666.67 = 9,996.67. So the number of people who tested positive is: 9,996.67 + 333.33 = 10,330. The percentage of those who tested positive that actually have the disease is 100* (333.33 / 10,330) = 32.3%. So, even if you tested positive there is only a 32.3 % chance that you really have the disease.

C) Also, assume all administrations of this test are independent of one another regardless of who is tested. What should you do if you test positive?

Get tested again. If you test positive again start to worry but get tested again. This kind of test with ‘false positive’ possibilities is problematic.

#10) Pascal’s Triangle, Permutations, Combinations, and the Binomial Distribution

Draw the first few lines of Pascal’s triangle and explain the rule for drawing additional rows indefinitely. How is this triangle related to the Binomial distribution? What is the binomial distribution again (include formulas)? Define, and explain applications, of the formulas for both permutations P(n,r) and combinations C(n,r) shown below:

P(n,r) is the number of ‘permutations’ of ‘n’ things taken ‘r’ at a time. For permutations order matters. C(n,r) is the number of Combinations of ‘n’ things taken ‘r’ at a time. Here order does not matter. Think of the letters A, B, C, and D. How many permutations of two letters from these four are possible? That would be P(4, 2). The formula says it is: 4*3*2*1 = 12 Permutations (2*1)(e.g. AB, BA, AC, CA, AD, DA, BC, CB, BD, DB, CD, DC)

How many combinations are possible?The next line is 1, 6, 15, 20, 15, 6, 1 – The rule The formula says: 4*3*2*1 = 6 combinationsIs That you add the two numbers above it (and 2*1*(2*1)keep 1’s on the edges).

(e.g. AB, AC, AD, BC, BD, CD)

Wikipedia again – In mathematics, the binomial theorem is an important formula giving the expansion of powers of sums. Its simplest version says

whenever n is any non-negative integer, the number

is the binomial coefficient (using the choose function), and n! denotes the factorial of n.

Binomial Probability Formula:

Where p is the probability of a success, and q is the probability of a failure (which is complimentary to p, that is q=1-p.) This is simply a rewording of the binomial theorem. That is N is N factorial and K is K factorial*(N - K)

http://en.wikipedia.org/wiki/Binomial_theorem

http://en.wikipedia.org/wiki/Factorial

http://en.wikipedia.org/wiki/Choose_function

http://en.wikipedia.org/wiki/Binomial_coefficient

http://en.wikipedia.org/wiki/Integer

http://en.wikipedia.org/wiki/Negative_and_non-negative_numbers

http://en.wikipedia.org/wiki/Sum

http://en.wikipedia.org/wiki/Exponentiation

http://en.wikipedia.org/wiki/Formula

http://en.wikipedia.org/wiki/Mathematics

Computer Problems (use JMP and/or Excel for these exercises)

#11) Going from Univariate to Bivariate Characterization of Random Manifestations of distributions: Point patterns in ‘Spaaaaace’.

YOW! This is fun.As you did in problem set one, create the following ‘random’ generations of 2000 values for each of the following distributions: N(100, 3), N(100, 30), N(100, 30) (again – two different instantiations), Exponential (1), Uniform (100), Uniform (100) (again – two different instantiations), and Gamma (7).

Now, you will use the ‘scatterplot’ functionality of JMP. Go to the ‘Analysis’ menu and choose ‘Fit Y by X’. Here you can produce scatterplots of any two of your univariate random columns. The dialog box looks like the figure below.

Once you have selected an ‘x’ and a ‘y’ you click ‘OK’ and get a scatterplot that in this case looks like the figure on the right. In this case a Uniform(100) distribution was plotted against an Exponential (1) distribution. Suppose this is a ‘map’ of a geographically observed phenomena. What random observations might show such a pattern? Here is an example: Presume that the ‘X’ axis is the north side of an interstate highway and the black dots represent cigarette butts found along the side of the highway. Most are close to the highway and they appear less and less frequently as an exponential function of distance from the road. They are uniformly distributed along the highway because (in theory) people will

0

100

200N

(100

, 30)

'A'

0 100 200

N(100, 30) 'B'

Bivariate Fit of N(100, 30) 'A' By N(100, 30) 'B'

0

100

200

N(1

00, 3

0) 'B

'

0 1 2 3 4 5 6 7 8 9

Exponential (1)

Bivariate Fit of N(100, 30) 'B' By Exponential (1)

toss butts out the window at equal probability anywhere along the road (of course we are assuming we are a long way from exits, gas stations, scenic overlooks, etc. In fact, this is a perfectly reasonable ‘model’ of cigarette butt distribution from a transportation corridor. Hopefully this improves your idea of what many people are talking about when they say something along the lines of: ‘We modeled this geographically manifested phenomena according to these assumptions and distribution functions’. Not really as scary as it sounds eh? Now your problems are to create a bunch of scatterplots based on the distributions you have ‘randomly’ sampled from. For each scatterplot suggest a phenomena (geographic or not) that might produce such a pattern. (Note: Paste your plot into your submitted answer set along with your comments and interpretations).

A) Plot N(100, 30) vs. N(100,30). Describe the pattern you see and suggest a random phenomena that might produce such a pattern. Explain.

This is a ‘shotgun’ pattern with the densest concentrarion of points at the center point (100, 100). Possible phenomena that might produce such a pattern – 1) Dead cows around a poisoned water well? 2) A scatter plot of IQ and height of Americans born in 1980?

B) Plot N(100, 30) vs. an Exponential (1). Describe the pattern you see and suggest a random phenomena that might produce such a pattern. Explain.

This one is kind of weird. The density is bell-shaped as you go up and down the y axis but decays exponentially as you go right along the x axis. Possible pehneomena that might display this pattern – People crowding around a window in a wall from which someone is handing out $1,000 dollar bills? The scattering of a turned over truck full of rubber duckies (Y axis is the road)?

0102030405060708090

100

Uni

form

(100

) 'B'

0 100 200

N(100, 30) 'A'

Bivariate Fit of Uniform(100) 'B' By N(100, 30) 'A'

10

20

Gam

ma

(7)

0 10 20 30 40 50 60 70 80 90 100

Uniform (100) 'A'

Bivariate Fit of Gamma (7) By Uniform (100) 'A'

0102030405060708090

100

Uni

form

(100

) 'A'

0 10 20 30 40 50 60 70 80 90 100

Uniform(100) 'B'

Bivariate Fit of Uniform (100) 'A' By Uniform(100) 'B'

C) Plot a Uniform (100) vs. a N(100, 30). Describe the pattern you see and suggest a random phenomena that might produce such a pattern. Explain.

The density of points is uniformly bell shaped along the x-axis and constant along the y-axis. Possible phenomena that might display this pattern – Trash along a north – south running highway? Riparian trees e.g. cottonwoods) along a north –south running river?

D) Plot a Gamma(7) vs. A Uniform (100). Describe the pattern you see and suggest a random phenomena that might produce such a pattern. Explain.

The point density is uniform as you travel across the x-axis but it is low, peaks, and dribbles out (skewed) as you go up the y-axis. A plot of ‘inches of rainfall per storm’ (y –axis) vs. time (x-axis) might produce this pattern. Or, this could be the pattern of where pennies land when tossed from the top of a tall building. Where the wall of the building is the x axis.

E) Plot a Uniform (100) vs. a Uniform (100). Describe the pattern you see and suggest a random phenomena that might produce such a pattern. Explain.

Now this is complete spatial randomness. Uniform distribution across x-axis, uniform distrubition across y-axis. Phenomena that might look like this –Meteorite impacts on a planimetric surface (almost Colorado –latitude effects tweak a bit); Cactus locations on a large flat desert with uniform soil, rainfall etc.?

0

100

200

N(1

00, 3

0) 'B

'

90 100 110

N(100, 3)

Bivariate Fit of N(100, 30) 'B' By N(100, 3)

F) Plot a N(100,30) vs. N(100,3). Describe the pattern you see and suggest a random phenomena that might produce such a pattern. Explain.

Be careful here. This has a striking resemblance to the ‘shotgun’ pattern you saw in ‘A’; however, note the scales. The y-axis ‘spread’ is much higher than the x-axis ‘spread’. If you ‘mapped’ this fairly it would be a ‘cigar-shaped’ pattern. Possible ways to manifest this pattern: The pattern –on the ground- of sewage dumped from a flying jet? A scatterplot of the length of a coke cans held by 2000 people and their respective weights in pounds?

#12) What is ‘Complete Spatial Randomness’ (aka CSR) for Point Patterns?

Complete spatial randomness is an important theoretical idea for geographers. Of course, the nature of what you are measuring is important; and, confounding, for this idea. What would completely spatially random temperature look like? Not a reasonable thing to think of as a random phenomena - really. Temperature is spatially auto-correlated – which we will talk about later. Temperature is also continuous – best represented as a field. In fact, temperature, pressure, humidity, rainfall, population density, and many other important geographic phenomena are often best represented as a field and ARE NOT SPATIALLY RANDOM. The fact that so many phenomena are not spatially random is a big reason geography is a discipline. Nonetheless; complete spatial randomness is an important idea to start from. Often we will test to see if phenomena manifest as non-random in space. So… Let’s consider a phenomena that we might theoretically expect to be completely spatially random (CSR): Meteorite impacts in the state of Colorado.

A) Which of the spatial distributions you produced in Question #1 of the Computer problems do you think most closely represents ‘Complete Spatial Randomness’ as we might theoretically expect for the ‘x’, ‘y’ locations of meteorite impacts in the state of Colorado? Explain.

The Pattern in ‘E’ – Uniform vs. Uniform defines an example of Complete Spatial Randomness. Any ‘X’ or ‘Y’ coordinate is equally likely. Interestingly enough, a quadratting of such a CSR pattern produces ‘counts’ that have a Poisson distribution.

#13) How do you randomly sample the earth’s surface?

A team of scientists were producing a global land cover dataset derived from the interpretation of satellite imagery obtained from Landsat. They used some complex statistical methods beyond the scope of this course to ‘classify’ all the ‘pixels’ in these images as some sort of land cover type or ocean or lake or whatever. These complex methods of ‘classification’ are not perfect. Efforts of this nature usually require a statistical approach to ‘ground truthing’ of the ‘classification accuracy’. Usually this involves the generation of ‘X’ number of points (RANDOMLY IN SPACE) in which ‘field scientists’ will got to those particular locations and determine the appropriate land cover classification of that particular point in space. Once all these field scientists have traveled to these locations and characterized them a ‘classification accuracy’ assessment can be completed.

A) Suppose the team that is doing this work prepares the following “Random Sample” of the earth’s surface according to the following rule: We generate a Random Uniform (-90, +90) distribution of ‘Y’ or ‘Latitude’ Values and pair them randomly with a Random Uniform (-180, +180) set of ‘Longitude’ Values. This set of ‘Y’ (aka Latitude) and ‘X’ (aka Longitude) pairs will randomly sample the surface of the earth (assuming it’s a sphere which we know it is not but who cares?). Will this approach randomly sample the surface of the earth – assuming it is a sphere? Explain.

NO. The earth is distorted when represented as a 360 by 180 rectangle. The poles are ‘over-represented’. Remember a 1° by 1° ‘patch’ of earth is not really even a square at the equator (although it is pretty close to one) but it gets less and less square as you approach the poles.

B) Only if you think the approach presented above is NOT random – Propose a better method of randomly sampling the surface of the earth using simple and reasonable probability density functions. Explain.

The best way to do this is with a Uniform distribution for Longitude and a Cosine distribution for Latitude. Cosine goes from 1 to 0 as ‘Y’ goes from 0 to 90°

How to Lie With Statistics Questions

#14) Write a 4 to 7 sentence summary of Chapter 4: Much ado about practically nothing

This chapter is about perhaps significant but trivial results. Is a child with a measured IQ of 103 really smarter than one with a measured IQ of 99? Given unceratainty and narrowness in the measuring process – probably not. Often Medical studies say things like: Eating oatmeal every day increases your lifespan. A study involving tens of thousands of patients might demonstrate that people who are forced to each oatmeal every day do indeed live one week longer than those who don’t. So What? I’ll take the non-oatmeal eating life – any day.

#15) Write a 4 to 7 sentence summary of Chapter 5: The Gee-Whiz Graph

The axes of a graph can be used to legally deceive. Suppose a DU supporter wants to show how much more popular DU is based on undergraduate applications. They could present a graph with the y-axis ranging from 3,000 to 3,100 and plot two successive years number of applications of 3,001 and 3,099. The line would be going up in a steep curve depending upon how far apart the two years are on the x-axis. However, this really only represents a 3% increase from one-year to the next and may in fact be random variation.

#17) Write a 4 to 7 sentence summary of Chapter 6: The one-dimensional picture

Edwin Tuffte gets his panties in a twist about this kind of distortion. The book shows pictorial examples of deception (Cows, Bags of Money, etc). If you plotted your income through time it is perfectly reasonable to use a ‘bar’ chart. Scaling linear dimmensions on 2-D representations of 3-D objects is a cognitive no-no unless you are intentionally being rhetorically sleazy.

# 18) Suppose that just the other day the Dow Jones Industrial Average dropped by 300 points. Find the last month’s worth of Dow Jones Industrial Averages and plot them. Now, Prepare two graphs: 1) A graph where you are trying to scare people that you are showing the graph to that the Dow Jones is Crashing (e.g. the sky is falling); and 2) Where you are trying to assuage the fears of your investors and ‘minimize’ this 300 point drop in the Dow Jones Industrial Average. Explain.

I went to this URL to get some DOW Jones data: http://money.cnn.com/quote/chart/chart.html?symb=djia&sid=1643&time=1mo&Submit1=RefreshBut I just ginned these two graphs up from the aether to make this point:

12,500 12,000

012,200

February 2008February 2008

Both show the same data (not really but close enough) only y-axiz scaling is different.

http://money.cnn.com/quote/chart/chart.html?symb=djia&sid=1643&time=1mo&Submit1=Refresh

http://money.cnn.com/quote/chart/chart.html?symb=djia&sid=1643&time=1mo&Submit1=Refresh

problem set #1 - university of denverurizen-geography.nsm.du.edu/~psutton/sutton_courses/geog... ·...

Documents