2ws30 parti - tu/ermcastro/2ws30/files/2ws30_parti.pdf · • number of rotten apples in each crate...

2WS30/39 Mathematical Statistics 2WS30 - Introduction and E.D.A./Descriptive Statistics

Upload: phungtruc

Post on 13-Mar-2018




I.2

Before we start…

http://www.win.tue.nl/~rmcastro/2WS30

Lecturers/Instructors:
•  Alessandro Di Bucchianico, Office: MF7.097a, Email: [email protected]
•  Alberto Brini, Office: MF4.???, Email: [email protected]
•  Paulo Serra, Office: MF4.???, Email: [email protected]

Lecturers/Instructors:
•  Rui M. Castro, Office: MF4.075, Phone: (040 247) 2499, Email: [email protected]


I.3

Before we start…

Setup of the course:
•  Two weekly lectures
•  Two weekly instructions/tutorials/advising

Student Assessment:
•  Homework assignments (20%)
•  Modeling project (20%)
•  Final Exam (60%)

Prerequisites:
•  Probability Theory (2WS20)
•  A reasonable level of mathematical maturity


I.4

Important Topics from Prob. Theory

•  Expectation of r.v.’s and functions of r.v.’s

•  Computing the distribution of a function of random variables using the Probability Integral Transform

•  The properties of the sum of random variables

•  Law of large numbers

•  Central limit theorem

•  Convergence of random variables

•  Conditional expectations


I.5

Before we start…

Study Materials:
•  “Statistical Theory: A Concise Introduction”, Abramovich and Ritov
•  Others (see website)

Announcements and other course materials:
•  I’ll post everything on the course webpage
•  I’ll send emails when necessary

I ASSUME YOU ARE ALL REGISTERED FOR THE COURSE! (you won’t receive any announcements otherwise)


I.6

What is Statistics?

According to the Encyclopedia Britannica: “Statistics is the art and science of gathering, analyzing, and making inferences from data”

In his book “Statistical Models”, A. C. Davison answers the question in a much more thorough way: “Statistics concerns what can be learned from data. Applied statistics comprises a body of methods for data collection and analysis across the whole range of science, and in areas such as engineering, medicine, business, and law - wherever variable data must be summarized, or used to test or confirm theories, or to inform decisions. Theoretical statistics underpins this by providing a framework for understanding the properties and scope of methods used in applications.”


I.7

What is Statistics?

Statistics is often associated only with polls, censuses, and other “boring” stuff

However, this is a very limiting view of statistics:


I.8

Probability AND Statistics

Probability and Statistics are NOT the same thing!!!

•  Probability provides the foundation of statistics
•  Statistics is concerned with testing hypotheses/making inferences about the “world” by using data (assumed to be collected according to some probabilistic model)

Population (World)

Sample (small part of the population)

Statistics – Inference about the population/world from the sample

Probabilistic Model (models how data is created)


I.9

In this Course

•  Emphasis on the theoretical underpinnings and foundations of statistical inference.

•  In the modeling/homework assignments you will also encounter other aspects of statistics, such as the gathering, description and summarization of data

•  Very importantly, you’ll encounter the issues related to the choice of a “good” statistical model.


I.10

A Typical Example

The data in this example is loosely based on a poll, as described in http://www.rasmussenreports.com/public_content/politics/elections/election_2012/election_2012_presidential_election/wisconsin/election_2012_wisconsin_president.

The presidential elections in the United States work in a funny way, and in each state there is essentially a separate election. Very important are the so-called “swing states”, for which it is difficult to predict the outcome of the electoral process. In the 2012 election Wisconsin appeared to be such a swing state. A phone survey (July 25, 2012) of 480 likely voters yielded the following data: 248 individuals indicated they would vote for Barack Obama; 232 individuals indicated they would vote for Mitt Romney. What predictions can be made about the outcome of the Wisconsin election (if it were to take place on that same day)?
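One way to start on this question is to estimate the fraction of voters favouring Obama and attach a rough uncertainty to it. The sketch below is only an illustration: it assumes the 480 respondents are a simple random sample of voters, and it uses the normal-approximation (Wald) interval, which is one choice among several.

```python
import math

# Survey counts quoted in the text.
n_obama = 248
n_romney = 232
n = n_obama + n_romney  # 480 respondents in total

# Point estimate of the fraction of voters favouring Obama.
p_hat = n_obama / n

# Normal-approximation 95% interval (assumes a simple random sample).
z = 1.96
std_err = math.sqrt(p_hat * (1 - p_hat) / n)
ci_low = p_hat - z * std_err
ci_high = p_hat + z * std_err

print(f"estimate: {p_hat:.3f}, 95% interval: ({ci_low:.3f}, {ci_high:.3f})")
```

Under these assumptions the interval contains 0.5, so the survey alone does not clearly separate the two candidates.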


I.11

German Tanks

During World War II it was important for the Allies to assess the number of tanks and V2 rockets that the Germans were able to produce in a certain period of time. A lot of money was spent on intelligence to do so. However, the most successful and accurate approach was a relatively simple statistical one (helped by some naivety on the part of the Germans): each German tank that was captured had serial numbers on various parts (e.g., the engine block). As the name indicates, these were serial, essentially ranging from 1 to N. Assuming simplistically that each produced tank is equally likely to be captured gives a possible way to estimate N.


I.12

German Tanks

A concrete instance: During a certain period six German tanks were captured, with serial numbers 17, 68, 94, 127, 135, 212. Then a good estimate for N is given by

Date          Estimate   True value   Intelligence estimate
June 1940     169        122          1000
June 1941     244        271          1550
August 1942   327        342          1550
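For the six serial numbers of the concrete instance, a commonly used estimator takes the largest observed serial number m and the sample size k and returns m(1 + 1/k) - 1. This is an assumption about which formula is meant, since the expression itself is not reproduced in this text; the function name below is hypothetical.

```python
def estimate_total(serials):
    """Estimate N, assuming the serials are drawn uniformly without
    replacement from 1..N (the simplifying assumption in the text)."""
    k = len(serials)   # number of captured tanks
    m = max(serials)   # largest serial number observed
    return m * (1 + 1 / k) - 1

captured = [17, 68, 94, 127, 135, 212]
estimate = estimate_total(captured)
print(round(estimate, 1))
```

The idea is that the largest serial tends to fall a little short of N, and the average gap between observed serials is used to correct for that.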


I.13

Biology and Estimation of Missing Mass

Suppose you are working with biologists studying the ecosystem of a certain lake. They would like to know how many species of fish inhabit the lake. They set several (fish-friendly) nets in different places and record the following catch:

You later go fishing on the lake. What is the probability you’ll encounter a species you haven’t seen before?

The Good-Turing estimator of this quantity is 2/12=0.167
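The Good-Turing estimate of the missing mass is the number of species seen exactly once divided by the total number of fish caught. The sketch below uses a hypothetical catch (the species names and counts are made up, chosen only so that 12 fish are caught and 2 species appear once, matching the 2/12 above).

```python
from collections import Counter

def good_turing_missing_mass(observations):
    """Fraction of observations belonging to species seen exactly once."""
    counts = Counter(observations)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(observations)

# Hypothetical catch: 12 fish, two species observed only once.
catch = ["perch"] * 5 + ["pike"] * 3 + ["bream"] * 2 + ["eel", "carp"]

missing_mass = good_turing_missing_mass(catch)
print(f"{missing_mass:.3f}")
```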


I.14

What is Data?

Definition: Data and Dataset

This seems a bit vague… For our purposes: Data is a collection of numerical or categorical observations of a certain process (physical, biological, social, etc.). Depending on the questions one wants to answer, the order of the data might be important (e.g. the AEX over time) or irrelevant (e.g. exam grades of 2WS30 ordered by student last name).


I.15

A Typical Dataset

To better understand the impact of smoking during pregnancy, a big study was conducted in the USA. All the pregnancies under a certain health cooperative (in San Francisco) were monitored between 1960 and 1967, and figures like the mother’s age, smoking status, baby weight at birth, etc. were collected (a total of 1236 valid entries). For instance, this is a list of the mothers’ ages (in years):

27 33 28 36 23 25 33 23 25 30 27 32 23 36 30 38 25 33 33 43 22 27 25 30 23 27 (…)

We would like to make “meaningful” statements about mothers in San Francisco, but using only this sample…


I.16

Descriptive Statistics

Typically we can only say something sensible about data or a dataset if we assume a statistical model for it. Nevertheless, a good start is to summarize the contents of a dataset, or represent them in a palatable way. This is the goal of Descriptive Statistics, which are either numerical or graphical summaries and representations of data, and it is also a key aspect of Exploratory Data Analysis.

In what follows we will concentrate mostly on scenarios where the ordering of the elements in the dataset is not considered important. E.g.:

•  Exam grades of 2WS30
•  Customer satisfaction ratings of a store
•  Number of rotten apples in each crate of apples from a certain producer (order of the crates doesn’t matter)


I.17

A Typical Dataset

Population (mothers in San Francisco)

Sample (a small number of mothers in San Francisco)

Our hope is that the sample is somewhat representative of the entire population…

Before trying to do this, let’s see if we can “understand” the data a bit better, and summarize it in nice ways…


I.18

Numerical Summaries – Sample Mean

Often it is good to have an idea of where the data values are hovering around. There are a number of natural ways to quantify this:

Definition: Sample Mean/Sample Average

For the dataset of the previous slides we have

Clearly this is good information to have, but it would be good to know if mother’s age is always close to this, or differs wildly…
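As a small illustration, the sample mean (the sum of the observations divided by their number) can be computed for the 26 ages listed earlier. This is only the visible excerpt of the 1236 entries, so the result is not the value for the full dataset.

```python
# The 26 mother's ages shown in the excerpt (not the full dataset).
ages = [27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23,
        36, 30, 38, 25, 33, 33, 43, 22, 27, 25, 30, 23, 27]

def sample_mean(xs):
    """Sum of the observations divided by the number of observations."""
    return sum(xs) / len(xs)

mean_age = sample_mean(ages)
print(f"n = {len(ages)}, sample mean = {mean_age:.2f} years")
```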


I.19

Sample Variance/Standard Deviation

Definition: Sample Variance/Standard Deviation

In our example

Notice the units are squared !!!

The sample standard deviation is given by

An intuitive interpretation of what the sample standard deviation represents is not so easy, but we can still understand why it does measure variability:


I.20

Sample Variance/Standard Deviation

always non-negative

Properties: Sample Variance/Standard Deviation

The last expression typically makes computations by hand easier, but numerically it can be a very bad choice…
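The numerical danger can be seen in a short experiment. The sketch below assumes the usual definition of the sample variance with denominator n - 1, and takes the "shortcut" to be the form based on the sum of squares minus the squared sum divided by n; both are assumptions, since the formulas themselves are not reproduced in this text. The function names are hypothetical.

```python
def variance_two_pass(xs):
    """Sample variance from squared deviations around the mean."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def variance_shortcut(xs):
    """Sample variance from the sum of squares and the squared sum."""
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    return (total_sq - total * total / n) / (n - 1)

small = [4.0, 7.0, 13.0, 16.0]
shifted = [1e9 + x for x in small]  # same spread, huge offset

print(variance_two_pass(small), variance_shortcut(small))
print(variance_two_pass(shifted), variance_shortcut(shifted))
```

Shifting the data does not change the true variance, yet the shortcut subtracts two enormous, nearly equal numbers, and the rounding error in floating point swamps the answer.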


I.21

The Sample Range

Definition: Sample Range

Another way to assess variability:

In our example

This seems fishy. Actually, there are two entries in the data that are 99. It turns out this value is not the age of the mother, but rather indicates that her age was unknown. So we must treat these two entries as missing values. Removing these you’ll get
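A sketch of this clean-up step, taking the sample range to be the largest value minus the smallest. It uses the 26 ages from the excerpt with two 99 codes appended by hand (their position is hypothetical), so the numbers are not those of the full dataset.

```python
MISSING_AGE = 99  # code for "age unknown", as described in the text

def sample_range(xs):
    """Largest observation minus smallest observation."""
    return max(xs) - min(xs)

ages = [27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23,
        36, 30, 38, 25, 33, 33, 43, 22, 27, 25, 30, 23, 27]
ages_with_codes = ages + [MISSING_AGE, MISSING_AGE]  # hypothetical placement

# Drop the coded entries before summarizing.
cleaned = [a for a in ages_with_codes if a != MISSING_AGE]

print(sample_range(ages_with_codes), sample_range(cleaned))
```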


I.22

Other Numerical Summaries

Definition: Order Statistics

There are many other numerical summaries that are important (we’ll encounter these again, in the context of graphical representations of data)


I.23

Sample Median and Percentiles

Definition: Sample Median

This is essentially the value that “splits” the dataset in two: approximately half of the data is below the median and half is above the median. More generally, we can define

Definition: Sample Percentiles

Calculation of sample percentiles is not done the same way everywhere, and most statistical packages use a definition that involves interpolation (like the median above).
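The effect of differing conventions is easy to see with Python's standard library, whose `statistics.quantiles` function offers two interpolation methods. The data below are made up for illustration.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Two interpolation conventions for the quartiles of the same data.
q_exclusive = statistics.quantiles(data, n=4, method="exclusive")
q_inclusive = statistics.quantiles(data, n=4, method="inclusive")

print("exclusive:", q_exclusive)
print("inclusive:", q_inclusive)
print("median:   ", statistics.median(data))
```

The two methods agree on the median here but give different first and third quartiles, so it is worth checking which convention a package uses before comparing results.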


I.24

Sample Median and Percentiles

For our dataset we have that the median is 26. This value does not change if we remove the two entries valued 99. The sample median is a measure of location that is robust to outliers, unlike the sample mean.

However, the median seems to also discard a lot of information in comparison with the sample mean. A compromise between the two is the trimmed mean

Definition: 10% Trimmed Mean

In our example
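A sketch of a trimmed mean, under the assumption that "10% trimmed" means discarding the smallest 10% and the largest 10% of the observations before averaging (some texts split the 10% across the two ends instead). The data are hypothetical, with one 99 playing the role of an outlier.

```python
import statistics

def trimmed_mean(xs, fraction=0.10):
    """Mean after dropping `fraction` of the data from each end."""
    ordered = sorted(xs)
    k = int(fraction * len(ordered))  # observations dropped per end
    kept = ordered[k:len(ordered) - k] if k > 0 else ordered
    return statistics.fmean(kept)

data = [22, 23, 25, 25, 27, 27, 30, 33, 36, 99]  # hypothetical ages

print("mean:        ", statistics.fmean(data))
print("median:      ", statistics.median(data))
print("trimmed mean:", trimmed_mean(data))
```

The trimmed mean lands between the median and the plain mean: it ignores the extreme entry but still averages most of the data.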


I.25

Graphical Representations

Especially for large datasets, graphical representations are often much more (qualitatively) informative than numerical summaries. Perhaps the simplest graphical representation is the scatter-plot (baby weight, in grams)

It is sometimes convenient to jitter the points (add small random offsets), so it is easier to see what’s going on…

[Figure: scatter-plots of baby weight, without and with jitter; axis from 1500 to 5000 grams]


I.26

Histograms

Scatterplots are still a bit difficult to read. We can get a better view by aggregating the data into bins

[Figure: scatter-plot of baby weight (1500 to 5000 grams) and the corresponding “Histogram of x”; vertical axis: Frequency]


I.27

Histograms – Choice of Binning

The choice of the number of bins is a tricky business…

Too few !!!

Too many !!!

“Just right”!!!

There are rules-of-thumb for the number of bins that most software will use… You don’t need to worry too much (yet)...
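Two such rules of thumb are Sturges' rule and the square-root rule. The sketch below only illustrates how they scale with the sample size; software typically adjusts the resulting number further (for instance to get round bin edges), so actual plots may use a different count.

```python
import math

def sturges_bins(n):
    """Sturges' rule: ceil(log2(n)) + 1 bins for n observations."""
    return math.ceil(math.log2(n)) + 1

def sqrt_bins(n):
    """Square-root rule: ceil(sqrt(n)) bins for n observations."""
    return math.ceil(math.sqrt(n))

n = 1236  # number of valid entries in the birth dataset
print("Sturges:    ", sturges_bins(n))
print("square root:", sqrt_bins(n))
```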

[Figure: three versions of “Histogram of x” (Frequency against x, baby weight in grams) illustrating too few bins, too many bins, and a suitable number of bins]


I.28

Histograms

Actually, if the data can be viewed as independent samples from some continuous distribution, the histogram (after proper normalization) can be interpreted as an estimate of the true underlying density function !!!
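The normalization in question divides each bin count by the number of observations times the bin width, so that the bars enclose a total area of one. A minimal sketch, with hypothetical weights and bin edges:

```python
def density_histogram(xs, edges):
    """Bar heights count / (n * width); the last bin includes its right edge."""
    n = len(xs)
    heights = []
    for left, right in zip(edges[:-1], edges[1:]):
        is_last = right == edges[-1]
        count = sum(1 for x in xs
                    if left <= x < right or (is_last and x == right))
        heights.append(count / (n * (right - left)))
    return heights

# Hypothetical baby weights (grams) and bin edges.
weights = [2500, 2900, 3100, 3300, 3400, 3600, 3700, 4100]
edges = [2000, 3000, 4000, 5000]

heights = density_histogram(weights, edges)
area = sum(h * (r - l) for h, l, r in zip(heights, edges[:-1], edges[1:]))
print(heights, area)
```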

Baby weight: this histogram has a bell-like shape. Is it reasonable to model baby weight as a sample from a normal distribution?

[Figure: “Histogram of y” (baby weight, 1500 to 5000 grams) on a density scale]


I.29

Density Estimators

Histograms are actually a very crude density estimator. There are much better alternatives, like kernel-based estimators

The principle behind all these estimators is still the same – locally averaging data. However, these can be much more accurate than the histogram.
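A minimal sketch of a kernel-based estimator, assuming a Gaussian kernel and a bandwidth picked by hand: every observation contributes a small bump, and the estimate at a point is the average of the bumps there. The data and the bandwidth are hypothetical.

```python
import math

def gaussian_kde(xs, bandwidth):
    """Return a function t -> estimated density at t (Gaussian kernel)."""
    n = len(xs)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(t):
        return norm * sum(math.exp(-0.5 * ((t - x) / bandwidth) ** 2)
                          for x in xs)

    return density

# Hypothetical baby weights (grams) and a hand-picked bandwidth.
weights = [2500, 2900, 3100, 3300, 3400, 3600, 3700, 4100]
f_hat = gaussian_kde(weights, bandwidth=300.0)

# Riemann-sum check that the estimate integrates to about one.
step = 10.0
area = sum(f_hat(i * step) for i in range(0, 701)) * step
print(f_hat(3300.0), area)
```

The bandwidth plays the role that the bin width plays for the histogram: too small gives a spiky estimate, too large smooths away real structure.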

[Figure: kernel density estimate of y, titled “density.default(x = y, n = 50000)”; N = 1236, Bandwidth = 102; vertical axis: Density]


I.30

Box and Whisker Plots

These are funny looking plots that give a nice graphical representation of the (mother’s age) data…

[Figure: box plot of mother’s age, axis from 15 to 45, with the following annotations]
•  Smallest Value
•  First Quartile: 25% of the data is below this
•  Sample Median: half the data is below this
•  Third Quartile: 75% of the data is below this
•  Largest Value

Actually, box-plots are generally a bit more sophisticated…


I.31

Box and Whisker Plots

Box and whisker plots are usually presented using the following rules:

[Figure: box plot, axis from 15 to 45, with the following annotations]
•  First Quartile, Sample Median, Third Quartile
•  InterQuartile Range (IQR)
•  Whisker extends to the smallest point within 1.5 IQR
•  Whisker extends to the largest point within 1.5 IQR
•  Outliers

These plots are easy to understand, and are therefore quite useful; we can even compare different datasets easily…

Why 1.5?
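The rules above can be sketched as follows. The quartile convention (the "inclusive" method of Python's `statistics.quantiles`) is an assumption, and plotting packages may compute quartiles slightly differently; the data are made up.

```python
import statistics

def boxplot_summary(xs, k=1.5):
    """Quartiles, whisker ends and outliers using the k * IQR rule."""
    data = sorted(xs)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": min(inside),    # smallest point within the fence
        "whisker_high": max(inside),   # largest point within the fence
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }

summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
print(summary)
```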


I.32

Multiple Box Plots

Birth-weight data (comparison between smoking and non-smoking mothers):

[Figure: “Comparative boxplots of Birth Weight” for non-smoking and smoking mothers; axis from 1500 to 5000]

Does this mean smoking is bad for you???


I.33

Time-Sequence Plots

Sometimes the order of the data matters !!!

Example: PSI20 financial data over (01/07/2011 - 29/06/2012)

[Figure: time-sequence plot of x against Index (0 to 250), together with the “Histogram of x” (Frequency) and a box plot of x; values range from about 4000 to 7500]

There is clearly a temporal trend that is completely ignored in the histogram or boxplot representations!!! The order of the data really matters…


I.34

PSI20 Example

We can, however, look at the daily returns instead…

[Figure: “PSI20 returns” r against Index (0 to 250), together with the “Histogram of r” (Frequency) and a box plot of r; values range from about -0.06 to 0.04]

There seems to be much less of a temporal trend on the returns, so histograms and box-plots are potentially useful representations of the data…
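How the daily returns are defined is not spelled out here. Two common conventions are the simple return (relative change from one day to the next) and the log return; the sketch below computes both for a hypothetical series of index levels.

```python
import math

def simple_returns(prices):
    """Relative change between consecutive observations."""
    return [(today - prev) / prev
            for prev, today in zip(prices[:-1], prices[1:])]

def log_returns(prices):
    """Logarithm of the ratio of consecutive observations."""
    return [math.log(today / prev)
            for prev, today in zip(prices[:-1], prices[1:])]

prices = [6000.0, 6060.0, 5999.4]  # hypothetical index levels

print(simple_returns(prices))
print(log_returns(prices))
```

For small day-to-day changes the two conventions give nearly the same numbers, so the qualitative picture in the plots does not depend much on the choice.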

The choice of Statistical Model is already important for description of the data!!!


I.35

Quantile-Quantile Plots

These are part of a general class of qualitative plots that are meant to help you assess some properties of the data, namely whether the data can be reasonably modeled by independent samples from some distribution… Let’s recall some concepts from a few slides back:

Definition: Order Statistics


I.36

Quantile-Quantile Plots

We can compare the order statistics to the values we would expect for some distribution (e.g. a normal distribution).

So this gives an easy visual way to check if assuming normality is somewhat reasonable…
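A sketch of the computation behind a normal Q-Q plot: sort the sample and pair the i-th smallest value with the standard normal quantile at probability (i - 0.5)/n. These plotting positions are one common choice and are an assumption here. The correlation between the two coordinates is used only as a crude numerical stand-in for "the points lie close to a straight line".

```python
import math
import random
import statistics

def normal_qq_points(xs):
    """Pair sorted data with standard normal quantiles at (i - 0.5)/n."""
    n = len(xs)
    std_normal = statistics.NormalDist()
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theoretical, sorted(xs)

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    mu, mv = statistics.fmean(u), statistics.fmean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

rng = random.Random(0)  # fixed seed so the experiment is repeatable
normal_sample = [rng.gauss(0, 1) for _ in range(500)]
expo_sample = [rng.expovariate(1.0) for _ in range(500)]

r_normal = pearson(*normal_qq_points(normal_sample))
r_expo = pearson(*normal_qq_points(expo_sample))
print(r_normal, r_expo)
```

Plotting the sorted data against the theoretical quantiles gives the Q-Q plot itself; synthetic normal data should hug a straight line more closely than exponential data.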


I.37

Normal Quantile-Quantile Plots

Example: PSI20 Daily returns

[Figure: “Normal Q-Q Plot” (Sample Quantiles against Theoretical Quantiles) and “Histogram of r” (Density)]

Daily Returns seem to be reasonably modeled by a normal Distribution !!!


I.38

Normal Quantile-Quantile Plots

Example: Synthetic data – normal distribution (sanity check)

[Figure: “Normal Q-Q Plot” (Sample Quantiles against Theoretical Quantiles) and “Histogram of r” (Density)]


I.39

Normal Quantile-Quantile Plots

Example: Synthetic data – exponential distribution

Everything seems to make sense here…

[Figure: “Normal Q-Q Plot” (Sample Quantiles against Theoretical Quantiles) and “Histogram of r” (Density); annotation: Too many points away from the line]


I.40

Quantile-Quantile Plots

Summary: normal QQ plots give us a qualitative way to check if data can be reasonably modeled by a normal distribution. If most points lie approximately on a straight line then the normal modeling assumption might be reasonable – otherwise it is doubtful.

[Figure: PSI20 Daily returns: “Normal Q-Q Plot” (Sample Quantiles against Theoretical Quantiles) and “Histogram of r” (Density)]

[Figure: Birth weight: “Normal Q-Q Plot” (Sample Quantiles against Theoretical Quantiles)]

Too many points away from the line


I.41

Quantile-Quantile Plots

[Figure: “Normal Q-Q Plot”; Sample Quantiles (about 1500 to 4500) against Theoretical Quantiles (-3 to 3)]


I.42

What’s Next

Now that we can summarize and represent data in nice ways we would like to make meaningful statements about the population that gave rise to this data. For this we need to make some assumptions, leading to the notion of a Statistical Model. In this course we will focus mainly on one type of statistical model. However, going beyond this model will not be hard given the foundational knowledge you’ll develop.

Important!!! All models are wrong… …but some are useful.

(George E.P. Box)