
Probability and Statistics

Lecturer: Charusluk Viphavakit, PhD

ISE, Chulalongkorn University, 2nd/2018

Email: [email protected]

Website: https://charuslukv.wordpress.com

2141491 Research Methodology

Introduction to statistical analysis

To prove observations and conclusions ‘beyond reasonable doubt’.

Most engineering research is numerically based and numbers are a prime outcome.

Some research is qualitative, so a method of analysis is required to convert the qualitative results to numbers and then to use statistics to deduce the reliability of the outcomes and the conclusions.

Measurement system

A measurement system must be quantified in terms of:

• Linear response
• Dynamic range
• Resolution
• Sensitivity
• Accuracy (errors)

Dynamic range

Thiel, David V. Research methods for engineers. Cambridge University Press, 2014.

A finite range of operation. A linear approximation to the response of a measurement.

The minimum value obtained from a measurement system can be limited by such effects as the noise level in the instrumentation.

The maximum range can be limited by the maximum input and output voltages of the instrument.

Dynamic range

What if the input is less than the range minimum?

Dynamic range

What if the input range is exceeded?

Resolution

The smallest change in input that can be detected at the output.

In a digital system the resolution is assumed to be linear when measurements are made within the range of the instrument.

A fixed-range four-digit digital voltmeter has a maximum readout of 1.999 V. If the signal being measured changes by less than 1 mV, then no change will be observed in the display.

What is the resolution of the voltmeter?

Sensitivity

The change in output divided by a small change in input. If the system has a linear response, then the sensitivity of the system is a constant over the working range.
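As a minimal sketch of this definition, the snippet below estimates the local sensitivity between consecutive calibration points and checks whether it is constant. The calibration values, units, and the names `sensitivities`, `inputs_v` and `outputs_mv` are hypothetical and chosen only for illustration.

```python
def sensitivities(inputs, outputs):
    """Local sensitivity (change in output / change in input) between consecutive points."""
    return [
        (outputs[k + 1] - outputs[k]) / (inputs[k + 1] - inputs[k])
        for k in range(len(inputs) - 1)
    ]

# Hypothetical calibration data: input in volts, output in millivolts.
inputs_v = [0.0, 1.0, 2.0, 3.0]
outputs_mv = [5.0, 25.0, 45.0, 65.0]

local = sensitivities(inputs_v, outputs_mv)

# A linear response gives the same sensitivity over the whole working range.
is_linear = max(local) - min(local) < 1e-9
print(local, is_linear)
```

If the local values differ noticeably across the range, the response is not linear and a single sensitivity figure would be misleading.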

What is the sensitivity of this system?

Accuracy

The accuracy of a measurement system is described in terms of the error.

Errors

• Randomly occurring errors: cannot be controlled.
• Systematic errors: can be controlled, usually through calibration. Occur due to the limitations of the measurement system.

Accuracy

The accuracy of a measurement system is described in terms of the error.

Systematic Errors

• Intrinsic problems (instrumentation): calibration problems, nonlinear effects, imperfect components, etc.
• External influences (interference): temperature, humidity, materials, etc.

Accuracy

The accuracy of a measurement system is described in terms of the error.

Systematic Errors

Reduce intrinsic problems (instrumentation) and reduce external influences (interference) through:

• Improved calibration
• Reduce coupling and mechanism
• Use of standard materials and components
• The best cleaning efforts
• Environmental control

Sample selection

If more than one sample is tested, it is possible that the samples under test might have slight differences in the properties being measured.

The method of selecting samples can play a major role in the measurements obtained.

A clear scientific method of obtaining representative samples becomes important.

One-dimensional statistics

To gain one piece of information only – a single number.

A single number with an error marking the upper and lower bounds constitutes a one-dimensional measurement.

Commonly represented by a histogram.

One-dimensional statistics

A common measure in statistics is the 5% probability estimate ‘rule of thumb’.

95% of the measurements will lie within a range of values centred on the mean value, or, equivalently, 5% of all measured values will lie outside this range.

One-dimensional statistics

Assuming a normal (random) distribution about the mean value 𝜇.

Less than 5% of the measurements will lie outside the range of plus or minus two standard deviations 𝜎 away from the mean.

On average, 2.5% of the population will have values greater than 𝜇 + 2𝜎 (the upper tail of the population) and 2.5% of the population will have values smaller than 𝜇 − 2𝜎 (the lower tail of the population).
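The two-standard-deviation rule can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module on a standard normal distribution; the variable names are illustrative only.

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1).
standard = NormalDist(mu=0.0, sigma=1.0)

# Probability of lying above +2 sigma and below -2 sigma.
upper_tail = 1.0 - standard.cdf(2.0)
lower_tail = standard.cdf(-2.0)

outside = upper_tail + lower_tail
inside = 1.0 - outside
print(f"outside: {outside:.4f}, inside: {inside:.4f}")
```

The fraction outside the range comes out slightly below 5%, which is why the 2.5% per tail quoted above is a convenient approximation rather than an exact figure.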

One-dimensional statistics

What is the mean value and standard deviation of this figure?

One-dimensional statistics

The levels of uncertainty based on measurement error and probabilities related to randomly distributed values are named random errors.

To define the random errors, the mean 𝜇 and standard deviation 𝜎 of the total population (𝑁) have to be calculated.

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$

where 𝑥𝑖 is the 𝑖th value of the population

One-dimensional statistics

As it is rarely possible to make measurements on every member of the population, and no population can be infinite in size, 𝜇 and 𝜎 are approximated using a smaller (hopefully representative) sample of the population, called the sample population.

If there are 𝑛 measurements, the sample population has 𝑛 members and the mean ($\bar{x}$) and standard deviation (𝑆) of the sample population are given by

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

$$S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

where 𝑥𝑖 is the 𝑖th value of the sample population
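The only difference between the population and sample formulas is the divisor (𝑁 versus 𝑛 − 1). A minimal sketch, with a made-up data set and illustrative function names, shows the effect:

```python
import math

def population_stats(values):
    """Mean and standard deviation with divisor N (whole population)."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    return mu, sigma

def sample_stats(values):
    """Mean and standard deviation with divisor n - 1 (sample of a population)."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return mean, s

# Hypothetical measurements, used only to exercise the formulas.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mu, sigma = population_stats(data)
x_bar, s = sample_stats(data)
print(mu, sigma, x_bar, s)
```

The sample standard deviation is always slightly larger than the population value computed from the same numbers, and the gap shrinks as 𝑛 grows.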

z-score

The normal distribution shown earlier has been scaled such that 𝜇 = 0 and 𝜎 = 1. However, the mean value of a population is not usually zero and the standard deviation is not usually unity.

In order to calculate the probability of obtaining particular values in a normally distributed data set, it is common to transform that data set.

$$z = \frac{x - \mu}{\sigma}$$

The 𝑧 value (called the z-score) representation of the 𝑥 value now has a normal distribution with 𝜇 = 0 and 𝜎 = 1.

The conversion from a z-score to a probability and the reverse can be calculated in programs such as MATLAB (zscore and normcdf) and Excel (normsinv).

z-score

The normal distribution is symmetrical about the mean: 50% of the values lie above the mean and 50% lie below.

A one-tailed probability estimate is half the probability of a two-tailed probability estimate.

What is the probability that a value of 𝑥 = 7.76 is part of a previously measured data set where the mean value was determined to be 6.2 and the standard deviation was determined to be 2.1?
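One way to approach this question is to convert 𝑥 to a z-score and read the tail probability from the standard normal distribution. The sketch below does this with Python's standard `statistics.NormalDist`; treating the answer as a one-tailed or two-tailed probability is an interpretation choice that the slide leaves open, so both are computed.

```python
from statistics import NormalDist

# Values taken from the question above.
x, mean, std = 7.76, 6.2, 2.1

# z-score: distance from the mean in units of standard deviation.
z = (x - mean) / std

# Probability of a value at least this far above the mean (one-tailed),
# and at least this far from the mean in either direction (two-tailed).
one_tailed = 1.0 - NormalDist().cdf(z)
two_tailed = 2.0 * one_tailed

print(f"z = {z:.3f}, one-tailed = {one_tailed:.3f}, two-tailed = {two_tailed:.3f}")
```

Here 𝑧 is about 0.74, giving a one-tailed probability of roughly 0.23, so a value of 7.76 is not unusual for this data set.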

Skewness

There are many situations where the normal distribution cannot be used for probability calculations.

The mean value is very close to zero and negative values are not possible in the data set.

The distribution is skewed about the mean. This is defined numerically as the skewness of the population.

The 𝜇 ± 2𝜎 range of values does not contribute 95% of the probability. This is numerically defined as the kurtosis of the population.

Skewness

$$\text{skewness} = \frac{\sum_{i=1}^{N} (x_i - \mu)^3}{(N - 1)\sigma^3}$$

Negative value: the distribution has a longer tail toward the lower values.

Positive value: the distribution has a longer tail toward the higher values.

0: Symmetric distribution (e.g. the normal distribution)

$$\text{kurtosis: } k = \frac{\sum_{i=1}^{N} (x_i - \mu)^4}{(N - 1)\sigma^4}$$

𝑘 ≅ 3: Normal distribution

The skewness and kurtosis can be calculated in programs such as MATLAB (skewness and kurtosis) and Excel (SKEW and KURT).
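The two formulas can be sketched directly in Python. This follows the expressions on this slide, with the (𝑁 − 1) divisor and the population 𝜎 defined earlier; library functions may use different normalisations (Excel's KURT, for instance, reports excess kurtosis relative to the normal distribution), so results need not match exactly. The data sets and the function name `moments` are made up for illustration.

```python
import math

def moments(values):
    """Skewness and kurtosis using the (N - 1) divisor from the slide formulas."""
    n = len(values)
    mu = sum(values) / n
    # Assumption: sigma is the population standard deviation (divisor N).
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    skewness = sum((x - mu) ** 3 for x in values) / ((n - 1) * sigma ** 3)
    kurtosis = sum((x - mu) ** 4 for x in values) / ((n - 1) * sigma ** 4)
    return skewness, kurtosis

symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]       # evenly spread about the mean
right_tailed = [1.0, 1.0, 1.0, 2.0, 10.0]   # one large value stretches the upper tail

skew_sym, kurt_sym = moments(symmetric)
skew_right, kurt_right = moments(right_tailed)
print(skew_sym, kurt_sym, skew_right, kurt_right)
```

The symmetric set gives zero skewness, while the set with one large outlier gives a positive value.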

Combining errors and uncertainties

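If two values are to be added or subtracted: 𝑥𝑖 ± 2𝜎𝑖 and 𝑥𝑗 ± 2𝜎𝑗, then the result 𝑦 is given by

$$y = x_i \pm x_j \pm 2\left(\sigma_i^2 + \sigma_j^2\right)^{1/2}$$

Note that all of the units must be the same.

If two values are divided or multiplied: 𝑥𝑖 ± 2𝜎𝑖 and 𝑥𝑗 ± 2𝜎𝑗, then the result 𝑦 is given by

$$y = x_i x_j \pm 2\, x_i x_j \sqrt{\left(\frac{\sigma_i}{x_i}\right)^2 + \left(\frac{\sigma_j}{x_j}\right)^2}$$

(for division, replace the product 𝑥𝑖𝑥𝑗 by the quotient 𝑥𝑖/𝑥𝑗).

The units of 𝑦 and 𝑥𝑖, 𝑥𝑗 do not have to be identical.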
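A short sketch of both rules follows. The function names and the input values are hypothetical; each function returns the combined value together with its ±2𝜎 limit.

```python
import math

def add_or_subtract(x_i, sigma_i, x_j, sigma_j, subtract=False):
    """Combine two values by addition or subtraction; returns (y, 2-sigma limit)."""
    y = x_i - x_j if subtract else x_i + x_j
    limit = 2.0 * math.sqrt(sigma_i ** 2 + sigma_j ** 2)
    return y, limit

def multiply_or_divide(x_i, sigma_i, x_j, sigma_j, divide=False):
    """Combine two values by multiplication or division; returns (y, 2-sigma limit)."""
    y = x_i / x_j if divide else x_i * x_j
    # Relative uncertainties add in quadrature.
    relative = math.sqrt((sigma_i / x_i) ** 2 + (sigma_j / x_j) ** 2)
    return y, 2.0 * abs(y) * relative

# Hypothetical measurements: 10 with sigma 0.3, and 5 with sigma 0.4.
total, total_limit = add_or_subtract(10.0, 0.3, 5.0, 0.4)
product, product_limit = multiply_or_divide(10.0, 0.3, 5.0, 0.4)
print(f"{total} ± {total_limit:.2f}, {product} ± {product_limit:.2f}")
```

Absolute uncertainties combine for sums and differences, while relative uncertainties combine for products and quotients, which is why the units only have to match in the first case.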

Combining errors and uncertainties

If the mathematical operation is nonlinear, then an input number with equal limits will result in an output number with unequal limits.

Determine 𝐴 = sin 𝜃 where 𝜃 = 50 ± 5 degrees

𝜃 = 45, 50, 55: the upper limit is sin 55 − sin 50 and the lower limit is sin 50 − sin 45

$$A = 0.766^{+0.053}_{-0.059}$$

Determine 𝐵 = log 𝑑 where 𝑑 = 1.5 ± 0.3

𝑑 = 1.2, 1.5, 1.8
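The same three-point evaluation can be sketched for any function that is monotonic over the input range. The helper `asymmetric_limits` is a hypothetical name, and the base of the logarithm is not stated on the slide, so base 10 is assumed here.

```python
import math

def asymmetric_limits(func, value, half_width):
    """Evaluate func at value and at value ± half_width; return (centre, upper, lower).

    Assumes func is monotonic between value - half_width and value + half_width.
    """
    centre = func(value)
    a = func(value + half_width)
    b = func(value - half_width)
    upper = max(a, b) - centre
    lower = centre - min(a, b)
    return centre, upper, lower

def sin_deg(angle_deg):
    """Sine of an angle given in degrees."""
    return math.sin(math.radians(angle_deg))

A, A_up, A_low = asymmetric_limits(sin_deg, 50.0, 5.0)

# Assumption: log means log base 10.
B, B_up, B_low = asymmetric_limits(math.log10, 1.5, 0.3)

print(f"A = {A:.3f} +{A_up:.3f} / -{A_low:.3f}")
print(f"B = {B:.3f} +{B_up:.3f} / -{B_low:.3f}")
```

For both functions the lower limit is larger than the upper limit, because each curve flattens as its argument increases.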

t test

It is important to know if two populations are likely to be sample populations selected from the same global population.

This type of one-dimensional question can be addressed using a t test.

The test is most applicable when the standard deviations are very large in comparison to the changes or differences between the two mean values.

t test

There are two main types of t test:

The paired t test

• The same population is tested twice to determine if there has been a change in the overall population.
• It is a method of determining if there is a statistically significant change in the population after an intervention.
• A simple mean and standard deviation calculation will not show a significant change if the change is likely to be significantly smaller than the standard deviation measure of the population.

t test

There are two main types of t test:

The unpaired t test

• Two different populations are measured to determine if there is a difference between the two populations.
• The two populations are unrelated and the number of samples can be different in the two sample sets.

The t test can be conveniently evaluated using the Excel function (ttest-paired and ttest-unpaired). In MATLAB the functions are (ttest and ttest2) for the paired and unpaired data sets respectively.
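To show what these functions compute, the sketch below evaluates the t statistic and degrees of freedom for both test types using only Python's standard library. The weights are made up, the unpaired version assumes equal variances in the two populations (a pooled estimate), and converting t to a probability needs the t distribution, which is available in packages such as SciPy (`scipy.stats.ttest_rel` and `scipy.stats.ttest_ind`) but is not reproduced here.

```python
import math
from statistics import mean, stdev, variance

def paired_t(before, after):
    """t statistic and degrees of freedom for paired measurements."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

def unpaired_t(a, b):
    """t statistic and degrees of freedom for two unrelated samples.

    Assumption: both populations share the same variance (pooled estimate).
    """
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2

# Hypothetical weights (kg) of the same five people before and after a programme.
before = [82.0, 90.0, 77.0, 95.0, 68.0]
after = [80.0, 87.0, 77.0, 91.0, 66.0]
t_paired, df_paired = paired_t(before, after)

# Hypothetical measurements from two unrelated groups of different size.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0]
t_unpaired, df_unpaired = unpaired_t(group_a, group_b)

print(t_paired, df_paired, t_unpaired, df_unpaired)
```

In the paired example the mean change (2.2 kg) is small compared with the spread of weights between people, yet the t statistic is large because only the per-person differences enter the calculation.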

t test

A group of 20 aluminium poles of different sizes is weighed immediately following manufacture. The same poles are weighed after six months' exposure to the environment.

A group of 25 employees is weighed. A fitness trainer is asked to work with the group to lose weight. Six months later the same group of people is weighed again.

The concrete strength tests in Ghana high rise buildings need to meet an international specification. A set of measurements from a sample in England was used for comparison. What is the probability that these two sets of measurements are identical?

Populations of 25 ten-year-old girls in Sweden and 30 ten-year-old girls in Denmark are weighed. The objective is to see if there is a difference in weight between the two populations.

ANOVA statistics

One-dimensional statistical methods were outlined.

Two different populations were compared using the t test.

If there are more than two dependent data sets, then the previous techniques are inadequate.

The ANOVA statistical methods allow the calculation of probability estimates for three or more data sets.

As with the t test, the ANOVA statistical methods can determine statistically significant differences when the standard deviations in the parameters are much larger than the difference between the populations.
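As an illustration of the idea, the sketch below computes the F statistic of a one-way ANOVA: the variance between group means divided by the variance within groups. This covers only the single-factor case, the data are invented, and the function name is hypothetical; converting F to a probability requires the F distribution, which is left to statistical software.

```python
def one_way_anova_f(groups):
    """F statistic and degrees of freedom for a one-way (single-factor) ANOVA."""
    all_values = [x for group in groups for x in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variation of the group means about the grand mean.
    ss_between = sum(
        len(group) * (g_mean - grand_mean) ** 2
        for group, g_mean in zip(groups, group_means)
    )
    # Variation of individual values about their own group mean.
    ss_within = sum(
        (x - g_mean) ** 2
        for group, g_mean in zip(groups, group_means)
        for x in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Three hypothetical data sets.
data_sets = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_between, df_within = one_way_anova_f(data_sets)
print(f_stat, df_between, df_within)
```

A large F value indicates that the group means differ by more than the scatter within the groups would explain.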

ANOVA statistics

The ANOVA test can be simply evaluated using the Excel functions (ANOVA: two-factor with replication) and (ANOVA: two-factor without replication). In MATLAB the functions are anova1 and anova2.

Populations of 25 ten-year-old girls in Sweden and 30 ten-year-old girls in Denmark are weighed and their height is also measured every year over a 20 year period. The objective is to see if there is a relationship between height and weight between the two populations over time.

The probability that there is a difference between the two populations tracked over the years can be established using a two-factor ANOVA test.

Two-dimensional statistics

A particular one-dimensional measurement might be dependent on the time. This implies that the mean value of the population might continue to change as time passes. This requires a further analysis based on a two-dimensional approach to the problem.


A variable is measured as a function of another variable (e.g. time).

Commonly represented by a line graph or scatter plot.

Two-dimensional statistics

The method of statistical analysis is called correlation (least squares regression analysis or correlation analysis).

An initial linear fit can be used to determine if there is a relationship between the measured parameter and time.

The river height ℎ is a linear function of time 𝑡. This can be written as

$$h = m t + h_0$$

where ℎ0 is the river height at 𝑡 = 0 and 𝑚 is the rate of change of the height with time (the slope of the line).
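A least squares fit of this straight line can be sketched in a few lines. The river-height readings below are invented, and `linear_fit` is an illustrative helper rather than a library function.

```python
def linear_fit(t, h):
    """Least squares estimates of slope m and intercept h0 for h = m*t + h0."""
    n = len(t)
    t_mean = sum(t) / n
    h_mean = sum(h) / n
    covariance = sum((ti - t_mean) * (hi - h_mean) for ti, hi in zip(t, h))
    spread = sum((ti - t_mean) ** 2 for ti in t)
    m = covariance / spread
    h0 = h_mean - m * t_mean
    return m, h0

# Hypothetical readings: time in hours, river height in metres.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
heights = [1.0, 1.5, 2.1, 2.4, 3.0]

m, h0 = linear_fit(times, heights)
predicted = m * 5.0 + h0  # extrapolated height at t = 5 hours
print(f"m = {m:.3f} m/h, h0 = {h0:.3f} m, h(5) = {predicted:.3f} m")
```

The fit minimises the sum of squared vertical distances between the readings and the line, which is the quantity the correlation coefficient later uses to judge the quality of the fit.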

Pearson correlation coefficient

The correlation coefficient 𝑟 (sometimes referred to as the Pearson correlation coefficient) is a statistical parameter used to evaluate the quality of the linear relationship.

$$r = \frac{n\sum_{i=1}^{n} t_i h_i - \sum_{i=1}^{n} t_i \sum_{i=1}^{n} h_i}{\sqrt{\left[n\sum_{i=1}^{n} h_i^2 - \left(\sum_{i=1}^{n} h_i\right)^2\right]\left[n\sum_{i=1}^{n} t_i^2 - \left(\sum_{i=1}^{n} t_i\right)^2\right]}}, \qquad r^2 = 1 - \frac{Q_R}{Q_M}$$

and

$$Q_R = \sum_{i=1}^{n} \left(h_i - (m t_i + h_0)\right)^2$$

$$Q_M = \sum_{i=1}^{n} \left(h_i - \bar{h}\right)^2$$

where 𝑡𝑖 are the time values corresponding to the measurements ℎ𝑖, 𝑄𝑅 is the sum of the differences squared between the measured values ℎ𝑖 and the fitted straight line defined by ℎ = 𝑚𝑡 + ℎ0, and 𝑄𝑀 is the sum of the differences squared between the measured values ℎ𝑖 and the mean of the measured values $\bar{h}$.
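The two routes to the correlation coefficient can be compared numerically. The sketch below computes 𝑟 from the sums of products and, separately, 1 − 𝑄𝑅/𝑄𝑀 from a least squares fit, then checks that the second equals 𝑟². The data and helper names are hypothetical.

```python
import math

def linear_fit(t, h):
    """Least squares slope m and intercept h0 for h = m*t + h0."""
    n = len(t)
    t_mean, h_mean = sum(t) / n, sum(h) / n
    m = (sum((ti - t_mean) * (hi - h_mean) for ti, hi in zip(t, h))
         / sum((ti - t_mean) ** 2 for ti in t))
    return m, h_mean - m * t_mean

def pearson_r(t, h):
    """Pearson correlation coefficient from the sums-of-products formula."""
    n = len(t)
    sum_t, sum_h = sum(t), sum(h)
    sum_th = sum(ti * hi for ti, hi in zip(t, h))
    sum_tt = sum(ti * ti for ti in t)
    sum_hh = sum(hi * hi for hi in h)
    numerator = n * sum_th - sum_t * sum_h
    denominator = math.sqrt((n * sum_hh - sum_h ** 2) * (n * sum_tt - sum_t ** 2))
    return numerator / denominator

# Hypothetical readings: time in hours, river height in metres.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
heights = [1.0, 1.5, 2.1, 2.4, 3.0]

r = pearson_r(times, heights)

m, h0 = linear_fit(times, heights)
h_mean = sum(heights) / len(heights)
q_r = sum((hi - (m * ti + h0)) ** 2 for ti, hi in zip(times, heights))  # residual sum of squares
q_m = sum((hi - h_mean) ** 2 for hi in heights)                         # total sum of squares

print(f"r = {r:.4f}, r^2 = {r ** 2:.4f}, 1 - QR/QM = {1 - q_r / q_m:.4f}")
```

The agreement holds only when 𝑚 and ℎ0 come from the least squares fit; any other line gives a larger 𝑄𝑅.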

Pearson correlation coefficient

When 𝑟 = 1, the correlation is perfect and no random error is present in the data. The parameter 𝑥 increases linearly with time.

When 𝑟 = −1, the correlation is perfect and no random error is present in the data. The parameter 𝑥 decreases linearly with time.

When 𝑟 ≅ 1, the data are said to be highly correlated or strongly correlated with time.

When 𝑟 ≅ 0, the data are said to be uncorrelated and there is no relationship between the measurements and the time at which they are taken.

There are many computer programs which calculate 𝑟. In Excel the function is correl and in MATLAB it is corrcoef.

Pearson correlation coefficient and Regression

When 𝑟 = 0.975, then 𝑟² = 0.95. This means that 95% of the variation in the measurements is accounted for by the linear relationship with time and 5% is random variation in the data.

Returning to the 5% probability interpretation, it is possible to say that if 𝑟 > 0.975 or 𝑟 < −0.975, then there is a strong linear relationship between the measurements 𝑥𝑖 and the time 𝑡𝑖 at which they were made.

Pearson correlation coefficient (𝑟)    Description*
0.00-0.29    Negligible correlation
0.30-0.49    Low correlation
0.50-0.69    Moderate correlation
0.70-0.89    High correlation
0.90-1.00    Very high correlation

*Cicchetti, Domenic V. "Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology." Psychological assessment 6.4 (1994): 284.

Three-dimensional statistics

A variable is measured as a function of other variables. (A third parameter is recorded during the measurements of the original parameter.)

Scatter plots on the same axes can still be used.

Assume that the additional parameter measured was temperature.

• The plot includes circled points where the temperature was relatively high when compared to the total data set.

$$y = a_0 + a_1 x_1 + a_2 x_2$$

Linear regression coefficient

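The multiple linear regression coefficient 𝑟𝑥𝑦 is still given by

$$r_{xy} = \sqrt{1 - \frac{Q_R}{Q_M}}$$

and

$$Q_R = \sum_{i=1}^{n} \left(y_i - (a_0 + a_1 x_{1i} + a_2 x_{2i})\right)^2$$

$$Q_M = \sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2$$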

Partial correlation coefficients

Consider an experiment in which three parameters are measured: 𝛼, 𝑉 and 𝑇.

If the two-dimensional correlation coefficients between the three parameters (𝑟𝛼𝑉, 𝑟𝛼𝑇, 𝑟𝑉𝑇) are calculated, it is possible to minimize the influence of one variable on the correlation between the other two parameters using the partial correlation coefficient 𝑟𝛼𝑇(𝑉) defined by

$$r_{\alpha T(V)} = \frac{r_{\alpha T} - r_{\alpha V}\, r_{VT}}{\sqrt{\left(1 - r_{\alpha V}^2\right)\left(1 - r_{VT}^2\right)}}$$

The above equation can be used to evaluate the correlation coefficient between 𝛼 and 𝑇 without the effect of 𝑉.
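The formula is straightforward to evaluate once the three pairwise coefficients are known. In the sketch below the coefficient values are invented and the function name is illustrative.

```python
import math

def partial_correlation(r_aT, r_aV, r_VT):
    """Correlation between alpha and T with the influence of V removed."""
    numerator = r_aT - r_aV * r_VT
    denominator = math.sqrt((1.0 - r_aV ** 2) * (1.0 - r_VT ** 2))
    return numerator / denominator

# Hypothetical pairwise correlation coefficients.
r_alpha_T = 0.8
r_alpha_V = 0.6
r_V_T = 0.5

r_partial = partial_correlation(r_alpha_T, r_alpha_V, r_V_T)
print(f"r_alphaT(V) = {r_partial:.4f}")
```

With these numbers the partial coefficient is lower than the raw value of 0.8, suggesting that part of the apparent relationship between 𝛼 and 𝑇 was carried by their shared dependence on 𝑉.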

Null hypothesis testing

The objective of research projects is to provide an answer to a carefully phrased research question.

The null hypothesis is not an exact complement of the hypothesis.

In statistics, the null hypothesis (𝐻0) is set up to be rejected, not accepted!

Null hypothesis testing

‘What is the maximum river height during this flood?’

Hypothesis can be ‘The river height increased by less than one metre during the period 8:55 to 9:18 am.’

Null hypothesis can be ‘The river height did not reach its maximum value during the period 8:55 to 9:18 am.’

The maximum river height was 1.63 m at 9:10 am and, following that time, the river height started to decrease.