statistics overview - engineering school class web sites · statistics overview biologists say,...
Statistics Overview
Biologists say, “If you need to use statistics, you don’t have enough data.”
Engineers say, “If you know enough about statistics, you don’t need much data.”
• Probability distribution
• Problems with statisticians’ notation
• Hypothesis testing
• Regression analysis
• Model fitting
• Outlier rejection
• Data presentation
• Experimental design
[Photo: Sir Ronald Aylmer Fisher (1890-1962)]
Probability Distribution Functions
If I make a measurement of a variable, how do I know how that sample relates to the mean?
[Figure: y(x) versus x; p[y(x)] versus y(x), with the mean µ marked]
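Normal (Gaussian) Probability
Probability density p that a value selected at random from a Gaussian distribution with mean µ and variance σ² will have value x
$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$
µ is the mean of the distribution, given by
$\mu = \int_{-\infty}^{\infty} x\,p(x)\,dx$
σ is the standard deviation of the distribution, given by
$\sigma^2 = \int_{-\infty}^{\infty} (x-\mu)^2\,p(x)\,dx$
Probability P that random variable X will fall between a and b
$P(a \le X \le b) = \int_a^b p(x)\,dx$
[Figure: Gaussian curves for µ = 0, σ = 1; µ = 0, σ = 2; µ = 0, σ = 3; µ = 4, σ = 1]
Central Limit Theorem
σ² is called the variance
µ and σ² are the first two moments of the PDF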
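As a rough illustration of the Gaussian density and of the probability that X falls between a and b, the sketch below uses `NormalDist` from Python's standard `statistics` module. The helper name `prob_between` and the choice of the µ = 0, σ = 1 case are assumptions made for the example, not part of the slides.

```python
from statistics import NormalDist

# Standard normal distribution: mu = 0, sigma = 1 (one of the plotted cases).
dist = NormalDist(mu=0.0, sigma=1.0)

# Density p(x) at the mean; equals 1 / (sigma * sqrt(2 * pi)).
density_at_mean = dist.pdf(0.0)


def prob_between(d, a, b):
    """P(a <= X <= b), computed as the difference of two CDF values."""
    return d.cdf(b) - d.cdf(a)


# prob_between is a hypothetical helper name used only in this sketch.
within_one_sigma = prob_between(dist, -1.0, 1.0)
within_two_sigma = prob_between(dist, -2.0, 2.0)

print(f"p(0) = {density_at_mean:.4f}")
print(f"P(-1 <= X <= 1) = {within_one_sigma:.4f}")
print(f"P(-2 <= X <= 2) = {within_two_sigma:.4f}")
```

The CDF difference is the integral of p(x) from a to b, so about 68% of samples land within one standard deviation of the mean and about 95% within two.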
Other Distributions
Continuous
• Normal (Gaussian), Cauchy, Chi-square, exponential, F, gamma, Laplace, log-normal, Pareto, Student’s t, uniform, Weibull, Beta
• Von Mises distribution - the independent variable is restricted to -π ≤ θ ≤ π (i.e. θ is an angle)
Discrete
• Bernoulli, binomial, discrete uniform, geometric, hypergeometric, negative binomial, Poisson
[Figure: oriented muscle cells]
Statistics Notation
What is written
Random variable x
Probability p(x)
Problem: x is really a dependent variable
How you should read it
Random variable y(x)
Probability p[y(x)]
Probability distributions are valid for one value of the independent variable only
Example
Measure reaction rate constant k at temperatures T1 and T2
Statisticians would note this as measuring random variable k, then compute the probability p(k)
Really, your measurements measured k(T1), so the implied probabilities p[k(T1)] only apply at T=T1
Example
[Figure, top panel: births in 2003 versus birth weight (lb), United States and Germany]
[Figure, bottom panel: normalized births in 2003, p[w(country, year)], versus birth weight (lb), United States and Germany; weight bins 0-1.1, 1.1-2.2, 2.2-3.3, 3.3-4.4, 4.4-5.5, 5.5-6.6, 6.6-7.7, 7.7-8.8, 8.8-9.9, 9.9-11.0, >11]
Data from data.un.org
Hypothesis Testing
Null Hypothesis H0
Assume that two dependent variables are drawn from distributions with the same mean μ
Test this hypothesis with t-test
• t-test gives the probability p of seeing a difference in sample means at least this large if H0 were true
• If p is small enough, the means are “different” and H0 is rejected
H0 : μ1 = μ2
[Figure: distribution with µ1 = 0, σ1 = 2 compared against distributions with µ2 = 0, σ2 = 1; µ2 = 1, σ2 = 1; µ2 = -5, σ2 = 1]
Testing the Hypothesis
(Student’s) T-test
Determine whether two sets of data come from “different” distributions
• “Different” = “there exists a statistically significant difference between the two”
• Statistical significance based on p value
– significantly different if p < α
– Usually, α = 0.05
• TTEST() in Excel (T.TEST() in newer versions)
• ttest() or ttest2() in MATLAB
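A minimal sketch of what such a test computes, written with only the Python standard library. It assumes a two-sample Student's t-test with equal variances; the p value is found by numerically integrating the t density with Simpson's rule. The function names and the measurements are made up for illustration and are not from the slides.

```python
import math
from statistics import mean, variance


def t_pdf(t, dof):
    """Density of Student's t distribution with dof degrees of freedom."""
    coef = math.gamma((dof + 1) / 2) / (
        math.sqrt(dof * math.pi) * math.gamma(dof / 2)
    )
    return coef * (1 + t * t / dof) ** (-(dof + 1) / 2)


def two_sided_p(t, dof, steps=2000):
    """Two-sided p value: 1 minus the area under the t density on [-|t|, |t|].

    The area on [0, |t|] is found with Simpson's rule (steps must be even).
    """
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, dof) + t_pdf(upper, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, dof)
    area_0_to_t = total * h / 3
    return max(0.0, 1.0 - 2.0 * area_0_to_t)


def pooled_t_test(a, b):
    """Two-sample Student's t-test assuming equal variances (an assumption)."""
    na, nb = len(a), len(b)
    dof = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / dof
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, dof, two_sided_p(t, dof)


# Made-up measurements, for illustration only.
sample_1 = [5.1, 4.9, 5.3, 5.0, 5.2]
sample_2 = [5.8, 6.1, 5.9, 6.0, 6.2]
alpha = 0.05

t_stat, dof, p_value = pooled_t_test(sample_1, sample_2)
print(f"t = {t_stat:.3f}, dof = {dof}, p = {p_value:.2e}")
print("significantly different" if p_value < alpha else "not significantly different")
```

In practice a library routine would replace the hand-written integration; the point here is only that the p value is a tail area of the t distribution, compared against α.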
ANalysis Of VAriance (ANOVA)
t-test for the case of more than one independent variable (e.g. y(x,t))
Result is again a p value telling you whether the independent variable makes a statistically significant difference in the dependent variables
Available in the Analysis ToolPak (Data Analysis tools) in Excel
T-Tests
http://www.socialresearchmethods.net/kb/stat_t.php
ANalysis Of VAriance (ANOVA)
See course manual p. 10-11 for a full description
Essentially a t-test for more than two samples
[Table: sum of squares and mean squares for the Treatment, Error/Residual, and Total rows]
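The treatment, error/residual, and total sums of squares can be sketched for the simplest case, a one-way ANOVA, as follows. This is an assumption-laden illustration: the function name `one_way_anova`, the dictionary keys, and the data are invented here, and the final step of converting F into a p value (which needs the F distribution) is left out.

```python
from statistics import mean


def one_way_anova(groups):
    """Return sums of squares, mean squares, and F for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)

    # Total: spread of every point about the grand mean.
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    # Treatment: spread of the group means about the grand mean.
    ss_treatment = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Error/residual: spread of each point about its own group mean.
    ss_error = sum((v - mean(g)) ** 2 for g in groups for v in g)

    ms_treatment = ss_treatment / (k - 1)
    ms_error = ss_error / (n - k)
    return {
        "ss_total": ss_total,
        "ss_treatment": ss_treatment,
        "ss_error": ss_error,
        "ms_treatment": ms_treatment,
        "ms_error": ms_error,
        "F": ms_treatment / ms_error,
    }


# Made-up data: three treatments, three measurements each.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
result = one_way_anova(groups)
for name, value in result.items():
    print(f"{name} = {value:.3f}")
```

The total sum of squares splits exactly into the treatment and error parts, and a large F means the treatment explains much more variance than the scatter within groups.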
Regression Analysis
Linear Regression
Fit a line to data containing noise using the least squares method
• Minimize the sum of squared residuals
• Model with one independent variable
• Model with p-1 independent variables
• Goodness of fit
– Fraction of variance in data which is explained by model
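For the one-variable case, the least squares line and the fraction of variance it explains (R²) can be sketched as below. The model form y = intercept + slope · x, the function names, and the data points are assumptions for illustration only.

```python
from statistics import mean


def fit_line(x, y):
    """Least squares slope and intercept for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def r_squared(x, y, slope, intercept):
    """Fraction of the variance in y explained by the fitted line."""
    y_bar = mean(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Made-up noisy data lying close to a straight line.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 7.1, 8.8]

slope, intercept = fit_line(x, y)
r2 = r_squared(x, y, slope, intercept)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, R^2 = {r2:.4f}")
```

The closed-form slope and intercept are the values that minimize the sum of squared residuals, so no iterative optimizer is needed in the linear case.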
Nonlinear Regression
Fit an arbitrary function to data containing noise, again using least squares method
R² isn’t necessarily a good measure of goodness of fit
• L2 norm (Euclidean distance)
• Relative error in L2 norm
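The two alternative measures can be written out as a short sketch. It assumes the relative error is the L2 norm of the residuals divided by the L2 norm of the data, which is one common convention; the function names and numbers are illustrative only.

```python
import math


def l2_norm(values):
    """Euclidean length of a vector."""
    return math.sqrt(sum(v * v for v in values))


def relative_l2_error(data, model):
    """L2 norm of the residuals, scaled by the L2 norm of the data."""
    residuals = [d - m for d, m in zip(data, model)]
    return l2_norm(residuals) / l2_norm(data)


# Made-up data and model predictions at the same points.
data = [3.0, 4.0]
model = [3.0, 3.0]

absolute_error = l2_norm([d - m for d, m in zip(data, model)])
relative_error = relative_l2_error(data, model)
print(f"L2 error = {absolute_error:.3f}, relative L2 error = {relative_error:.3f}")
```

Scaling by the size of the data makes the error comparable between data sets with different magnitudes or units.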
Regression Analysis
Qualitative Verification
All of these methods assume that the error ε is normally distributed
• Check by looking at plot of residuals
• Residuals should be randomly distributed around axis r = 0
Model Fitting
In Excel
Add trendline – Excel does everything for you
• Only works if you want to use an available function
Goal seek
• Only works for unconstrained, one parameter models
Solver
• Can use for constrained, multiple parameter models
• Uses Quasi-Newton or conjugate gradient method
In MATLAB
Built-in functions
• Root finding by bisection, secant, and inverse quadratic interpolation (fzero)
• Nelder-Mead simplex (fminsearch)
Optimization toolboxes
• Levenberg-Marquardt/Quasi-Newton (fminunc or fmincon)
• Simulated annealing
• Genetic algorithm (GA)
Curve fitting toolbox
Custom algorithm
All methods work by minimizing some error measure, such as the sum of squared residuals
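To make the idea of fitting by error minimization concrete, here is a small sketch for a one-parameter model. It uses golden-section search, a simple technique chosen for the example and not one of the methods listed on the slide. The model y = exp(-k t), the search interval, and the synthetic data are all assumptions.

```python
import math


def sum_squared_error(k, t, y):
    """Sum of squared residuals between data y and the model exp(-k * t)."""
    return sum((yi - math.exp(-k * ti)) ** 2 for ti, yi in zip(t, y))


def golden_section_minimize(f, lo, hi, tol=1e-8):
    """Minimize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    while hi - lo > tol:
        # Keep the sub-interval that must contain the minimum.
        if f(c) < f(d):
            hi = d
        else:
            lo = c
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
    return (lo + hi) / 2


# Synthetic, noise-free data generated from k = 0.7 (illustration only).
t = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0]
true_k = 0.7
y = [math.exp(-true_k * ti) for ti in t]

k_fit = golden_section_minimize(lambda k: sum_squared_error(k, t, y), 0.0, 5.0)
print(f"fitted k = {k_fit:.5f}")
```

The more capable optimizers differ in how they search, but each one is driving an error function like `sum_squared_error` toward its minimum.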
Model Fitting in Excel
Model Fitting in MATLAB
Outlier Rejection
What is an outlier?
An outlier is a data point which disagrees with the other data and cannot be reproduced
Caused by measurement error, incorrect value of independent variable (i.e. user error), noise, chance, or lack of control or understanding of the process
Example:
y(x) = [1.2, 1.3, 5.0, 1.1, 1.2]^T
μ = 1.96; σ = 1.70
When is a point an outlier?
Dixon’s Q Test
• Very simple – just look up a value in a table to see if it’s an outlier
Chauvenet’s Criterion
• Simple, less rigorous
• If p(xi)<1/(2n), throw it out
Grubbs’ Test; Peirce’s Criterion
• Both utilize more rigorous methods
• See paper
Without outlier: μ = 1.20; σ = 0.08
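Chauvenet's criterion can be applied to the example data as a rough sketch. Two assumptions are made here: p(xi) is taken as the two-sided probability, under a normal distribution fitted to the data, of a deviation at least as large as that of xi, and σ is the sample standard deviation (n - 1 in the denominator), which reproduces the values quoted on the slide. The function name is hypothetical.

```python
from statistics import NormalDist, mean, stdev


def chauvenet_outliers(values):
    """Return the values that Chauvenet's criterion would reject."""
    n = len(values)
    mu, sigma = mean(values), stdev(values)
    standard_normal = NormalDist()
    rejected = []
    for v in values:
        z = abs(v - mu) / sigma
        # Two-sided probability of a deviation at least this large.
        p = 2 * (1 - standard_normal.cdf(z))
        if p < 1 / (2 * n):
            rejected.append(v)
    return rejected


y = [1.2, 1.3, 5.0, 1.1, 1.2]
print(f"mu = {mean(y):.2f}, sigma = {stdev(y):.2f}")

outliers = chauvenet_outliers(y)
kept = [v for v in y if v not in outliers]
print("rejected:", outliers)
print(f"without outlier: mu = {mean(kept):.2f}, sigma = {stdev(kept):.2f}")
```

With n = 5 the threshold is 1/(2n) = 0.1, and only the point 5.0 falls below it.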
What makes a good figure?
Clearly relates independent and dependent variables using axes and trend lines
• Units!!
• Proper scaling
– Use log scales if variable(s) vary over orders of magnitude
Symbols and text are large and different
Resolution is sufficiently high
Error bars (if applicable)
Efficient use of space
Utilizes significant figures appropriately
Compares data with applicable model predictions
Contains enough information to get the point(s) across, but not so much that the message is lost or confused
Captioned such that it is understood without reading the text
Presentation of Data
Which of these figures is better?
[Figure: Reilly et al., Experimental Eye Research, 2008.]
[Figure: from a journal article which was rejected. Annotations: ambiguous, wasted space, significant figures, fuzzy text, error bars, goodness-of-fit, legend, units]
Presentation of Data
Reilly et al., Biomacromolecules, 2008.
Tiffany and Koretz, International Journal of Biological Molecules, 2002.
Statistical Experimental Design
Design an experiment using statistical methods to minimize the number of data points required to get the desired information.
Analyze an experiment using statistical methods to maximize the information yield from any set of experiments.