Regression Analysis

Page 1:

Regression Analysis


Page 2:

Least-Squares Linear Regression

- Enables fitting a linear or exponential function to data.
- The goal in regression analysis is the development of a statistical model that can be used to predict the values of a dependent or response variable from the values of the independent variable(s).
- Linear fits are the most common.
- For exponential functions, the data must be transformed.

Page 3:

Method of Least Squares

- If we have N pairs of data (xi, yi), we seek to fit a straight line through the data of the form:

y = a_0 + a_1 x

- Determine the constants a0 and a1 such that the distance between the actual y data and the fitted/predicted line is minimized. Each xi is assumed to be error free; all the error is assumed to be in the y values.

a_0 = \frac{\sum x_i \sum x_i y_i - \sum x_i^2 \sum y_i}{\left(\sum x_i\right)^2 - N\sum x_i^2}

a_1 = \frac{\sum x_i \sum y_i - N\sum x_i y_i}{\left(\sum x_i\right)^2 - N\sum x_i^2}

Page 4:

Manual Calculation Method 

- Seeking an equation of the form y = a0 + a1x; result: y = 0.879 + 0.540x

Raw Data

        yi      xi      xi·yi    xi²
        1.2     1       1.2      1
        2       1.6     3.2      2.56
        2.4     3.4     8.16     11.56
        3.5     4       14       16
        3.5     5.2     18.2     27.04
Sum     12.6    15.2    44.76    58.16

a_0 = \frac{(15.2)(44.76) - (58.16)(12.6)}{(15.2)^2 - (5)(58.16)} = 0.879

a_1 = \frac{(15.2)(12.6) - (5)(44.76)}{(15.2)^2 - (5)(58.16)} = 0.540
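As a quick numerical check of the hand calculation above, here is a minimal Python sketch (numpy assumed available; variable names are illustrative) that evaluates the same closed-form sums:

    import numpy as np

    # Data from the table above
    x = np.array([1.0, 1.6, 3.4, 4.0, 5.2])
    y = np.array([1.2, 2.0, 2.4, 3.5, 3.5])
    N = len(x)

    Sx, Sy = x.sum(), y.sum()                # 15.2, 12.6
    Sxy, Sxx = (x * y).sum(), (x**2).sum()   # 44.76, 58.16

    denom = Sx**2 - N * Sxx
    a0 = (Sx * Sxy - Sxx * Sy) / denom       # ~0.879
    a1 = (Sx * Sy - N * Sxy) / denom         # ~0.540
    print(a0, a1)                            # matches y = 0.879 + 0.540x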

Page 5:

How good is the fit? 

The Coefficient of Determination (R²) measures the goodness of fit: the proportion of the variation in the y values associated with the variation in the x variable in the regression, i.e., the ratio of the explained variation to the total variation.

- R² = 1: perfect fit (good prediction)
- R² = 0: no correlation between x and y
- For engineering data, R² will normally be quite high (0.8-0.90 or higher).
- A low value might indicate that some important variable was not considered but is affecting the results.

R^2 = 1 - \frac{\sum\left(a x_i + b - y_i\right)^2}{\sum\left(y_i - \bar{y}\right)^2}   (Excel function RSQ(yi's, xi's))

where \bar{y} = average of the yi's.
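A short Python sketch of the R² formula above, applied (for illustration only) to the Page 4 data and coefficients; numpy assumed:

    import numpy as np

    x = np.array([1.0, 1.6, 3.4, 4.0, 5.2])
    y = np.array([1.2, 2.0, 2.4, 3.5, 3.5])
    a0, a1 = 0.879, 0.540                    # fit from Page 4

    y_hat = a0 + a1 * x
    ss_res = np.sum((y_hat - y)**2)          # unexplained variation
    ss_tot = np.sum((y - y.mean())**2)       # total variation
    r2 = 1 - ss_res / ss_tot                 # the quantity Excel's RSQ(y, x) reports for a least-squares fit
    print(r2)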

Page 6:

Standard Error of Estimate (SEE)

- The standard error of estimate (SEE or Syx) is a statistical measure of how well the best-fit line represents the data. It is, effectively, the standard deviation of the differences between the data points and the best-fit line.
- It provides an estimate of the scatter/random error in the data about the fitted line. This is analogous to the standard deviation for sample data.
- It has the same units as y.
- Two degrees of freedom are lost to calculate the coefficients a0 and a1.

s_{ey} = SEE = S_{yx} = \sqrt{\frac{\sum\left(y_i - \hat{y}_i\right)^2}{N-2}}   (Excel function STEYX(yi's, xi's))

where y_i = actual value of y for a given x_i, and \hat{y}_i = predicted value of y for a given x_i.
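The same example data can illustrate the SEE formula; a minimal Python sketch, assuming numpy:

    import numpy as np

    x = np.array([1.0, 1.6, 3.4, 4.0, 5.2])
    y = np.array([1.2, 2.0, 2.4, 3.5, 3.5])
    y_hat = 0.879 + 0.540 * x                         # predicted values from the Page 4 fit

    N = len(y)
    see = np.sqrt(np.sum((y - y_hat)**2) / (N - 2))   # the quantity Excel's STEYX(y, x) returns
    print(see)                                        # has the units of y; 2 degrees of freedom lost to a0 and a1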

Page 7:

Linear Regression Assumptions

- Variation in the data is assumed to be normally distributed and due to random causes.
- Random variation is assumed to exist in the y values, while the x values are error free.
- Since error has been minimized in the y direction, an erroneous conclusion may be made if x is estimated based on a value for y.
- For power-law or exponential relationships, the data need to be transformed before carrying out linear regression analysis.
- (As we will discuss later, the method of least squares can also be applied to nonlinear functional relationships.)

Page 8:

Linear Regression Example

[Figure: sensor Output (Volts) vs. Length (cm) with linear trendline y = 0.9977x + 0.0295, R² = 0.9993]

- Use Excel
- Chart >> Add Trendline to obtain the coefficients
- Functions RSQ() and STEYX() to determine R² and SEE

Page 9:

Regression Analysis using Excel Analysis Tools

- Linear regression is a standard feature of statistical programs and most spreadsheet programs. It is only necessary to input the x and y data; the remaining calculations are performed immediately.
- Excel "Regression Analysis" macro:
  - Performs linear regression only
  - Non-linear relationships must be transformed
  - Calculates the slope, intercept, SEE, and the upper and lower confidence intervals for the slope and intercept
  - Does not produce any graphical output on the user's plot
  - Does not update automatically
  - The user must interpret the results

Page 10:

Linear Regression in Excel 2008 

Torque, N-m (Y)   RPM (X)   Y Predicted    Residual       Residual/SEE = Residual/sey
4.89              100       4.998433207    0.108433207     0.17558474
4.77              201       4.559896053   -0.210103947    -0.340219088
3.79              298       4.138726707    0.348726707     0.564689451
3.76              402       3.687163697   -0.072836303    -0.117943051
2.84              500       3.261652399    0.421652399     0.682777249
4.12              601       2.823115245   -1.296884755    -2.100031702   <- Outlier
2.05              699       2.397603947    0.347603947     0.562871377
1.61              799       1.963408745    0.353408745     0.572271025

Fitted model: Y = m1·X + b, from =LINEST(A2:A9,B2:B9,TRUE,TRUE):

m1  = -0.004341952    b   = 5.432628409
se1 =  0.000954031    seb = 0.481645161
r^2 =  0.775391233    sey = 0.617554846
F   =  20.71311576    df  = 6
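For readers working outside Excel, a hedged Python sketch (numpy assumed) that reproduces the quantities in the LINEST array above from the torque/RPM data; the standard-error and F expressions are the usual simple-linear-regression formulas rather than anything specific to LINEST:

    import numpy as np

    # Torque (Y) vs RPM (X) data from the table above, outlier included
    X = np.array([100, 201, 298, 402, 500, 601, 699, 799], dtype=float)
    Y = np.array([4.89, 4.77, 3.79, 3.76, 2.84, 4.12, 2.05, 1.61])
    n = len(X)

    m1, b = np.polyfit(X, Y, 1)                     # least-squares slope and intercept
    Y_hat = m1 * X + b

    df = n - 2
    sey = np.sqrt(np.sum((Y - Y_hat)**2) / df)      # SEE, third row of the LINEST array
    Sxx = np.sum((X - X.mean())**2)
    se1 = sey / np.sqrt(Sxx)                        # standard error of the slope
    seb = sey * np.sqrt(np.sum(X**2) / (n * Sxx))   # standard error of the intercept
    r2 = 1 - np.sum((Y - Y_hat)**2) / np.sum((Y - Y.mean())**2)
    F = r2 / (1 - r2) * df                          # F statistic for the regression

    print(m1, b, se1, seb, r2, sey, F, df)          # compare with the LINEST output above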

Page 11:

Linear Regression Example: Omit Outlier 

Torque, N-m (Y)   RPM (X)   Y Predicted    Residual       Residual/SEE = Residual/sey
4.89              100       5.000219168    0.110219168     0.504559919
4.77              201       4.504157858   -0.265842142    -1.21696881
3.79              298       4.02774254     0.23774254      1.088334807
3.76              402       3.516946736   -0.243053264    -1.112646171
2.84              500       3.03561992     0.19561992      0.895506407
2.05              699       2.058231795    0.008231795     0.037683406
1.61              799       1.567081983   -0.042918017    -0.196469559

LINEST array output with the outlier omitted:

m1     = -0.004911498    b        = 5.49136898
se1    =  0.000348477    seb      = 0.170606738
r^2    =  0.975447633    sey      = 0.218446143
F      =  198.6463557    df       = 5
ss_reg =  9.479149271    ss_resid = 0.238593586   (fifth row of the LINEST array: regression and residual sums of squares)

Page 12:

Uncertainties on Regression

Confidence Interval for the Regression Line
  SEE = sey                                     0.218446143
  TINV(α=0.05, ν=5)                             2.570581835
  95% C.I. = TINV(α=0.05, ν=5)·SEE/SQRT(7)      0.212239784

Prediction Band for the Regression Line
  95% P.I. = TINV(α=0.05, ν=5)·SEE              0.561533687

Uncertainty in Slope
  Δm1 = TINV(0.05, 5)·se1                       0.000895789

Uncertainty in Intercept
  Δb = TINV(0.05, 5)·seb                        0.438558582
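A minimal Python sketch of the same arithmetic, with scipy's t-distribution standing in for Excel's TINV; the numeric inputs are the outlier-free results quoted above:

    import numpy as np
    from scipy import stats

    sey, se1, seb = 0.218446143, 0.000348477, 0.170606738   # from the Page 11 fit
    n, nu = 7, 5                                             # 7 points, nu = n - 2

    t = stats.t.ppf(1 - 0.05 / 2, nu)    # two-sided 95% t-value, as TINV(0.05, 5) ~ 2.5706

    ci_line = t * sey / np.sqrt(n)       # ~0.212, 95% C.I. half-width on the fitted line
    pi_line = t * sey                    # ~0.562, 95% prediction-band half-width
    dm1 = t * se1                        # ~0.000896, uncertainty in the slope
    db  = t * seb                        # ~0.439, uncertainty in the intercept
    print(ci_line, pi_line, dm1, db)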

Page 13:

Regression Line Confidence Intervals & Prediction Band 

- Not only do you want to obtain a curve-fit relationship, you also want to establish a confidence interval in the equation, a measure of the random uncertainty in the curve fit.
- ν = N - 2 in the determination of the t-value. Two degrees of freedom are lost because m1 and b are determined.

CI = \hat{y} \pm t_{\alpha,\nu}\frac{SEE}{\sqrt{N}} = \pm t_{\alpha,\nu}\frac{S_{yx}}{\sqrt{N}} = \pm t_{\alpha,\nu}\frac{s_{ey}}{\sqrt{N}}

where t_{\alpha,\nu} = TINV(\alpha, \nu) (two-sided t-table) and \alpha = 1 - P.

PB = \hat{y} \pm t_{\alpha,\nu}\, SEE = \pm t_{\alpha,\nu}\, S_{yx} = \pm t_{\alpha,\nu}\, s_{ey}

[Figure: Torque (N-m) vs. RPM showing the data, the torque least-squares fit, the -95% and +95% confidence intervals, and the -95% and +95% prediction bands]

Page 14:


Regression Line Confidence Interval & Prediction Band 

CI\ \text{in curve fit} = \pm t_{\alpha/2,\,n-2}\; s_{ey}\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}} \approx \pm t_{\alpha/2,\,n-2}\;\frac{s_{ey}}{\sqrt{n}}

\Delta y_{\text{Prediction Band}} = \pm t_{\alpha/2,\,n-2}\; s_{ey}\sqrt{\frac{n+1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}} \approx \pm t_{\alpha/2,\,n-2}\; s_{ey}

The left-hand forms are more accurate: the bands are at their minimum at the mean x and flare out at the low and high extremes. The right-hand forms are the approximations used on the previous page.
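A sketch of the exact (x-dependent) interval formulas above in Python, assuming numpy and scipy and reusing the outlier-free torque data for illustration:

    import numpy as np
    from scipy import stats

    X = np.array([100, 201, 298, 402, 500, 699, 799], dtype=float)
    Y = np.array([4.89, 4.77, 3.79, 3.76, 2.84, 2.05, 1.61])
    n = len(X)

    m1, b = np.polyfit(X, Y, 1)
    sey = np.sqrt(np.sum((Y - (m1 * X + b))**2) / (n - 2))
    Sxx = np.sum((X - X.mean())**2)
    t = stats.t.ppf(0.975, n - 2)

    x_star = np.linspace(X.min(), X.max(), 50)    # points at which to evaluate the bands
    ci = t * sey * np.sqrt(1/n + (x_star - X.mean())**2 / Sxx)          # exact CI half-width
    pb = t * sey * np.sqrt((n + 1)/n + (x_star - X.mean())**2 / Sxx)    # exact prediction-band half-width
    y_fit = m1 * x_star + b
    # y_fit +/- ci and y_fit +/- pb trace out the curves plotted on the previous page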

Page 15:

Summations Used in Statistics & Regression

Sample standard deviation:

S_x = \left[\frac{1}{N-1}\sum\left(x_i - \bar{x}\right)^2\right]^{1/2}

Sum of squares used in regression analysis for evaluating CI & PI:

S_{xx} = \sum\left(x_i - \bar{x}\right)^2

Standard error of estimate:

s_{ey} = SEE = S_{yx} = \left[\frac{\sum\left(y_i - y_{\text{predicted at } x=x_i}\right)^2}{N-2}\right]^{1/2}

Page 16:

CI in slope and intercept 

Slope, m:

CI\ \text{in slope} = \pm t_{\alpha/2,\nu} \cdot se_1

Intercept, b:

CI\ \text{in intercept} = \pm t_{\alpha/2,\nu} \cdot se_b

Note 1: ν = n - 2.
Note 2: m and b are not independent variables. Therefore, do not apply RSS to y = mx + b to determine Δy. Instead, use the CI for the curve fit.

Page 17:

Outliers in x-y Data Sets

- The method involves computing the ratio of the residuals (predicted - actual) to the standard error of estimate (sey = SEE); a sketch follows this list.
1. Residuals = y_predicted - y_actual at each xi.
2. Plot the ratio residuals/SEE for each xi. These are the "standardized residuals".
3. Standardized residuals exceeding ±2 may be considered outliers. Assuming the residuals are normally distributed, you can expect 95% of the residuals to lie in the range ±2 (that is, within 2 standard deviations of the best-fit line).
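A minimal Python sketch of the standardized-residual check, assuming numpy and using the full torque data set from Page 10:

    import numpy as np

    X = np.array([100, 201, 298, 402, 500, 601, 699, 799], dtype=float)
    Y = np.array([4.89, 4.77, 3.79, 3.76, 2.84, 4.12, 2.05, 1.61])
    n = len(X)

    m1, b = np.polyfit(X, Y, 1)
    residuals = (m1 * X + b) - Y                      # predicted - actual, per step 1
    sey = np.sqrt(np.sum(residuals**2) / (n - 2))     # SEE

    standardized = residuals / sey                    # step 2: standardized residuals
    outliers = np.abs(standardized) > 2               # step 3: flags the point at 601 RPM
    print(np.column_stack([X, standardized]), outliers)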

Page 18:

Linear Regression with Data Transformation


Page 19:

Data Transformation

- Commonly, test data do not show an approximately linear relationship between the dependent (Y) and independent (X) variables, and a direct linear regression is not useful.
- The form of the relationship expected between the dependent and independent variables is often known.
- The data need to be transformed prior to performing a linear regression.
- Transformations often can be accomplished by taking the logarithms or natural logarithms of one or both sides of the equation.

Page 20:

Common Transformations

Relationship    Plot Method                        Transformed Equation               Transformed Intercept, b   Transformed Slope, m1
y = αx^γ        Log y vs. Log x (log-log plot)     Log(y) = Log(α) + γ·Log(x)         Log(α)                     γ
                Ln y vs. Ln x (log-log paper)      Ln(y) = Ln(α) + γ·Ln(x)            Ln(α)                      γ
y = αe^(γx)     Log y vs. x (semi-log plot)        Log(y) = Log(α) + γ·Log(e)·x       Log(α)                     γ·Log(e)
                Ln y vs. x (semi-log plot)         Ln(y) = Ln(α) + γx                 Ln(α)                      γ
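As an illustration of the first row of the table, a small Python sketch (numpy assumed, synthetic data) that fits a power law y = αx^γ by regressing Log(y) on Log(x):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.array([1.0, 2.0, 5.0, 10.0, 20.0, 50.0])
    y = 2.5 * x**0.7 * (1 + 0.02 * rng.standard_normal(x.size))   # alpha = 2.5, gamma = 0.7, plus noise

    # Linear regression on the transformed variables: Log(y) = Log(alpha) + gamma*Log(x)
    gamma, log_alpha = np.polyfit(np.log10(x), np.log10(y), 1)
    alpha = 10**log_alpha
    print(alpha, gamma)    # recovers ~2.5 and ~0.7; slope = gamma, intercept = Log(alpha)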

Page 21:

Regression with Transformation

- Example: A velocity probe provides a voltage output that is related to velocity, U, by the form E = δ + εU^ρ
- δ, ε, and ρ are constants

U (ft/s)   Ei (V)
0          3.19
10         3.99
20         4.3
30         4.48
40         4.65

[Figures: Output Voltage (VDC) vs. Velocity (ft/s), plotted on linear axes and on log-log axes]

Page 22:

Data Relationship Transformation

E = δ + εU^ρ   (E = δ = 3.19 at U = 0)
Log(E - 3.19) = Log(εU^ρ)
Log(E - 3.19) = Log(ε) + Log(U^ρ) = Log(ε) + ρ·Log(U)

Let's transform the data: X = Log(U), Y = Log(E - 3.19)

U (ft/s)   Ei (V)   X       Y
0          3.19
10         3.99     1.00    -0.097
20         4.3      1.30     0.045
30         4.48     1.48     0.111
40         4.65     1.60     0.164

Perform the regression on the transformed data: Y = m1·X + b, with slope m1 = ρ and intercept b = Log(ε).
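A minimal Python sketch of this transformation and regression, assuming numpy; it should recover roughly the ρ and ε reported on the following pages:

    import numpy as np

    # Velocity-probe example: E = delta + eps * U**rho, with delta = 3.19 (the reading at U = 0)
    U = np.array([10.0, 20.0, 30.0, 40.0])
    E = np.array([3.99, 4.3, 4.48, 4.65])

    X = np.log10(U)             # transformed independent variable
    Y = np.log10(E - 3.19)      # transformed dependent variable

    rho, log_eps = np.polyfit(X, Y, 1)    # slope m1 = rho, intercept b = Log(eps)
    eps = 10**log_eps
    print(rho, eps)             # ~0.432 and ~0.298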

Page 23:

Solution (Excel 2004 Output)

SUMMARY OUTPUT

Regression Statistics
Multiple R            0.998723855
R Square              0.997449339
Adjusted R Square     0.996174009
Standard Error        0.01            (t value = 3.18, t·SEE = 0.02)
Observations          4

ANOVA
             df   SS            MS         F          Significance F
Regression   1    0.038118269   0.038118   782.1106   0.00127614
Residual     2    9.74754E-05   4.87E-05
Total        3    0.038215745

               Coefficients   Standard Error   t Stat     P-value    Lower 95%     Upper 95%
Intercept      -0.525         0.021056315      -24.9274   0.001605   -0.61547736   -0.4342812
X Variable 1    0.432         0.015438034       27.96624  0.001276    0.36531922    0.49816831

t_{α,ν} = TINV(0.05, 2) = 4.3026
Y = -0.525 + 0.432X
SEE = 0.0070

Page 24:

Regression with Transformation & Uncertainty  

Example 4.10

[Figure: E (V) vs. U (ft/s) showing the data and the back-transformed fit with its confidence limits]

E = 3.19 + 0.298·U^0.432

b = Log(ε) = -0.525, so ε = 10^(-0.525) = 0.298

Y predicted   Y+        Y-        Transform it back again:   E      E+     E-
(U = 0)                                                      3.19   3.19   3.19
-0.0931       -0.0781   -0.1082                              4.00   4.03   3.97
 0.0368        0.0519    0.0218                              4.28   4.32   4.24
 0.1129        0.1279    0.0978                              4.49   4.53   4.44
 0.1668        0.1818    0.1518                              4.66   4.71   4.61
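A short Python sketch of the back-transformation, assuming numpy and taking the fitted intercept, slope, SEE, and t-value from the preceding pages as given:

    import numpy as np

    U = np.array([10.0, 20.0, 30.0, 40.0])
    b, m1 = -0.525, 0.432                      # intercept and slope of the transformed fit
    half_ci = 4.3026 * 0.0070 / np.sqrt(4)     # t(0.05, 2) * SEE / sqrt(n), roughly 0.015

    Y = b + m1 * np.log10(U)                   # predicted Y = Log(E - 3.19)
    for y in (Y, Y + half_ci, Y - half_ci):
        E = 3.19 + 10**y                       # undo the transform: E = 3.19 + eps * U**rho
        print(np.round(E, 2))                  # reproduces the E, E+, E- columns above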

Page 25:

Multiple and Polynomial Regression

- Regression analysis can also be performed in situations where there is more than one independent variable (multiple regression) or for polynomials of an independent variable (polynomial regression).
- Polynomial regression seeks the form:

Y = b + m_1 x + m_2 x^2 + \dots + m_k x^k

- Multiple regression seeks a function of the form:

Y = b + m_1\hat{x}_1 + m_2\hat{x}_2 + m_3\hat{x}_3 + \dots + m_k\hat{x}_k

where \hat{x} may represent several independent variables, for example \hat{x}_1 = x_1, \hat{x}_2 = x_2, \hat{x}_3 = x_1 x_2.
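Hedged Python sketches of both forms (numpy assumed; the data and coefficients are synthetic placeholders, not values from the slides):

    import numpy as np

    rng = np.random.default_rng(1)

    # Polynomial regression: Y = b + m1*x + m2*x**2
    x = np.linspace(0, 5, 20)
    y = 1.0 + 0.5 * x - 0.2 * x**2 + 0.05 * rng.standard_normal(x.size)
    m2, m1, b = np.polyfit(x, y, 2)          # np.polyfit returns the highest-order coefficient first

    # Multiple regression with an interaction term: Y = b + m1*x1 + m2*x2 + m3*(x1*x2)
    x1 = rng.uniform(0, 1, 30)
    x2 = rng.uniform(0, 1, 30)
    Y = 2.0 + 1.5 * x1 - 0.8 * x2 + 0.6 * x1 * x2 + 0.02 * rng.standard_normal(30)
    A = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])   # design matrix [1, x1, x2, x1*x2]
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)               # [b, m1, m2, m3]
    print(np.round(coef, 3))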

Page 26:

Linear Regression in Excel 2004

- Input the result values
- Input the desired confidence level
- Input the independent variable

Page 27:

Excel 2004 Linear Regression Output 

SUMMARY OUTPUT

Regression Statistics
Multiple R            0.99964308
R Square              0.99928628      <- R²
Adjusted R Square     0.99910785
Standard Error        0.02788582      <- SEE = sey
Observations          6               <- N

ANOVA
             df   SS           MS           F            Significance F
Regression   1    4.35502286   4.35502286   5600.45805   1.9107E-07
Residual     4    0.00311048   0.00077762
Total        5    4.35813333

               Coefficients   Standard Error   t Stat       P-value      Lower 95%     Upper 95%
Intercept      0.02952381     0.02018228       1.46285828   0.21733392   -0.02651117   0.08555879    <- intercept "b"
X Variable 1   0.99771429     0.01333197       74.8362082   1.9107E-07    0.9606988    1.03472978    <- slope "m1"

The Lower 95% and Upper 95% columns give the lower and upper bounds for the coefficients. To obtain the ± bound, simply subtract the lower from the upper and divide by two.