Other Regression Models
Andy Wang
CIS 5930-03: Computer Systems Performance Analysis


TRANSCRIPT

  • Other Regression Models (Andy Wang, CIS 5930-03: Computer Systems Performance Analysis)

  • Regression With Categorical Predictors

    The regression methods discussed so far assume numerical variables.
    What if some of your variables are categorical in nature?
    If all of them are categorical, use the techniques discussed later in the course.
    Levels: the number of values a category can take.

  • Handling Categorical Predictors

    If there are only two levels, define bi as follows:
      bi = 0 for the first value
      bi = 1 for the second value
    (This definition is missing from the book in section 15.2.)
    You can use +1 and -1 as the values instead.
    You need k-1 predictor variables for k levels, to avoid implying an order among the categories.

  • Categorical Variables Example

    Which is a better predictor of a high rating in the movie database:
    winning an Oscar, winning the Golden Palm at Cannes, or winning the
    New York Critics Circle award?

  • Choosing Variables

    The categories are not mutually exclusive, so:
      x1 = 1 if Oscar, 0 otherwise
      x2 = 1 if Golden Palm, 0 otherwise
      x3 = 1 if Critics Circle Award, 0 otherwise
      y = b0 + b1*x1 + b2*x2 + b3*x3
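    A minimal sketch of this dummy-variable setup, using numpy's least squares.
    The 0/1 indicator assignments below are reconstructed from the films' actual
    award histories and should be treated as illustrative, not as the lecture's
    exact data.

```python
import numpy as np

# One row per film: [Oscar, Golden Palm, NYC Critics Circle] as 0/1 indicators.
# Assignments are illustrative reconstructions of each film's awards.
X_cat = np.array([
    [1, 0, 1],  # Gentleman's Agreement
    [1, 0, 0],  # Mutiny on the Bounty
    [1, 1, 1],  # Marty
    [0, 1, 0],  # If....
    [0, 1, 0],  # La Dolce Vita
    [0, 1, 0],  # Kagemusha
    [0, 0, 1],  # The Defiant Ones
    [0, 0, 1],  # Reds
    [0, 0, 1],  # High Noon
])
y = np.array([7.5, 7.6, 7.4, 7.8, 8.1, 8.2, 7.5, 6.6, 8.1])

# Prepend a column of ones for the intercept, then solve
# y = b0 + b1*x1 + b2*x2 + b3*x3 in the least-squares sense.
X = np.column_stack([np.ones(len(y)), X_cat])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
```

    The same X matrix works for any 0/1 coding; only the indicator columns change.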

  • A Few Data Points

    Title                   Rating  Oscar  Palm  NYC
    Gentleman's Agreement   7.5     X            X
    Mutiny on the Bounty    7.6     X
    Marty                   7.4     X      X     X
    If....                  7.8            X
    La Dolce Vita           8.1            X
    Kagemusha               8.2            X
    The Defiant Ones        7.5                  X
    Reds                    6.6                  X
    High Noon               8.1                  X

  • And Regression Says . . .

    How good is that? R2 is 34%: the regression explains 34% of the variation.
    Better than age and length, but still no great shakes.
    Are the regression parameters significant at the 90% level?

  • Curvilinear Regression

    Linear regression assumes a linear relationship between predictor and response.
    What if it isn't linear?
    You need to fit some other type of function to the relationship.

  • When To Use Curvilinear Regression

    Easiest to tell by sight: make a scatter plot.
    If the plot looks non-linear, try curvilinear regression.
    Or use it if a non-linear relationship is suspected for other reasons.
    The relationship should be convertible to a linear form.

  • Types of Curvilinear Regression

    Many possible types, based on a variety of relationships
    (e.g., exponential, logarithmic, and power forms, as in the
    transformations below); many others exist.

  • Transform Them to Linear Forms

    Apply logarithms, multiplication, division, whatever it takes to produce something in linear form.
    I.e., y = a + b*(something), or a similar form.
    If a predictor appears in more than one transformed predictor variable, correlation is likely!

  • Sample Transformations

    For y = a*e^(bx), take the logarithm of y:
      ln(y) = ln(a) + b*x
      Let y' = ln(y), b0 = ln(a), b1 = b.
      Do the regression on y' = b0 + b1*x.
    For y = a + b*ln(x), let x' = ln(x) (i.e., x = e^(x')):
      Do the regression on y = a + b*x'.
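    The exponential case can be sketched in a few lines; the data here is
    synthetic, generated from assumed values a = 2 and b = 0.5, purely to show
    the transform recovering the parameters.

```python
import numpy as np

a_true, b_true = 2.0, 0.5            # assumed, for illustration
x = np.linspace(1.0, 5.0, 20)
y = a_true * np.exp(b_true * x)      # noise-free y = a*e^(bx)

# Transformed model: ln(y) = ln(a) + b*x, i.e. y' = b0 + b1*x
b1, b0 = np.polyfit(x, np.log(y), 1) # returns (slope, intercept)
a_hat, b_hat = np.exp(b0), b1        # invert b0 = ln(a)
```

    With noise-free data the original a and b come back essentially exactly.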

  • Sample Transformations (continued)

    For y = a*x^b, take the log of both x and y:
      ln(y) = ln(a) + b*ln(x)
      Let y' = ln(y), b0 = ln(a), b1 = b, x' = ln(x).
      Do the regression on y' = b0 + b1*x'.
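    The power-law case works the same way, except both sides are logged; again
    the data is synthetic, generated from assumed values a = 3 and b = 1.7.

```python
import numpy as np

a_true, b_true = 3.0, 1.7            # assumed, for illustration
x = np.linspace(1.0, 10.0, 25)
y = a_true * x ** b_true             # noise-free y = a*x^b

# Transformed model: ln(y) = ln(a) + b*ln(x)
b1, b0 = np.polyfit(np.log(x), np.log(y), 1)
a_hat, b_hat = np.exp(b0), b1
```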

  • Corrections to Jain p. 257

    (Table of nonlinear forms and their corrected linear equivalents.)


  • General Transformations

    Use some function of the response variable y in place of y itself.
    Curvilinear regression is one example, but the techniques are more generally applicable.

  • When To Transform?

    If known properties of the measured system suggest it.
    If the data's range covers several orders of magnitude.
    If the homogeneous-variance assumption on the residuals (homoscedasticity) is violated.

  • Transforming Due To Heteroscedasticity

    If the spread in the scatter plot of residuals vs. predicted response isn't homogeneous,
    then the residuals are still functions of the predictor variables.
    A transformation of the response may solve the problem.

  • What Transformation To Use?

    Compute the standard deviation of the residuals at each predicted value y_hat
    (this assumes multiple residuals at each predicted value).
    Plot it as a function of the mean of the observations,
    assuming multiple experiments for each single set of predictor values.
    Check for linearity: if the plot is linear, use a log transform.

  • Other Tests for Transformations

    If the variance plotted against the mean of the observations is linear, use a square-root transform.
    If the standard deviation against the mean squared is linear, use an inverse (1/y) transform.
    If the standard deviation against the mean to a power is linear, use a power transform.
    More are covered in the book.
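    One way to sketch these checks: simulate replicated measurements whose
    spread grows with the mean, then fit the standard deviation against the
    mean and inspect the fit. The group means and the noise model below are
    invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated replicates at four predictor settings, with the std of the
# response deliberately proportional to its mean (illustrative assumption).
groups = [m + rng.normal(0.0, 0.1 * m, size=50) for m in (10.0, 20.0, 40.0, 80.0)]

means = np.array([g.mean() for g in groups])
stds = np.array([g.std(ddof=1) for g in groups])

# Fit std ~ mean; a near-linear relationship suggests a log transform.
slope, intercept = np.polyfit(means, stds, 1)
pred = slope * means + intercept
r2 = 1.0 - ((stds - pred) ** 2).sum() / ((stds - stds.mean()) ** 2).sum()
```

    Plotting variance or std-squared against the mean instead selects among the
    other transforms listed above in the same way.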

  • General Transformation Principle

    For some observed relation g between the standard deviation of the
    residuals and the mean, s = g(ybar),

    let h(y) = ∫ dy / g(y),

    transform to w = h(y),

    and regress on w.

  • Example: Log Transformation

    If the standard deviation plotted against the mean is linear, then s = a*ybar.

    So w = h(y) = ∫ dy / (a*y) = (1/a) ln(y): a logarithmic transform.

  • Confidence Intervals for Nonlinear Regressions

    For nonlinear fits using general (e.g., exponential) transformations:
    confidence intervals apply to the transformed parameters.
    It is not valid to perform the inverse transformation on the intervals;
    they must be expressed in the transformed domain.

  • Outliers

    Atypical observations might be outliers:
      measurements that are not truly characteristic,
      values that are, by chance, several standard deviations out,
      or mistakes made in the measurement itself.
    This leads to a problem: do you include outliers in the analysis or not?

  • Deciding How To Handle Outliers

    1. Find them (by looking at the scatter plot).
    2. Check carefully for experimental error.
    3. Repeat the experiments at the predictor values for each outlier.
    4. Decide whether to include or omit the outliers (or do the analysis both ways).
    Question: Is the first point in last lecture's example an outlier on the rating vs. age plot?
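    A small sketch of step 1, using the rating-vs-age data that appears in the
    deck's spreadsheet: flag points whose residual from a least-squares line
    exceeds two standard deviations. The 2-sd cutoff is a common rule of thumb,
    not a fixed prescription.

```python
import numpy as np

age    = np.array([5, 13, 20, 28, 41, 49, 61, 62], dtype=float)
rating = np.array([8.1, 6.8, 7.0, 7.4, 7.7, 7.5, 7.6, 8.0])

slope, intercept = np.polyfit(age, rating, 1)
resid = rating - (slope * age + intercept)

cutoff = 2.0 * resid.std(ddof=2)          # 2 sd of the residuals
outliers = np.where(np.abs(resid) > cutoff)[0]
# Candidate outliers should be inspected (steps 2-4), not automatically dropped.
```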

  • Rating vs. Age

    (Scatter plot of rating vs. age for the movie data.)

    (Embedded spreadsheet: the matrix computations for the rating vs. age-and-length
    example, i.e. X^T X, its inverse, b = (X^T X)^-1 X^T y, the sums of squares,
    R^2, the confidence intervals for the parameters, and scatter charts of
    Rating vs. Age and Rating vs. Length. The underlying data:

      Age:    5    13   20   28   41   49   61   62
      Length: 118  132  119  153  91   118  132  105
      Rating: 8.1  6.8  7.0  7.4  7.7  7.5  7.6  8.0

    Key results for the two-predictor fit: b = (8.373, 0.0051, -0.0086),
    R^2 = 0.233, MSR/MSE = 0.758.)

  • Common Mistakes in Regression

    Generally based on taking shortcuts, not being careful,
    or not understanding some fundamental principle of statistics.

  • Not Verifying Linearity

    Draw the scatter plot. If it's not linear, check for curvilinear possibilities.
    It is misleading to use linear regression when the relationship isn't linear.

  • Relying on Results Without Visual Verification

    Always check the scatter plot as part of the regression.
    Examine the predicted line vs. the actual points.
    This is particularly important if the regression is done automatically.

  • Some Nonlinear Examples

    (The slide shows several example plots of nonlinear relationships.)


  • Attaching Importance To Values of Parameters

    The numerical values of regression parameters depend on the scale of the predictor variables.
    So a parameter's value seeming small or large is not necessarily an indication of its importance.
    E.g., converting seconds to microseconds doesn't change anything fundamental,
    but the magnitude of the associated parameter changes.

  • Not Specifying Confidence Intervals

    Samples of observations are random; thus, regression yields parameters with random properties.
    Without a confidence interval, it is impossible to understand what a parameter really means.

  • Not Calculating the Coefficient of Determination

    Without R2, it is difficult to determine how much of the variance is explained by the regression.
    Even if R2 looks good, it is safest to also perform an F-test.
    It is not that much extra effort.
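    Both checks are a few lines of arithmetic once the fit is done. This sketch
    reuses the rating-vs-age data from the deck's spreadsheet.

```python
import numpy as np

age    = np.array([5, 13, 20, 28, 41, 49, 61, 62], dtype=float)
rating = np.array([8.1, 6.8, 7.0, 7.4, 7.7, 7.5, 7.6, 8.0])

slope, intercept = np.polyfit(age, rating, 1)
yhat = slope * age + intercept

sse = ((rating - yhat) ** 2).sum()           # error sum of squares
sst = ((rating - rating.mean()) ** 2).sum()  # total sum of squares
ssr = sst - sse                              # explained by the regression
r2 = ssr / sst                               # coefficient of determination

n, k = len(rating), 1                        # n observations, k predictors
f_stat = (ssr / k) / (sse / (n - k - 1))     # MSR / MSE, for the F-test
```

    The F statistic is then compared against an F-table with (k, n-k-1)
    degrees of freedom at the chosen confidence level.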

  • Using the Coefficient of Correlation Improperly

    The coefficient of determination is R2; the coefficient of correlation is R.
    R2, not R, gives the percentage of variance explained by the regression.
    E.g., if R is 0.5, R2 is 0.25, and the regression explains 25% of the variance.
    Not 50%!

  • Using Highly Correlated Predictor Variables

    If two predictor variables are highly correlated, using both degrades the regression.
    E.g., there is likely to be a correlation between an executable's on-disk and in-core sizes,
    so don't use both as predictors of run time.
    This means you need to understand your predictor variables as well as possible.
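    The check this implies can be sketched as computing the correlation between
    candidate predictors before using both. The sizes below are invented
    stand-ins for on-disk and in-core executable sizes, and the 0.9 threshold is
    only a rule of thumb.

```python
import numpy as np

disk_kb = np.array([120.0, 340.0, 560.0, 910.0, 1500.0, 2200.0])
core_kb = disk_kb * 1.3 + np.array([5.0, -12.0, 8.0, -3.0, 20.0, -9.0])

r = np.corrcoef(disk_kb, core_kb)[0, 1]   # correlation coefficient R
highly_correlated = abs(r) > 0.9          # illustrative threshold
# If True, keep only one of the two as a predictor of run time.
```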

  • Using Regression Beyond the Range of Observations

    Regression is based on observed behavior in a particular sample.
    It is most likely to predict accurately within the range of that sample.
    Far outside the range, who knows?
    E.g., a regression on the run time of executables smaller than main memory
    may not predict the performance of executables that need VM activity.

  • Using Too Many Predictor Variables

    Adding more predictors does not necessarily improve the model!
    You are more likely to run into multicollinearity problems.
    So which variables should you choose? That is the subject of much of this course.

  • Measuring Too Little of the Range

    Regression only predicts well near the range of the observations.
    If you don't measure the commonly used range, the regression won't predict much.
    E.g., if many programs are bigger than main memory, measuring only those that are smaller is a mistake.

  • Assuming a Good Predictor Is a Good Controller

    Correlation isn't necessarily control.
    Just because variable A is related to variable B, you may not be able to control the values of B by varying A.
    E.g., the number of hits on a Web page may be correlated with server bandwidth,
    but you might not boost hits by increasing bandwidth.
    Often, a goal of regression is finding control variables.

  • White Slide

    (Speaker note: This belongs in the advanced regression lecture.)