Mathematical and Computer Modelling 44 (2006) 304–318
www.elsevier.com/locate/mcm
Two-parameter ridge regression and its convergence to the eventual pairwise model
Stan Lipovetsky∗
GfK Custom Research Inc., 8401 Golden Valley Road, Minneapolis, MN 55427, United States
Received 22 September 2005; received in revised form 6 January 2006; accepted 9 January 2006
Abstract
A generalization of the one-parameter ridge regression to a two-parameter model and its asymptotic behavior, which has various
better fitting characteristics, is considered. For the two-parameter model, the coefficients of regression, their t -statistics and net
effects, the residual variance and coefficient of multiple determination, the characteristics of bias, efficiency, and generalized
cross-validation are very stable and as the ridge parameter increases they eventually reach asymptotic levels. In particular, the
beta coefficients of multiple two-parameter ridge regression models converge to a solution proportional to the coefficients of
paired regressions. The suggested technique produces robust regression models not prone to multicollinearity, with interpretable
coefficients and other characteristics convenient for analysis of the models.
© 2006 Elsevier Ltd. All rights reserved.
Keywords: Ridge regression; Multicollinearity; Robust regression; Paired models
1. Introduction
Regression analysis is one of the main tools of statistical modeling widely used for the prediction of the
dependent variable from a set of independent variables. Regressions are very useful for prediction, but can be not
so helpful for the analysis and interpretation of the role of the individual predictors because of their multicollinearity.
Multicollinearity makes the parameter estimates fluctuate wildly with a negligible change in the sample, yields signs
of the regression coefficients opposite to the signs of the pair correlations, results in theoretically important variables
having insignificant coefficients, causes a reduction in statistical power, and leads to wider confidence intervals for the
coefficients with the result that they could be incorrectly identified as being insignificant [1,2]. For overcoming the
deficiencies of multicollinearity the one-parameter ridge regression is probably the most widely used technique [3–8],
although some other approaches have also been developed [9–12]. The one-parameter ridge model, or Ridge-1, secures
the solution for an ill-conditioned or even degenerate matrix of predictor correlations. However, in comparison with
ordinary least squares (OLS) regression, Ridge-1 has a worse quality of fit. It also does not satisfy the Pythagorean
relationship between the empirical sum of squares and those explained and not explained by the regression. This makes
Ridge-1 similar to non-linear models, with the results more difficult to interpret. Parameter k of Ridge-1 is usually
∗ Tel.: +1 763 542 0800; fax: +1 763 542 0864. E-mail address: [email protected].
0895-7177/$ - see front matter © 2006 Elsevier Ltd. All rights reserved.
doi:10.1016/j.mcm.2006.01.017
chosen to be a positive number close to zero, because with a larger k the coefficients of the model quickly approach
zero, and the regression quality characteristics quickly diminish. But with a small k the coefficients of Ridge-1 are
usually similar to those of OLS, and therefore could have opposite signs to paired correlation signs, which can impair
the interpretability of the regressor influence in the model.
In this paper a technique of two-parameter ridge regression, or Ridge-2, is considered, and various characteristics
of ridge solutions are described, including the effective number of parameters, the residual error variance, the
bias and variance of the parameter estimates, and the generalized cross-validation (GCV) of the prediction quality
[13–17]. With all these characteristics Ridge-2 always outperforms the Ridge-1 model. Besides providing a better
approximation, Ridge-2 has good properties of orthogonality between residuals and predicted values of the dependent
variable. Unlike for Ridge-1, increasing the parameter k stabilizes the Ridge-2 coefficients, which eventually converge
to a solution proportional to the paired regressions of the dependent variable by the independent ones. Thus, the
multiple regression coefficients attain the interpretability of the paired relations, and at the same time the quality of fit
remains high.
This paper is organized as follows. Sections 2 and 3 describe the ordinary least squares and Ridge-1 regressions,
respectively. Section 4 considers the generalization to Ridge-2 regression, with several criteria for the choice of the
tuning parameter, and convergence to the eventual pairwise ridge solution. Section 5 presents numerical results, and
Section 6 summarizes.
2. OLS regression
Let us consider briefly the ordinary least squares (OLS) regression model and some of its features. For the
standardized (centered and normalized by standard deviation) variables a multiple linear regression is yi = β1xi1 + · · · + βnxin + εi, or in matrix form:
y = ỹ + ε = Xβ + ε, (1)
where X is an N × n matrix with the elements xij of the ith observations (i = 1, . . . , N) of the jth independent variables (j = 1, . . . , n), y is the vector of observations for the dependent variable, β is the nth-order vector of beta coefficients for the standardized regression, ỹ = Xβ is the value of the dependent variable predicted by the model, and ε is
a vector of deviations from the theoretical relationship. The least squares objective for the regression corresponds to
minimizing the sum of squared deviations:
S² = ‖ε‖² = (y − Xβ)′(y − Xβ) = 1 − 2β′r + β′Cβ, (2)
where a prime denotes transposition, the variance of the standardized y equals one, y′y = 1, and the symbols C and r correspond to the correlation matrix C = X′X and the vector of the correlations with the dependent variable r = X′y.
The first-order condition ∂S²/∂β = 0 for the minimum of the objective (2) yields a normal system of equations
with the corresponding solution:
Cβ = r, β = C⁻¹r. (3)
The vector of standardized coefficients of regression β in the OLS solution (3) is defined via the inverse correlation matrix C⁻¹. The quality of the model is estimated using the residual sum of squares (2), or using the coefficient of multiple determination defined as:
R² = 1 − S² = 2β′r − β′Cβ = β′(2r − Cβ). (4)
At the minimum of the objective (2) the coefficient of multiple determination reaches its maximum, which (taking
into account the relations (3)) can be presented in the equivalent forms:
R² = β′r = β′Cβ. (5)
The terms βjrj of the scalar product in (5) are the net effects used to estimate the contribution of the regressors to the
model (1).
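As a concrete illustration of the standardized OLS computations (2)–(5), the following minimal sketch builds C and r from synthetic data and checks the Pythagorean identity (10). All data and variable names here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Synthetic data; the coefficients and noise level are arbitrary choices.
rng = np.random.default_rng(0)
N = 200
X = rng.normal(size=(N, 3))
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=N)

def unit_scale(v):
    """Center and scale to unit norm, per the paper's convention y'y = 1."""
    v = v - v.mean(axis=0)
    return v / np.linalg.norm(v, axis=0)

Xs, ys = unit_scale(X), unit_scale(y)
C = Xs.T @ Xs                 # correlation matrix of predictors, C = X'X
r = Xs.T @ ys                 # correlations with the dependent variable, r = X'y
beta = np.linalg.solve(C, r)  # normal equations C beta = r, Eq. (3)

R2 = beta @ r                              # multiple determination, Eq. (5)
S2 = 1 - 2 * beta @ r + beta @ C @ beta    # residual sum of squares, Eq. (2)
net_effects = beta * r                     # contributions beta_j r_j summing to R2

print(round(R2 + S2, 12))     # Pythagorean identity (10): R2 + S2 = 1  -> 1.0
```

The net effects sum to R² by construction, which is the decomposition used throughout the paper.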
Let us describe some properties of OLS regression needed for further consideration. Using the theoretical vector ỹ from (1), it is possible to represent the system Cβ = r (3) as X′ỹ = X′y. This means that the vector of paired correlations
of predictors with the empirical y coincides with such a vector of correlations with the theoretical ỹ. Using the residual errors ε = y − Xβ = y − ỹ from (1), it is easy to represent the system X′ỹ = X′y as follows:
X′ε = 0. (6)
This shows the orthogonality between each regressor xk (the columns in matrix X) and the vector of residual errors. For non-standardized variables, when the first column in the design matrix represents an identity vector e (corresponding to the intercept in the regression), the first equation in the system (6) expresses the equality of the total errors to zero, e′ε = 0, which also represents the equality of the means of the empirical and theoretical vectors. A scalar product
of the beta coefficients with the vectors in (6) shows that the relation of orthogonality with the errors also holds for the theoretical ỹ:
β′X′ε = ỹ′ε = 0. (7)
Using ε = y − ỹ in (7) yields the equality:
y′ỹ = ỹ′ỹ = β′X′Xβ = R², (8)
or the projection of the empirical on the theoretical vector of the dependent variable equals the theoretical sum of squares for the vector ỹ, which is the coefficient (5) of multiple determination. The sum of squares
y′y = (ỹ + ε)′(ỹ + ε) = ỹ′ỹ + 2ỹ′ε + ε′ε, (9)
using the relation y′y = 1 from (2), the coefficient (8), the orthogonality (7), and the residual sum of squares (2), can be
represented as:
1 = R² + S². (10)
It is a Pythagorean relationship between the unit of the original empirical sum of squares and the sums of squares explained (R²) and not explained (S²) by the regression. This decomposition of the empirical standardized squared norm of the dependent variable is very convenient for estimating the model quality using the total F-statistic defined by the ratio of the variances for the regression and the residual portions in (10).
If any regressors are highly correlated or multicollinear, the matrix C (3) becomes ill-conditioned, so its determinant
is close to zero, and the inverse matrix in (3) produces a solution with highly inflated values of the coefficients
of regression. The values of these coefficients often have signs opposite to the corresponding pair correlations of
regressors with the dependent variable, presumably important variables become statistically insignificant, and some
of the net effects become negative. Such a regression can be used for prediction, but it becomes useless for the analysis
and interpretation of the predictors’ role in the model.
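The sign-reversal effect just described can be reproduced with a small hypothetical example (the correlation values below are invented for illustration): two predictors correlated at 0.95, both positively correlated with y, yet the OLS solution (3) assigns one of them a negative coefficient.

```python
import numpy as np

# Hypothetical ill-conditioned correlation structure: both pair correlations
# with y are positive, but near-collinearity flips one OLS sign.
C = np.array([[1.0, 0.95],
              [0.95, 1.0]])   # predictor correlation matrix, det close to 0
r = np.array([0.9, 0.8])      # pair correlations with y, both positive

beta = np.linalg.solve(C, r)  # OLS solution (3)
print(beta)                   # the second coefficient comes out negative
```

Here the second net effect β₂r₂ is negative, exactly the interpretability failure the text attributes to multicollinearity.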
3. Ridge-1 regression
One-parameter ridge regression is widely used for overcoming the difficulties of multicollinearity. For the
standardized variables, adding to the least squares objective (2) a penalizing function of the squared norm for the
vector of regression coefficients (which prevents their inflation) yields a conditional objective:
S² = ‖ε‖² + k‖βrd‖² = 1 − 2β′rd r + β′rd C βrd + k β′rd βrd, (11)
where βrd denotes a vector of the ridge regression estimates for the coefficients in (1), and k is a positive parameter.
Minimizing the objective (11) using vector βrd yields a system of equations and its corresponding solution as follows:
(C + kI)βrd = r, βrd = (C + kI)⁻¹r, (12)
where I is the identity matrix of nth order. This solution exists even for a singular matrix of correlations C . If k = 0
the Ridge-1 model (11) and (12) reduces to the OLS regression model (2) and (3).
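A minimal sketch of the Ridge-1 solution (12), on the same kind of hypothetical ill-conditioned correlation inputs as above (not the paper's data): a small ridge k already restores the sign of the problematic coefficient, and k = 0 recovers OLS.

```python
import numpy as np

# Hypothetical correlation inputs with strong collinearity.
C = np.array([[1.0, 0.95],
              [0.95, 1.0]])
r = np.array([0.9, 0.8])

def ridge1(C, r, k):
    """Ridge-1 solution (12): (C + kI) beta = r."""
    return np.linalg.solve(C + k * np.eye(C.shape[0]), r)

beta_ols = ridge1(C, r, 0.0)   # k = 0 reduces to the OLS solution (3)
beta_rd = ridge1(C, r, 0.1)    # small ridge shrinks and re-signs the estimates
print(beta_ols, beta_rd)
```

The ridge estimates have a smaller norm than the OLS estimates, which is the inflation control the penalty in (11) provides.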
Ridge-1 has proved to be very helpful in applied regression modeling, although its results lack the quality of fit
of a regular regression, and cannot always be clearly interpreted because its properties are more similar to those of a
non-linear regression. For instance, the coefficient of multiple determination (4) with the solution (12) is:
R²rd = β′rd(2r − Cβrd) = 2r′(C + kI)⁻¹r − r′(C + kI)⁻¹C(C + kI)⁻¹r. (13)
But r′(C + kI)⁻¹r ≠ r′(C + kI)⁻¹C(C + kI)⁻¹r, or in other notation
β′rd r ≠ β′rd C βrd, (14)
so the coefficient of multiple determination (13) of Ridge-1 cannot be reduced to an expression similar to (5) for the OLS model. With the theoretical vector ỹrd = Xβrd from the Ridge-1 solution, the relation (14) can be represented as y′ỹrd ≠ ỹ′rd ỹrd, so Ridge-1 does not satisfy the relations (5) and (10) holding for the OLS regression.
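The inequality (14) can be checked numerically; with the hypothetical correlation inputs used above, β′rd r − β′rd C βrd equals k‖βrd‖², so it is zero only at k = 0.

```python
import numpy as np

# Hypothetical correlation inputs (same illustrative C and r as before).
C = np.array([[1.0, 0.95],
              [0.95, 1.0]])
r = np.array([0.9, 0.8])

for k in (0.0, 0.5, 2.0):
    beta = np.linalg.solve(C + k * np.eye(2), r)   # Ridge-1, Eq. (12)
    lhs, rhs = beta @ r, beta @ C @ beta           # the two sides of (14)
    print(k, round(lhs - rhs, 6))                  # difference vanishes only at k = 0
```

The positive gap lhs − rhs = k‖βrd‖² is exactly the term that breaks the Pythagorean decomposition (10) for Ridge-1.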
The eigenproblem Ca = λa for the matrix of correlations among the regressors yields the spectral decomposition C = A diag(λj) A′, where A is the matrix containing the eigenvectors aj in its columns, and diag(λj) is a diagonal matrix of the eigenvalues λj. By the eigenproblem results, the Ridge-1 solution (12) can be presented as follows:
βrd = A diag((λj + k)⁻¹) A′r. (15)
Increasing k pushes the Ridge-1 solution (15) to zero proportionally to the reciprocal of the k value. The coefficient
of multiple determination (13) can be represented as:
R²rd = r′A diag(2/(λj + k) − λj/(λj + k)²) A′r = r′A diag((λj + 2k)/(λj + k)²) A′r
= r′(C + kI)⁻¹(C + 2kI)(C + kI)⁻¹r = β′rd C βrd + 2k β′rd βrd. (16)
So R²rd ≠ β′rd C βrd, in contrast with the OLS case (5). The quality of fit measured by the coefficient of multiple determination for the Ridge-1 model, (16), reaches zero proportionally to the reciprocal of k.
Some other characteristics of the Ridge-1 model include the effective number of parameters, the residual error variance,
bias and variance of the parameter estimates, and generalized cross-validation (GCV) of the prediction quality. The
effective number of parameters is:
n∗ = Tr((C + kI)⁻¹C) = Σ_{j=1}^{n} λj/(λj + k), (17)
where Tr denotes the trace of the matrix. For k = 0 the value (17) coincides with the total number n of the predictors,
and with a larger k the value of n∗ diminishes. The residual error variance is:
S²res = (1 − R²rd)/(N − n − 1), (18)
where R²rd is defined by (16), and the effective number of parameters n∗, (17), can be used in place of the number n.
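The effective number of parameters (17) can be sketched directly from the eigenvalues; the correlation matrix below is a hypothetical two-predictor example, so n = 2.

```python
import numpy as np

# Hypothetical collinear correlation matrix; eigenvalues are about 1.95 and 0.05.
C = np.array([[1.0, 0.95],
              [0.95, 1.0]])

def n_eff(C, k):
    """Effective number of parameters, Eq. (17): sum of lambda_j/(lambda_j + k)."""
    lam = np.linalg.eigvalsh(C)
    return float(np.sum(lam / (lam + k)))

print(n_eff(C, 0.0))   # 2.0, the full number of predictors at k = 0
print(n_eff(C, 0.5))   # smaller than 2: the ridge shrinks the model's complexity
```

As k grows, each term λj/(λj + k) falls toward zero, which is the monotone decrease of n∗ stated in the text.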
The expectation for the vector of parameters is:
M(βrd) = (C + kI)⁻¹M(r) = (C + kI)⁻¹X′M(y) = (C + kI)⁻¹X′M(ỹ + ε)
= (C + kI)⁻¹(X′Xβrd + M(ε)) = (C + kI)⁻¹Cβrd, (19)
with the error expectation M(ε) = 0. For any k > 0 the matrix product (C + kI)⁻¹C does not coincide with the identity matrix, so the Ridge-1 solution is biased, while the OLS (k = 0) solution is not. A measure of the mean bias can be constructed as the squared norm of the difference between the matrix (19) and the identity matrix divided by the effective number of parameters, (17):
Bias(βrd) = (1/n∗)‖(C + kI)⁻¹C − I‖² = (1/n∗) Tr(diag(λj/(λj + k) − 1)²) = (k²/n∗) Σ_{j=1}^{n} 1/(λj + k)². (20)
The variance of the parameter estimates is defined via the squared norm of the deviation of the solution (12) from its
expectation (19):
Var(βrd) · n = ‖βrd − M(βrd)‖² = β′rd βrd − 2β′rd M(βrd) + M(βrd)′M(βrd). (21)
The generalized cross-validation criterion widely used for choosing the k value is defined as:
GCV = N‖(I − H)y‖² / (Tr(I − H))², (22)
where the hat-matrix H corresponds to the projection of the empirical to the theoretical vector of the dependent variable for the Ridge-1 model, or ỹ = Xβrd = X(C + kI)⁻¹X′y ≡ Hy. Note that Tr(H) = Tr(X′X(C + kI)⁻¹), so it equals the effective number of parameters, (17). Then (22) can be represented explicitly as follows:
GCV = N‖y − ỹ‖²/(N − n∗)² = [1/(1 − n∗/N)] · Σ_{i=1}^{N}(yi − ỹi)²/(N − n∗), (23)
so the meaning of the GCV characteristic is the unbiased residual error variance estimated from the effective number
of parameters. The fraction n∗/ N (the finite population correction) is usually negligible, and the GCV as an unbiased
residual variance has a minimum among the number of variables n∗, or in the k-parameter profile.
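The GCV profile (23) can be traced over k and scanned for its minimum. The sketch below uses synthetic near-collinear data (an illustrative assumption, not the paper's cars data) and the eigenvalue form (17) for Tr(H).

```python
import numpy as np

# Synthetic standardized data with a near-collinear predictor pair.
rng = np.random.default_rng(1)
N = 100
X = rng.normal(size=(N, 3))
X[:, 2] = X[:, 0] + 0.05 * rng.normal(size=N)    # near-collinear column
y = X[:, 0] + X[:, 1] + rng.normal(size=N)

def unit_scale(v):
    v = v - v.mean(axis=0)
    return v / np.linalg.norm(v, axis=0)

Xs, ys = unit_scale(X), unit_scale(y)
C, r = Xs.T @ Xs, Xs.T @ ys
lam = np.linalg.eigvalsh(C)

def gcv(k):
    """GCV profile for Ridge-1, Eq. (23), with Tr(H) from Eq. (17)."""
    beta = np.linalg.solve(C + k * np.eye(3), r)
    resid = ys - Xs @ beta
    n_eff = np.sum(lam / (lam + k))
    return N * (resid @ resid) / (N - n_eff) ** 2

ks = np.logspace(-4, 1, 60)
profile = [gcv(k) for k in ks]
k_best = ks[int(np.argmin(profile))]
print(k_best)
```

With this construction the profile is well defined for every k in the grid, and its argmin plays the role of the chosen ridge parameter.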
4. Ridge-2 regression and its convergence to the eventual ridge model
Consider a generalization of the regularization (11) with two additional items:
S² = ‖y − Xb‖² + k1‖b‖² + k2‖X′y − b‖² + k3‖y′(y − Xb)‖²
= (1 − 2b′r + b′Cb) + k1(b′b) + k2(r′r − 2b′r + b′b) + k3(1 − 2b′r + b′rr′b). (24)
The vector b is an estimator of the coefficients of regression (1) from the multiple objective (24) with positive
parameters k. The first two items in (24) coincide with those in the Ridge-1 objective (11). The next item, with k2, pushes the estimates b closer to the paired correlations r with the dependent variable, which helps to get a solution with interpretable coefficients. The last item, with k3, expresses the relation y′ε = y′y − y′ỹ = 1 − ỹ′ỹ = 1 − R² due to (8), so minimizing it corresponds to obtaining the maximum coefficient of multiple determination. Some other regularization criteria have also been considered [18–20].
Minimization of (24) yields a matrix equation Cb + k1b + k2b + k3rr′b = r + k2r + k3r. The scalar product
r′b can be considered as another constant and combined with the parameter k3, so this item on the left-hand side is
proportional to vector r and can be transferred to the right-hand side of this equation. Then combining constants on
each side of this equation, it is easy to reduce it to the following system with the corresponding solution:
(C + kI)b = qr, b = q(C + kI)⁻¹r, (25)
where k and q are two new constant parameters. It is a Ridge-2 model that differs from the Ridge-1 (12) only in
the parameter q, so the solution of Ridge-2 is proportional to the Ridge-1 solution. This seemingly minor change
produces, as will be shown, very important improvements.
For a current value of the free ridge parameter k , the value of the second parameter q can be found using a criterion
of maximum quality of the regression fit. For such a criterion, consider first the coefficient of multiple determination
(4). Substituting solution (25) into (4) yields a function quadratic in q:
R̃² = 2q[r′(C + kI)⁻¹r] − q²[r′(C + kI)⁻¹C(C + kI)⁻¹r] ≡ 2qQ1 − q²Q2, (26)
where R̃² denotes the coefficient of multiple determination for the Ridge-2 model, and Q1 and Q2 denote the two quadratic forms in the brackets of (26). The value of q where this concave function reaches its maximum is:
q = Q1/Q2 = r′(C + kI)⁻¹r / [r′(C + kI)⁻¹C(C + kI)⁻¹r]. (27)
So the second parameter in the solution (25) is uniquely defined as a quotient of two quadratic forms dependent on
one parameter k . Substituting q from (27) into (26) yields the maximum coefficient of multiple determination, as well
as its representation in the following two equivalent forms:
R̃² = [r′(C + kI)⁻¹r]² / [r′(C + kI)⁻¹C(C + kI)⁻¹r] = r′b = b′Cb. (28)
This expression shows that the coefficient of multiple determination for Ridge-2 is presented similarly to the OLS regression (5).
Using the theoretical vector ỹ = Xb, it is easy to see that the right-hand side equality in (28) presents the property y′ỹ = ỹ′ỹ for Ridge-2, similar to (8) for OLS regression. This equality leads to ỹ′ε = 0, so (7) and all the relations (8)–(10) are true for Ridge-2. The conditions of orthogonality of the predictors with the errors (6) are not fulfilled exactly, but numerical evaluations show that the deviations from orthogonality for Ridge-2 are always very close to zero. Instead of orthogonality for each predictor (6), there exists orthogonality of their linear combination with the weights of the coefficients of the Ridge-2 regression. Indeed, rewriting ỹ′ε = 0 as b′X′ε = 0 shows that the scalar product of the vector of the solution b with the vector X′ε of deviations from orthogonality equals zero.
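These Ridge-2 identities can be verified numerically. The sketch below (synthetic data, an illustrative assumption) computes q = Q1/Q2 per (27) and checks R̃² = r′b = b′Cb from (28) together with the orthogonality b′X′ε = 0.

```python
import numpy as np

# Synthetic standardized data with a near-collinear predictor pair.
rng = np.random.default_rng(2)
N = 120
X = rng.normal(size=(N, 3))
X[:, 2] = X[:, 0] + 0.05 * rng.normal(size=N)
y = X[:, 0] + X[:, 1] + rng.normal(size=N)

def unit_scale(v):
    v = v - v.mean(axis=0)
    return v / np.linalg.norm(v, axis=0)

Xs, ys = unit_scale(X), unit_scale(y)
C, r = Xs.T @ Xs, Xs.T @ ys

k = 1.0
s = np.linalg.solve(C + k * np.eye(3), r)   # (C + kI)^{-1} r
Q1 = r @ s                                  # r'(C+kI)^{-1} r
Q2 = s @ C @ s                              # r'(C+kI)^{-1} C (C+kI)^{-1} r
q = Q1 / Q2                                 # tuning parameter, Eq. (27)
b = q * s                                   # Ridge-2 solution, Eq. (25)

R2_tilde = 2 * q * Q1 - q ** 2 * Q2         # Eq. (26) at its maximum
eps = ys - Xs @ b                           # Ridge-2 residuals
print(round(abs(R2_tilde - r @ b), 12),
      round(abs(b @ Xs.T @ eps), 12))       # 0.0 0.0
```

Both checks hold to machine precision: the maximized (26) equals r′b, and the solution vector is orthogonal to the deviations X′ε even though each predictor separately is not.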
Thus, in the Ridge-2 solution (25) the term k serves for regularization of an ill-conditioned matrix, and the term q
is used for tuning the fit quality. The coefficient of multiple determination for the Ridge-2 model with the additional
parameter is always bigger than that of the Ridge-1 model. Using the term (27) in (25) yields the Ridge-2 solution in
the explicit form:
b = [r′(C + kI)⁻¹r / (r′(C + kI)⁻¹C(C + kI)⁻¹r)] (C + kI)⁻¹r. (29)
Consider the behavior of this solution as the parameter k increases. In the limit of big k, the matrix C + kI acquires a dominant diagonal, so the inverse matrix (C + kI)⁻¹ reduces to the scalar matrix k⁻¹I. The terms k⁻² in the numerator and denominator of (29) mutually cancel, and the Ridge-2 solution eventually converges to an asymptote independent of k:
b = [k⁻²r′r / (k⁻²r′Cr)] r = [r′r/(r′Cr)] r ≡ γr, (30)
where the constant γ is defined as the always positive ratio of two quadratic forms. Thus, unlike the Ridge-1 solution (12), which diminishes to zero, the coefficients of the Ridge-2 solution become proportional to the elements of the vector r, (30), of the pairwise regressions, whose beta coefficients equal the pair correlations of y with each regressor. The signs of the Ridge-2 coefficients b coincide with the signs of the pair correlations r. This guarantees clear interpretability of the eventual Ridge-2 solution, and the positive net effect contributions bjrj = γrj² (5) of the regressors to the coefficient of multiple determination (28).
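The pairwise limit (30) is easy to see numerically: for a very large k the Ridge-2 solution coincides with γr up to a vanishing correction. Correlation inputs below are hypothetical.

```python
import numpy as np

# Hypothetical correlation inputs.
C = np.array([[1.0, 0.95],
              [0.95, 1.0]])
r = np.array([0.9, 0.8])

def ridge2(C, r, k):
    """Ridge-2 solution b = q (C + kI)^{-1} r with q from Eq. (27)."""
    s = np.linalg.solve(C + k * np.eye(len(r)), r)
    q = (r @ s) / (s @ C @ s)
    return q * s

gamma = (r @ r) / (r @ C @ r)        # limit constant, Eq. (30)
b_big_k = ridge2(C, r, 1e8)          # very large ridge parameter
print(np.allclose(b_big_k, gamma * r))   # True
```

The limit keeps the signs of r, so the eventual coefficients inherit the interpretability of the pair correlations.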
Behaviors of the tuning parameter q (27) and of the coefficient of multiple determination (28) with k increasing
are as follows, respectively:
q = k⁻¹r′r / (k⁻²r′Cr) = γk, R̃² = k⁻²(r′r)² / (k⁻²r′Cr) = γ(r′r), (31)
with the same constant γ as in (30). Thus, eventually q is linearly proportional to k, but the coefficient of multiple determination goes to a constant independent of k as k increases, so the quality of fit is not decreasing. The constant
of proportionality in (30) and (31) can be estimated using the squared norm defined by the maximum eigenvalue of
the eigenproblem Ca = λa, so:
γ = r′r/(r′Cr) ≥ ‖r‖²/(λmax‖r‖²) = 1/λmax. (32)
Numerical runs support the above-described features of the eventual ridge regression. For increasing parameter k, the Ridge-2 coefficient of multiple determination R̃², (28), stays consistently close to the maximum R² value, (5), of the OLS model, in accordance with (31), while the Ridge-1 coefficient R²rd, (16), quickly diminishes to zero. This means that in Ridge-2 modeling, k can be increased without losing the quality of regression until the asymptotic solution (30) is reached, with the interpretable coefficients of multiple regression proportional to the pair correlations of y with the x’s.
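The contrast between (16) and (28) for a large k can be sketched on the hypothetical correlation inputs used earlier: the Ridge-1 fit collapses while the Ridge-2 fit stays near its asymptote γr′r.

```python
import numpy as np

# Hypothetical correlation inputs.
C = np.array([[1.0, 0.95],
              [0.95, 1.0]])
r = np.array([0.9, 0.8])

def r2_ridge1(k):
    """Ridge-1 coefficient of multiple determination, Eq. (13)/(16)."""
    beta = np.linalg.solve(C + k * np.eye(2), r)
    return 2 * beta @ r - beta @ C @ beta

def r2_ridge2(k):
    """Ridge-2 coefficient of multiple determination, Eq. (28)."""
    s = np.linalg.solve(C + k * np.eye(2), r)
    return (r @ s) ** 2 / (s @ C @ s)

print(r2_ridge1(50.0) < 0.1, r2_ridge2(50.0) > 0.5)   # True True
```

At k = 50 the Ridge-1 fit has essentially vanished, while the Ridge-2 value is already close to its limit γr′r from (31).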
Consider characteristics of Ridge-2 regression similar to those of the relations (17)–(22) of Ridge-1. The effective number of parameters n∗2 for Ridge-2 is defined as (17) multiplied by the term q, (27). For big k the term q becomes proportional to k, (31), and using (32) yields:
n∗2 = qTr((C + kI)⁻¹C) = (k/λmax)(Tr(C)/k) = n/λmax. (33)
So the effective number of parameters for Ridge-2 reaches an asymptote. The residual error variance (18) for Ridge-2 can be found from the coefficient of multiple determination, (28). Using its eventual value, (31), and the effective number (33), it becomes:
S²res = (1 − r′r/λmax)/(N − n/λmax − 1), (34)
so for big k the residual variance is also stabilized.
The expectation for the Ridge-2 parameters equals the Ridge-1 expression (19) multiplied by q:
M(b) = q(C + kI)⁻¹M(r) = q(C + kI)⁻¹X′M(ỹ + ε) = q(C + kI)⁻¹Cb. (35)
Unlike for (19), where the bias increases with the k value, the bias of the Ridge-2 coefficients stabilizes. Indeed, due to (31) and (32), the expression (35) reduces to M(b) = (1/λmax)Cb, which does not depend on k. The mean bias for Ridge-2 can be constructed similarly to that for Ridge-1, (20). Eventually, for big k, this expression goes to a constant:
Bias(b) = (1/n∗2)‖q(C + kI)⁻¹C − I‖² = (λmax/n) Tr(diag(λj/λmax − 1)²), (36)
where the relations (31)–(33) are used. It is interesting to note that the scalar product of the expectation (35) with the
vector of correlations yields exactly the expression (28):
r′M(b) = qr′(C + kI)⁻¹Cb = q²r′(C + kI)⁻¹C(C + kI)⁻¹r = R̃², (37)
where q, (27), and b, (25), are used. So in spite of the bias, the coefficient of multiple determination for Ridge-2 is the same for the vector of the solution b (25) and for its expectation M(b) (35). This property does not hold for Ridge-1 regression. The efficiency of the solution is defined according to the variance of the parameter estimates for Ridge-2, similarly to the case for Ridge-1, (21), via the squared norm of the deviation of the solution (25) from its expectation, (35):
Var(b) · n = ‖b − M(b)‖² = b′b − 2qb′(C + kI)⁻¹Cb + q²b′(C + kI)⁻²C²b. (38)
The variance can be estimated using the effective number of parameters (33) in place of the total n.
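The identity (37) is easy to confirm numerically: the expectation M(b) differs from b, yet both give the same coefficient of multiple determination against r. Correlation inputs below are hypothetical.

```python
import numpy as np

# Hypothetical correlation inputs.
C = np.array([[1.0, 0.95],
              [0.95, 1.0]])
r = np.array([0.9, 0.8])
k = 1.0

A = np.linalg.inv(C + k * np.eye(2))   # (C + kI)^{-1}
s = A @ r
q = (r @ s) / (s @ C @ s)              # tuning parameter, Eq. (27)
b = q * s                              # Ridge-2 solution, Eq. (25)
Mb = q * A @ C @ b                     # expectation of the estimates, Eq. (35)
R2_tilde = (r @ s) ** 2 / (s @ C @ s)  # Ridge-2 fit, Eq. (28)

print(round(abs(r @ Mb - R2_tilde), 12))   # 0.0, the identity (37)
```

Note that Mb and b themselves are visibly different vectors here, which is exactly the bias the text describes; only their products with r coincide.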
The tuning parameter q can be found not only from the criterion (26) of the maximum of the coefficient of multiple determination. Consider a criterion of the minimum variance of the coefficient estimates, (38): with b from (25), the expression (38) reduces to:
Var(b) · n = q²[r′(C + kI)⁻²r] − 2q³[r′(C + kI)⁻³Cr] + q⁴[r′(C + kI)⁻⁴C²r] ≡ q²Q3 − 2q³Q4 + q⁴Q5, (39)
where Q3, Q4, and Q5 denote the corresponding quadratic forms in the brackets (this notation is used to distinguish
the quadratic forms in (39) from those in (26) and (27)). Note that the matrix C and the matrices of its functions
are commutative because their sets of eigenvectors are the same. The derivative of (39) with respect to q yields
an equation 2q(Q3 − 3qQ4 + 2q²Q5) = 0, and the only root defining the local minimum of the variance (39) is q = (3Q4 + √(9Q4² − 8Q3Q5))/(4Q5). This q guarantees the maximum efficiency of the estimates and can be used in the solution (25) and (26) and in other characteristics of the Ridge-2 regression.
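The minimum-variance root of (39) can be checked numerically: the quadratic forms Q3, Q4, Q5 are built from powers of (C + kI)⁻¹ (which commute with C), and the closed-form root makes the derivative of (39) vanish. Inputs are hypothetical.

```python
import numpy as np

# Hypothetical correlation inputs and an arbitrary ridge parameter.
C = np.array([[1.0, 0.95],
              [0.95, 1.0]])
r = np.array([0.9, 0.8])
k = 0.5

A = np.linalg.inv(C + k * np.eye(2))    # (C + kI)^{-1}
Q3 = r @ A @ A @ r                      # r'(C+kI)^{-2} r
Q4 = r @ A @ A @ A @ C @ r              # r'(C+kI)^{-3} C r
Q5 = r @ A @ A @ A @ A @ C @ C @ r      # r'(C+kI)^{-4} C^2 r

# Closed-form nonzero stationary root of Eq. (39).
q = (3 * Q4 + np.sqrt(9 * Q4 ** 2 - 8 * Q3 * Q5)) / (4 * Q5)

# Derivative of q^2 Q3 - 2 q^3 Q4 + q^4 Q5 with respect to q.
grad = 2 * q * Q3 - 6 * q ** 2 * Q4 + 4 * q ** 3 * Q5
print(round(abs(grad), 10))   # 0.0
```

For these inputs the discriminant 9Q4² − 8Q3Q5 is positive, so the root is real, and the derivative vanishes at it to machine precision.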
The tuning parameter q can also be found by minimization of the GCV criterion (22) applied to Ridge-2. The
estimated value of the dependent variable for the solution (25) is ỹ = Xb = qX(C + kI)⁻¹X′y ≡ H2y, where the
projection hat-matrix for the Ridge-2 model is denoted as H2. Substituting it into the GCV, (22), yields:
GCV = N[1 − 2q(r′(C + kI)⁻¹r) + q²(r′(C + kI)⁻¹C(C + kI)⁻¹r)] / (N − qTr((C + kI)⁻¹C))² = N(1 − 2qQ1 + q²Q2)/(N − qn∗)², (40)
where Q1 and Q2 denote the same quadratic forms as in (26), and n∗ is the effective number of parameters, (17). The
expression in the numerator of (40) corresponds to the residual sum of squares 1 − R̃², with R̃² defined in (26), so the GCV criterion (40) can be interpreted as an unbiased residual variance for the Ridge-2 model. Minimizing (40) with respect to the parameter q yields the expression q = (Q1 − n∗/N)/(Q2 − Q1n∗/N). The fraction n∗/N is usually small and diminishes as k increases, so this expression is actually close to q = Q1/Q2, that is, the same q parameter as in (27). This guarantees the predictive optimum due to the GCV criterion (40). Numerical simulations demonstrate that the results of Ridge-2 modeling are very similar with any of the criteria, (26), (39), or (40), used to define the tuning parameter q.
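The closeness of the GCV-based tuning to the maximum-R̃² value can be sketched directly; the correlation inputs are hypothetical and N plays the role of the sample size.

```python
import numpy as np

# Hypothetical correlation inputs and sample size.
C = np.array([[1.0, 0.95],
              [0.95, 1.0]])
r = np.array([0.9, 0.8])
k, N = 0.5, 100

A = np.linalg.inv(C + k * np.eye(2))   # (C + kI)^{-1}
Q1 = r @ A @ r                         # r'(C+kI)^{-1} r
Q2 = r @ A @ C @ A @ r                 # r'(C+kI)^{-1} C (C+kI)^{-1} r
n_star = np.trace(A @ C)               # effective number of parameters, Eq. (17)

q_r2 = Q1 / Q2                                         # maximum-R2 value, Eq. (27)
q_gcv = (Q1 - n_star / N) / (Q2 - Q1 * n_star / N)     # GCV minimizer of Eq. (40)
print(abs(q_gcv - q_r2) < 0.05 * q_r2)                 # True
```

With n∗/N below one percent here, the two tuning rules differ by a fraction of a percent, illustrating why the criteria give very similar Ridge-2 models.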
All regular statistical characteristics of a multiple regression can be constructed for the Ridge-2 model as well,
including the variance–covariance matrix for the coefficient estimates and their t-statistics. The F-statistic for Ridge-2 regression with the property of the regular decomposition (10) can be defined similarly to OLS regression via the
coefficient of multiple determination. The transformation of the standardized coefficients b to the coefficients a in the
original units can be performed in the same way as for the regular regression, and the Ridge-2 solution is invariant with respect to the scale transformation of the variables, similarly to the OLS regression.
Thus, Ridge-2 can reach the eventual pair regression coefficients in multiple regression without losing the quality
of fit and prediction. Suppose all paired correlations in r are positive, or the scales of the predictors with negative
correlations with y can be reversed to make all correlations positive. The positive solution in the ridge modeling can
be achieved when the matrix C + k I with k increasing begins to satisfy the conditions of the Farkas lemma, or some
other known criteria [21,22]. It is always possible to obtain the positive solution simply by increasing k , and no matter
how large k is for the Ridge-2 model, the eventual model is practically at the same level of quality of fit.
5. Numerical examples
Consider numerical runs for the various characteristics of the ridge models traced by means of the increasing k value. For example, the cars data set from [23] is used, which is also available in [24] (“cu.summary” data file). The data
contains dimensions and mechanical specifications for 111 different cars, supplied by the manufacturers and measured
for consumer reports. The variables are: y — weight, pounds; x1 — length overall, inches; x2 — wheel base length,
inches; x3 — width, inches; x4 — height, inches; x5 — front–head distance between the car’s head-liner and the head
of a 5 ft 9 in. front seat passenger, inches; x6 — rear–head distance between the car’s head-liner and the head of a 5 ft
9 in. rear seat passenger, inches; x7 — front leg room maximum, inches; x8 — rear seating, the fore-and-aft room,
inches; x9 — front shoulder room, inches; x10 — rear shoulder room, inches; x11 — turning circle radius, feet; x12
— displacement of the engine, cubic inches; x13 — HP, the net horsepower; x14 — tank fuel refill capacity, gallons; x15 — HP revolution red line, or the maximum safe engine speed, rpm. The cars’ weights are used for the regression
modeling with the dimension and specification variables. Weight is an aggregate characteristic that has a strong impact
on other cumulative characteristics of a car, such as mileage per gallon, and price (their correlations with weight equal
−0.87 and 0.70, respectively). The aim of the modeling consists of the evaluation of the predictor contributions to the weight variation, which in turn can help to identify the better engineering solutions.
A general view of the main characteristics of regressions is presented in Fig. 1, where various characteristics are
traced by means of the k parameter of ridge regressions. Three variants of the Ridge-2 regressions are presented: with
the tuning parameter q defined by the criteria of maximum coefficient of multiple determination (26), of minimum
variance (39), and of minimum GCV (40). In the curves in Fig. 1, these criteria are denoted as maxR2, minVar, and minGCV. In Fig. 1(A) we see that these three parameters q behave very similarly: they have similar positive constant slopes (31) as k increases, while for Ridge-1 this parameter is constant, q = 1. The curve for q defined by the maxR2 criterion is higher than the other curves.
In Fig. 1(B) the effective number of parameters is shown. The value of n∗ for Ridge-1 is diminishing, by (17), while for Ridge-2 it stabilizes, in accordance with (33), which in this case is n∗2 = n/λmax = 15/7.16 = 2.09. The best performance corresponds to the q curve defined by the maxR2 criterion. In Fig. 1(C) the residual error variance
(A) Tuning parameter q. (B) Effective number of parameters.
(C) Residual error variance. (D) Coeff. of multiple determination.
(E) Quotient (beta∗r /R2). (F) Bias of beta estimates.
(G) Variance of beta estimates. (H) Generalized cross-validation.
Fig. 1. Ridge trace of the main regression characteristics.
(18) and (34) (with the effective number of parameters) is shown — it grows quickly for Ridge-1 but becomes stable
for Ridge-2 models. The lowest error variance corresponds to the curve of q defined by the maxR2 criterion.
In Fig. 1(D) the coefficients of multiple determination are presented — they go to the constant asymptote (31)
slightly below the maximum value for Ridge-2, but decrease hyperbolically, (16), for Ridge-1. The best values of the
multiple determination are obtained, naturally, with q defined by the maxR2 criterion. Fig. 1(E) shows the quotient of
the scalar product b′r divided by the coefficient of multiple determination. Only for the Ridge-2 model with q defined by maxR2 is this ratio, by (28), identically equal to one. For the other two variants of q, with the
minimum variance or GCV, this ratio is less than one but still constant, while for Ridge-1 with inequality (14), this ratio
decreases as k grows. This means that Ridge-2 with q, (27), and Ridge-1 have the best and the worst orthogonality
features, respectively.
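The quotient just discussed can be illustrated with a small numerical sketch. The paper's formulas (14), (27) and (28) are not reproduced in this section, so the code below assumes the Ridge-1 solution b = (C + kI)⁻¹r in standardized variables and a Ridge-2 estimator of the assumed form β = qb, with the scalar q chosen to maximize R² = 2β′r − β′Cβ, which gives q = (b′r)/(b′Cb). Under those assumptions the quotient β′r/R² is identically one for Ridge-2, matching the flat maxR2 curve in Fig. 1(E).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic collinear data (illustrative only; not the cars data set).
n, p = 100, 5
Z = rng.normal(size=(n, 2))
X = Z[:, [0, 0, 1, 1, 0]] + 0.3 * rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.5, 0.5, 0.2, 0.1]) + rng.normal(size=n)

# Standardize, so X'X/n is the correlation matrix C and X'y/n the vector r.
X = (X - X.mean(0)) / X.std(0)
y = (y - y.mean()) / y.std()
C = X.T @ X / n
r = X.T @ y / n

def ridge1(k):
    """One-parameter ridge solution b = (C + kI)^(-1) r."""
    return np.linalg.solve(C + k * np.eye(p), r)

def ridge2(k):
    """Assumed two-parameter form beta = q*b, with q maximizing
    R^2 = 2*beta'r - beta'C*beta, i.e. q = (b'r)/(b'Cb)."""
    b = ridge1(k)
    q = (b @ r) / (b @ C @ b)
    return q * b, q

def r2(beta):
    """Coefficient of multiple determination in standardized variables."""
    return 2 * beta @ r - beta @ C @ beta

for k in (0.1, 1.0, 10.0):
    b1 = ridge1(k)
    b2, q = ridge2(k)
    # For Ridge-2 with the maxR2 q, the quotient beta'r / R^2 equals one.
    print(k, round(r2(b1), 3), round(r2(b2), 3), round((b2 @ r) / r2(b2), 6))
```

In the printout the Ridge-2 R² stays above the Ridge-1 value at every k, and the last column stays at one.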
Fig. 1(F) presents the bias of the coefficient estimates (20) and (36) for Ridge-1 and Ridge-2 — again, this
characteristic is worsening for the former and is fairly stable for the latter model. Ridge-2 with q defined by maxR2
demonstrates the best results. The next plot, Fig. 1(G), shows the variance of the coefficient estimates (constructed
with the effective number of parameters) — the Ridge-1 variance (21) is quickly growing while the Ridge-2 variance
(38) reaches a constant asymptote. The Ridge-2 curves with q defined by minVar and minGCV have the smallest
coefficient variances, and the curve of maxR2 is very close to them.
And Fig. 1(H) presents the behavior of the GCV, (23) and (40) — after its minimum, the GCV rises quickly for the Ridge-1 model, but it stays at practically the same level for the Ridge-2 regressions. This means that the unbiased estimate of the residual error worsens for Ridge-1 but is consistent for Ridge-2 along the parameter k profiling.
Here too the Ridge-2 curves with q defined by minVar and minGCV have the smallest unbiased residual variances,
although the curve of maxR2 is practically indistinguishable from them.
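The GCV profile for the one-parameter model can be sketched with the standard Golub–Heath–Wahba criterion. This is a generic illustration on synthetic data, not the paper's expressions (23) and (40); n_eff below is the effective number of parameters tr H(k) = Σ λj/(λj + k).

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic standardized data (illustrative only).
n, p = 100, 5
X = rng.normal(size=(n, p))
X = (X - X.mean(0)) / X.std(0)
y = X @ rng.normal(size=p) + rng.normal(size=n)
y = (y - y.mean()) / y.std()

C = X.T @ X / n
lam = np.linalg.eigvalsh(C)

def gcv(k):
    """Golub-Heath-Wahba GCV: (RSS/n) / (1 - n_eff/n)^2,
    where n_eff = tr(hat matrix) = sum of lambda_j / (lambda_j + k)."""
    beta = np.linalg.solve(C + k * np.eye(p), X.T @ y / n)
    resid = y - X @ beta
    n_eff = np.sum(lam / (lam + k))
    return (resid @ resid / n) / (1 - n_eff / n) ** 2

ks = np.linspace(0.01, 5.0, 200)
scores = [gcv(k) for k in ks]
k_best = ks[int(np.argmin(scores))]
print("GCV-optimal k:", round(k_best, 3))
```

Tracing gcv(k) over a grid of k values reproduces the qualitative shape of the Ridge-1 curve in Fig. 1(H): a minimum followed by steady growth.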
Fig. 2 presents the ridge profiles for each of the beta coefficients of the regressions with fifteen regressors. We
see that for each beta coefficient the curves of Ridge-2 with q taken according to the criteria of maxR2, minVar, and minGCV are practically indistinguishable, while Ridge-1 is always distant from them. Along k, the Ridge-2 profiles of the beta coefficients go to constant asymptotes eventually proportional to the paired correlation of y with each x, (30), while the Ridge-1 betas (15) promptly diminish to zero (horizontal line). Several variables (x5, x8, x9, and x12) have
negative signs for k = 0, or for the OLS model. These signs of the coefficients in multiple regressions are opposite
to the signs of the paired correlations of these x’s with y. At k = 0.05, two of these regressors, x9 and x12, already
have a positive relation to the dependent variable. Eventually, x5 attains a positive coefficient at k = 1.95; then x8 also
becomes positive, beginning at k = 6.6. The last variable x15 is negatively correlated with the dependent variable, so
its coefficient stays negative all along the trace.
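The sign reversals along k can be reproduced in a minimal two-variable sketch. The correlation values below are hypothetical, not taken from the cars data; they are chosen so that the OLS solution C⁻¹r gives the second coefficient a sign opposite to its positive correlation with y, while the ridge solution (C + kI)⁻¹r turns it positive at a small k and becomes proportional to r as k grows.

```python
import numpy as np

# Hypothetical correlations: two highly correlated predictors, both
# positively correlated with y, producing an OLS sign flip.
C = np.array([[1.0, 0.9],
              [0.9, 1.0]])
r = np.array([0.9, 0.8])

def beta_ridge1(k):
    """Ridge trace of the standardized coefficients, (C + kI)^(-1) r."""
    return np.linalg.solve(C + k * np.eye(2), r)

for k in (0.0, 0.05, 0.5, 5.0, 50.0):
    b = beta_ridge1(k)
    # At k = 0 (OLS) the second coefficient is negative; as k grows the
    # solution becomes proportional to r, so both signs turn positive.
    print(k, np.round(b, 4))
```

Rescaling the trace by a positive scalar q does not change the signs, so the same qualitative picture applies to the two-parameter profiles in Fig. 2.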
The last plot, Fig. 3, presents the ridge trace of the absolute value of the t -statistics for the beta coefficients of
regressions. Again, the results are noticeably better for Ridge-2 than for Ridge-1 models. This is true for two reasons.
First, the beta coefficients themselves have higher absolute values for Ridge-2 than for Ridge-1 regressions. Second, the standard errors of the beta coefficients, being proportional to the regression residual errors √(1 − R²), are smaller for Ridge-2, with its higher value of multiple determination, than for Ridge-1. For instance, taking k = 10, the Ridge-2 and Ridge-1 coefficients of multiple determination are R̃2 ≈ 0.88 and R2rd ≈ 0.55 (Fig. 1(D)). The t-statistics are reciprocally proportional to the standard errors of the models, so those for the Ridge-2 and Ridge-1 models are related as √((1 − R2rd)/(1 − R̃2)) ≈ 1.9. So the t-values for the beta coefficients are approximately twice as large for
Ridge-2 as for Ridge-1 — see Fig. 3. Most of the coefficient t-value curves go above the critical t-value (the horizontal line at 1.96, corresponding to the 95% confidence level for a two-tailed normal distribution, or approximately for t-statistics with more than a hundred degrees of freedom). Only the x8 variable eventually becomes insignificant in all the models.
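The quoted factor of roughly two follows directly from the R² values read off Fig. 1(D): standard errors scale with √(1 − R²), so the t-statistics scale inversely.

```python
import math

# R^2 values at k = 10 quoted in the text (Fig. 1(D)).
r2_ridge1 = 0.55
r2_ridge2 = 0.88

# t-statistics are inversely proportional to the standard errors,
# which scale with sqrt(1 - R^2).
ratio = math.sqrt((1 - r2_ridge1) / (1 - r2_ridge2))
print(round(ratio, 2))  # ≈ 1.94, i.e. roughly twice as large
```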
Considering all the characteristics described above, we prefer the model with q defined by the maxR2 criterion, (27), since it covers most of the optimal qualities, performs best by the coefficient of multiple determination, and possesses the property of orthogonality (28). In some numerical runs the minVar and minGCV criteria could yield unfeasible solutions, such as a negative q for a particular k value, while the maxR2 criterion (27) works well, always producing a positive q. Thus, the q obtained using the maxR2 criterion is recommended for practical applications.
Table 1 presents results for the regular OLS and Ridge-2 regressions. First, the pair correlations of the dependent
variable with the regressors are shown. In the next three columns the OLS beta coefficients (3), together with their
t -statistics and the net effects, are presented. The variables x5, x8, x9, x12 have opposite signs in their relationship
with the dependent variable in the multiple regression in comparison with correlations because of multicollinearity.
The total of the net effects equals the coefficient (5) of multiple determination R2 = 0.95, and the F -statistic equals
117.7, so the quality of the model is high. The results of Ridge-2 modeling are given on the right-hand side of Table 1.
The choice of q was made using the criterion (27). The solution for the multiple regression with coefficients of the same signs
as their correlations with the dependent variable is reached with the ridge parameter k = 6.6, and the corresponding
Fig. 2. Ridge trace of the beta coefficients.
tuning parameter is q = 1.98. The next two columns after the Ridge-2 beta coefficients contain their t -statistics (these
are noticeably higher than in the OLS regression), and deviations from orthogonality, (6), which are close to zero. The
last column shows the net effects and their total, equal to the coefficient (28) of multiple determination ˜ R2 = 0.88 (the
F -statistic is 44.9), which is close to the quality of the OLS regression. All net effects are positive and can be compared
according to their contributions. For the corresponding Ridge-1, the inequality (14) is numerically 0.666 > 0.445.
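The net-effect decomposition reported in Table 1 can be sketched under the assumption, consistent with the totals in the table, that the j-th net effect is βj rj in standardized variables. At the OLS solution Cβ = r, so R² = 2β′r − β′Cβ = β′r, and the net effects sum exactly to R². The small correlation matrix below is illustrative, not the cars data.

```python
import numpy as np

# Hypothetical correlation matrix C and correlations r with y.
C = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])
r = np.array([0.7, 0.6, 0.5])

beta = np.linalg.solve(C, r)            # standardized OLS coefficients
net = beta * r                          # assumed net effects: beta_j * r_j
R2 = 2 * beta @ r - beta @ C @ beta     # coefficient of multiple determination

print("net effects:", np.round(net, 3))
print("total:", round(net.sum(), 3), " R2:", round(R2, 3))
```

With collinear predictors a βj can take the sign opposite to rj, producing the negative net effects seen in the OLS columns of Table 1.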
Fig. 3. Ridge trace of the t -statistics for the beta coefficients.
The ratio of coefficients of multiple determination of Ridge-2 and Ridge-1 is 1.33, so the Ridge-2 model performs
better by 33%.
To estimate the stability of the regression quality, a hundred random subsets of 80 out of the 100 available observations were drawn, and the coefficients of multiple determination for the OLS, Ridge-1, and Ridge-2 models were evaluated. The results are presented on the left-hand side of Table 2. The regular regression
has the highest R2, the Ridge-2 regression follows it closely, and Ridge-1 has noticeably worse results. Also, about
Table 1
Cars data: OLS and Ridge-2 regressions

Variable           Cor    OLS                           Ridge-2
                          Beta   t-Stat.  Net eff.      Beta   t-Stat.  Orthog.  Net eff.
x1  Length         .81    .36     5.9      .29          .12     18.2     −.01     .10
x2  Wheel          .80    .11     1.7      .09          .11     17.8     −.04     .09
x3  Width          .86    .19     2.3      .16          .13     20.2     −.02     .11
x4  Height         .44    .32     5.9      .14          .07      9.0      .06     .04
x5  Front.Hd       .25   −.05    −1.5     −.01          .02      2.6     −.11     .00
x6  Rear.Hd        .34    .01     0.2      .00          .05      6.1     −.03     .01
x7  Front.Leg      .31    .02     0.8      .01          .05      5.5      .02     .02
x8  Rear.Seat      .15   −.18    −3.6     −.03          .00      0.0     −.15     .00
x9  Front.Shld     .82   −.03    −0.4     −.02          .12     18.6     −.02     .09
x10 Rear.Shld      .21    .04     0.9      .01          .02      2.2     −.08     .00
x11 Turning        .77    .04     0.9      .03          .11     15.7     −.03     .08
x12 Disp.          .85   −.09    −1.0     −.07          .13     19.6      .00     .11
x13 HP             .70    .35     5.7      .25          .12     15.5      .10     .10
x14 Tank           .85    .08     1.5      .07          .13     19.2      .04     .12
x15 HP.revs       −.48   −.08    −1.4      .04         −.06     −7.3      .09     .02
R2                                         .95                                    .88
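As a rounding-level arithmetic check on Table 1, each OLS net effect should equal the product of the beta coefficient and the pair correlation; a few rows copied from the table:

```python
# (cor, OLS beta, OLS net effect) copied from Table 1.
rows = {
    "x1":  (0.81,  0.36,  0.29),
    "x4":  (0.44,  0.32,  0.14),
    "x13": (0.70,  0.35,  0.25),
    "x15": (-0.48, -0.08, 0.04),   # negative beta times negative cor
}
for name, (cor, beta, net) in rows.items():
    print(name, round(beta * cor, 2), net)  # products agree up to rounding
```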
Table 2
Cars data: sampling for R2 and the residual root mean square error
Statistic R2 RMSE
OLS k Ridge-1 q Ridge-2 OLS k Ridge-1 q Ridge-2
Min .94 1.50 .66 1.25 .87 135.2 1.50 179.4 1.27 160.6
Max .97 7.00 .88 2.02 .93 302.2 6.50 353.3 1.91 241.1
Mean .95 3.68 .78 1.56 .90 172.3 3.67 270.5 1.56 200.3
Std .01 1.35 .06 0.19 .01 31.9 1.45 41.0 0.21 21.0
half of the observations were randomly picked out, and the other half were used in the forecasting of the dependent
variable using the models. Estimations of the residual root mean square error (RMSE) are presented on the right-
hand side of Table 2. Again, the Ridge-2 results are close to the OLS model, having practically the same predictive
power. But unlike the OLS regression, the Ridge-2 model has all of its regression coefficients interpretable, with positive net effects.
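The resampling scheme just described can be outlined generically; the sketch below uses synthetic data (the cars data set is not reproduced) and evaluates R² of the OLS fit over a hundred 80-of-100 subsamples.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data standing in for the 100-car data set.
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

def ols_r2(Xs, ys):
    """R^2 of an OLS fit with intercept (via centering)."""
    Xc = Xs - Xs.mean(0)
    yc = ys - ys.mean()
    beta, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    resid = yc - Xc @ beta
    return 1 - (resid @ resid) / (yc @ yc)

r2s = []
for _ in range(100):
    idx = rng.choice(n, size=80, replace=False)
    r2s.append(ols_r2(X[idx], y[idx]))

print("mean R2:", round(float(np.mean(r2s)), 3),
      "std:", round(float(np.std(r2s)), 3))
```

The Ridge-1 and Ridge-2 columns of a table like Table 2 can be obtained the same way by swapping in the corresponding ridge estimators.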
For the second example, a data set from a real study of customer satisfaction with a bank is used. The variables are:
y — overall satisfaction with the bank, x1 — likelihood of using the bank in future, x2 — likelihood of recommending
the bank, x3 — satisfaction with the value provided, x4 — satisfaction with loan experience, x5 — service best fitted
to needs, x6 — feeling like a valued customer, x7 — approval time versus expectations, x8 — kept informed on
financial status, x9 — promptness of communication, x10 — satisfaction with problem processing. There are about 5600 observations, and all the variables are positively correlated. The research investigated the predictors' contributions to the explanation of the dependent variable's variance.
Table 3 (arranged similarly to Table 1) shows that the coefficient of multiple determination R2 = 0.64 (the F-statistic equals 980), so the model is of high quality. However, the variables x7 and x9 have negative net effects,
although each regressor increases the coefficient of determination, which is a multicollinearity effect. The Ridge-2
model reached all positive coefficients with the ridge parameter k = 0.07, and the corresponding second parameter
is q = 1.018. The next columns contain t -statistics for the coefficients, orthogonality values which are very close to
zero, and the net effects and their total, equal to the coefficient of multiple determination ˜ R2 = 0.63 (the F -statistic
is 977). These results are very close to the regular regression, but all have positive net effects that can be used for
comparison of the predictors by means of their contributions to the dependent variable.
To estimate the stability of the regressions, various sampling evaluations were performed. About 400 random
samples of 100 observations were used to evaluate coefficients of multiple determination for the OLS, Ridge-1, and
Table 3
Bank data: OLS and Ridge-2 regressions

Variable           Cor    OLS                           Ridge-2
                          Beta   t-Stat.  Net eff.      Beta   t-Stat.  Orthog.  Net eff.
x1  In Future      .53    .02     2.1      .01          .04      3.6     −.01     .02
x2  Recommend      .61    .10     8.4      .06          .11     10.0     −.00     .06
x3  Value          .71    .18    11.7      .13          .17     14.2     −.00     .12
x4  Loan           .72    .29    21.9      .22          .27     24.0      .01     .20
x5  Fit Needs      .60    .05     4.1      .03          .06      5.9     −.01     .04
x6  Feeling        .70    .22    16.3      .15          .21     18.4      .00     .14
x7  Timely         .32   −.01    −0.6     −.00          .00      0.0     −.01     .00
x8  Info           .60    .04     2.8      .02          .05      4.3     −.01     .03
x9  Communicate    .53   −.01    −0.5     −.00          .01      0.54    −.01     .00
x10 Problems       .57    .03     2.6      .02          .04      4.2     −.01     .02
R2                                         .64                                    .63
Table 4
Bank data: sampling for R2 and residual root mean square error
Statistic R2 RMSE
OLS k Ridge-1 q Ridge-2 OLS k Ridge-1 q Ridge-2
Min .44 .00 .40 1.00 .43 .86 .00 .86 1.00 .85
Max .84 4.55 .82 1.84 .82 .96 .48 .96 1.10 .96
Mean .67 .67 .64 1.14 .65 .90 .14 .90 1.03 .90
Std .08 .56 .08 0.11 .08 .02 .11 .02 0.02 .02
Ridge-2 models. Table 4, containing descriptive statistics (arranged similarly to Table 2), shows that the R2 results for the regular and ridge models are close; however, in particular data subsets the regular regressions have negative signs not only for x7 and x9 but also for some of the other predictors. At the same time, the ridge regressions all have positive coefficients corresponding to the chosen level of the parameter k. Also, half (about 2800) of the observations were randomly picked out for constructing models, and the other half of the data set was used for the forecasting of
the dependent variable by means of the models and for estimation of the residual root mean square error. Table 4
also presents the RMSE results obtained in 100 samplings using the OLS and ridge regressions. Again, Ridge-2 demonstrates high quality, achieving practically the same predictive power as the linear model while having interpretable coefficients and positive inputs from all the predictors.
6. Summary
One-parameter ridge regression uses a regularization term to reduce the influence of multicollinearity on the regression coefficients. The two-parameter ridge regression considered here uses an additional term for tuning the quality of the model to the highest possible level. In contrast to the one-parameter ridge model, the two-parameter ridge regression has such useful properties as orthogonality between the variables and the residual errors, less bias and a more efficient solution, and higher precision of fit, and it encompasses other helpful features of linear regression. The results of the two-parameter ridge regression are robust, not prone to multicollinearity, and easily interpretable. With increase of the ridge parameter the model eventually reaches an asymptotic level, with the beta coefficients of the multiple regression proportional to the coefficients of the paired regressions, which guarantees positive shares of the variables' individual roles in the model. The suggested approach should be useful for theoretical consideration and could serve numerous practical aims of regression modeling and analysis.
Acknowledgements
The author thanks the reviewers for their suggestions that improved the paper.