CONCEPTUAL


EXERCISE 1:

Part a)

Best subset selection will have the smallest training RSS: for a given model size, all three approaches pick the candidate that minimizes training RSS, and best subset considers every model that forward and backward stepwise selection consider (plus many more).
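
As a quick illustration (a sketch, not part of the exercise), best subset fits \(2^p\) candidate models for \(p\) predictors, while forward or backward stepwise selection each fit only \(1 + p(p+1)/2\):

p <- 10
c(best.subset = 2^p, stepwise = 1 + p*(p+1)/2)  # 1024 vs 56 models considered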

Part b)

The model with the smallest test RSS could come from any of the three approaches. Best subset can easily overfit when the number of predictors \(p\) is large relative to the number of observations \(n\). Forward and backward stepwise selection consider the same number of models but will not necessarily converge on the same one, and it is hard to say in advance which search would generalize better.

Part c)


EXERCISE 2:

Part a)

  iii. is TRUE - lasso puts a budget constraint on least squares (less flexible), so it improves prediction accuracy when its increase in bias is less than its decrease in variance

Part b)

  iii. is TRUE - ridge also puts a budget constraint on least squares (less flexible), so the same reasoning applies

Part c)

  ii. is TRUE - a non-linear model would be more flexible, with higher variance and lower bias, so it improves prediction accuracy when its decrease in bias is greater than its increase in variance

EXERCISE 3:

Part a)

  iv. is TRUE - as \(s\) is increased, there is less and less constraint on the model, so the training RSS steadily decreases (if \(s\) is increased to \(s'\), the best model using a budget of \(s\) is still available when using a budget of \(s'\))

Part b)

  ii. is TRUE - test error will improve (decrease) to a point and then worsen (increase) as the constraint loosens and the model overfits

Part c)

  iii. is TRUE - variance always increases with fewer constraints

Part d)

  iv. is TRUE - bias always decreases with more model flexibility

Part e)

  v. is TRUE - the irreducible error is a constant value, not related to model selection

EXERCISE 4:

This problem is similar to Exercise 3, but for ridge regression instead of the lasso, and parameterized by \(\lambda\) instead of \(s\). For each part, ridge and lasso behave the same directionally, except that increasing \(\lambda\) puts a heavier penalty on the coefficients, which is equivalent to shrinking the budget \(s\); so the answers here mirror those of Exercise 3 with the direction of the horizontal axis flipped.
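
For reference, the penalized form and the budget form of ridge regression are equivalent (restating the standard formulation for context):

\[\min_{\beta}\; \sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^p \beta_j^2 \qquad\Longleftrightarrow\qquad \min_{\beta}\; \sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\Big)^2 \;\text{ subject to }\; \sum_{j=1}^p \beta_j^2 \le s\]

with every \(\lambda \ge 0\) corresponding to some budget \(s\), and larger \(\lambda\) corresponding to smaller \(s\).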

Part a)

  iii. is TRUE - training error increases steadily as \(\lambda\) increases

Part b)

  ii. is TRUE - test error will decrease initially and then increase

Part c)

  iv. is TRUE - variance always decreases with more constraint

Part d)

  iii. is TRUE - bias always increases with less model flexibility

Part e)

  v. is TRUE - the irreducible error is a constant value, not related to model selection

EXERCISE 5:

Part a)

Ridge: minimize \((y_1 - \hat\beta_1x_{11} - \hat\beta_2x_{12})^2 + (y_2 - \hat\beta_1x_{21} - \hat\beta_2x_{22})^2 + \lambda (\hat\beta_1^2 + \hat\beta_2^2)\)

Part b)

Step 1: Expanding the equation from Part a:

\[(y_1 - \hat\beta_1 x_{11} - \hat\beta_2 x_{12})^2 + (y_2 - \hat\beta_1 x_{21} - \hat\beta_2 x_{22})^2 + \lambda (\hat\beta_1^2 + \hat\beta_2^2) \\ = (y_1^2 + \hat\beta_1^2 x_{11}^2 + \hat\beta_2^2 x_{12}^2 - 2 \hat\beta_1 x_{11} y_1 - 2 \hat\beta_2 x_{12} y_1 + 2 \hat\beta_1 \hat\beta_2 x_{11} x_{12}) \\ + (y_2^2 + \hat\beta_1^2 x_{21}^2 + \hat\beta_2^2 x_{22}^2 - 2 \hat\beta_1 x_{21} y_2 - 2 \hat\beta_2 x_{22} y_2 + 2 \hat\beta_1 \hat\beta_2 x_{21} x_{22}) \\ + \lambda \hat\beta_1^2 + \lambda \hat\beta_2^2\]

Step 2: Taking the partial derivative with respect to \(\hat\beta_1\) and setting it equal to 0 to minimize:

\[\frac{\partial }{\partial \hat\beta_1}: (2\hat\beta_1x_{11}^2-2x_{11}y_1+2\hat\beta_2x_{11}x_{12}) + (2\hat\beta_1x_{21}^2-2x_{21}y_2+2\hat\beta_2x_{21}x_{22}) + 2\lambda\hat\beta_1 = 0\]

Step 3: Setting \(x_{11}=x_{12}=x_1\) and \(x_{21}=x_{22}=x_2\) and dividing both sides of the equation by 2:

\[(\hat\beta_1x_1^2-x_1y_1+\hat\beta_2x_1^2) + (\hat\beta_1x_2^2-x_2y_2+\hat\beta_2x_2^2) + \lambda\hat\beta_1 = 0\]

\[\hat\beta_1 (x_1^2+x_2^2) + \hat\beta_2 (x_1^2+x_2^2) + \lambda\hat\beta_1 = x_1y_1 + x_2y_2\]

Step 4: Add \(2\hat\beta_1x_1x_2\) and \(2\hat\beta_2x_1x_2\) to both sides of the equation:

\[\hat\beta_1 (x_1^2 + x_2^2 + 2x_1x_2) + \hat\beta_2 (x_1^2 + x_2^2 + 2x_1x_2) + \lambda\hat\beta_1 = x_1y_1 + x_2y_2 + 2\hat\beta_1x_1x_2 + 2\hat\beta_2x_1x_2 \\ \hat\beta_1 (x_1 + x_2)^2 + \hat\beta_2 (x_1 + x_2)^2 + \lambda\hat\beta_1 = x_1y_1 + x_2y_2 + 2\hat\beta_1x_1x_2 + 2\hat\beta_2x_1x_2\]

Step 5: Because \(x_1+x_2=0\), we can eliminate the first two terms:

\[\lambda\hat\beta_1 = x_1y_1 + x_2y_2 + 2\hat\beta_1x_1x_2 + 2\hat\beta_2x_1x_2\]

Step 6: Similarly, taking the partial derivative with respect to \(\hat\beta_2\) gives:

\[\lambda\hat\beta_2 = x_1y_1 + x_2y_2 + 2\hat\beta_1x_1x_2 + 2\hat\beta_2x_1x_2\]

Step 7: The right-hand sides of the equations for \(\lambda\hat\beta_1\) and \(\lambda\hat\beta_2\) are identical, so (assuming \(\lambda > 0\)):

\[\lambda\hat\beta_1 = \lambda\hat\beta_2\]

\[\hat\beta_1 = \hat\beta_2\]
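
As a quick numeric sanity check (a sketch with made-up values satisfying \(x_{11}=x_{12}=x_1\), \(x_{21}=x_{22}=x_2\), \(x_1+x_2=0\) and \(y_1+y_2=0\); the object names are mine), minimizing the ridge criterion directly should return \(\hat\beta_1 \approx \hat\beta_2\):

x1 <- 2; x2 <- -2; y1 <- 3; y2 <- -3; lambda <- 1
ridge.obj <- function(b) (y1 - b[1]*x1 - b[2]*x1)^2 + (y2 - b[1]*x2 - b[2]*x2)^2 + lambda*sum(b^2)
optim(c(0, 0), ridge.obj)$par  # the two coordinates should be (approximately) equal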

Part c)

Lasso: minimize \((y_1 - \hat\beta_1x_{11} - \hat\beta_2x_{12})^2 + (y_2 - \hat\beta_1x_{21} - \hat\beta_2x_{22})^2 + \lambda (|\hat\beta_1| + |\hat\beta_2|)\)

Part d)

Replacing the ridge penalty from Part b with the lasso penalty, the derivative of the penalty term with respect to \(\hat\beta\) (for \(\hat\beta \neq 0\)) is:

\[\frac{\partial }{\partial \hat\beta} \left(\lambda |\hat\beta|\right) = \lambda\frac{|\hat\beta|}{\hat\beta}\]

Following through the steps in Part b, we get:

\[\lambda\frac{|\hat\beta_1|}{\hat\beta_1} = \lambda\frac{|\hat\beta_2|}{\hat\beta_2}\]

So the lasso only requires that \(\hat\beta_1\) and \(\hat\beta_2\) have the same sign (both positive or both negative, ignoring the possibility of 0). With equal signs, the criterion depends only on the sum \(\hat\beta_1 + \hat\beta_2\), so any way of splitting that sum between the two coefficients does equally well and the lasso solution is not unique.
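
A quick numeric check of the non-uniqueness (same made-up values as the ridge sketch above): when both coefficients are non-negative, the lasso criterion depends only on \(\hat\beta_1 + \hat\beta_2\), so different splits of the same sum give identical values:

x1 <- 2; x2 <- -2; y1 <- 3; y2 <- -3; lambda <- 1
lasso.obj <- function(b) (y1 - b[1]*x1 - b[2]*x1)^2 + (y2 - b[1]*x2 - b[2]*x2)^2 + lambda*sum(abs(b))
sapply(list(c(0.5, 0.5), c(0.2, 0.8), c(1, 0)), lasso.obj)  # all three values are identical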


EXERCISE 6:

Part a)

betas <- seq(-10,10,0.1)
eq.ridge <- function(beta, y=7, lambda=10) (y-beta)^2 + lambda*beta^2
plot(betas, eq.ridge(betas), xlab="beta", main="Ridge Regression Optimization", pch=1)
points(7/(1+10), eq.ridge(7/(1+10)), pch=16, col="red", cex=2)

For \(y=7\) and \(\lambda=10\), \(\hat\beta=\frac{7}{1+10}\) minimizes the ridge regression equation
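
This follows from setting the derivative of the single-observation ridge criterion to zero:

\[\frac{d}{d\hat\beta}\left[(y-\hat\beta)^2 + \lambda\hat\beta^2\right] = -2(y-\hat\beta) + 2\lambda\hat\beta = 0 \quad\Longrightarrow\quad \hat\beta = \frac{y}{1+\lambda} = \frac{7}{1+10}\]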

Part b)

betas <- seq(-10,10,0.1)
eq.lasso <- function(beta, y=7, lambda=10) (y-beta)^2 + lambda*abs(beta)
plot(betas, eq.lasso(betas), xlab="beta", main="Lasso Regression Optimization", pch=1)
points(7-10/2, eq.lasso(7-10/2), pch=16, col="red", cex=2)

For \(y=7\) and \(\lambda=10\), \(\hat\beta=7-\frac{10}{2}\) minimizes the lasso regression equation
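
Similarly, for \(\hat\beta > 0\) the derivative of the lasso criterion is \(-2(y-\hat\beta) + \lambda\), so

\[\hat\beta = y - \frac{\lambda}{2} = 7 - \frac{10}{2} = 2\]

which is the relevant case here since \(y > \lambda/2\) (the solution would be \(y + \lambda/2\) if \(y < -\lambda/2\), and 0 otherwise).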


EXERCISE 7:

Part a)

[… will come back to this. maybe.]

Part b)

[… will come back to this. maybe.]

Part c)

[… will come back to this. maybe.]

Part d)

[… will come back to this. maybe.]

Part e)

[… will come back to this. maybe.]


APPLIED


EXERCISE 8:

Part a)

set.seed(1)
X <- rnorm(100)
eps <- rnorm(100)

Part b)

Y <- 5 + 10*X - 3*X^2 + 2*X^3 + eps
plot(X,Y)

Part c)

require(leaps)
regfit.full <- regsubsets(Y~poly(X,10,raw=T), data=data.frame(Y,X), nvmax=10)
(reg.summary <- summary(regfit.full))
## Subset selection object
## Call: regsubsets.formula(Y ~ poly(X, 10, raw = T), data = data.frame(Y, 
##     X), nvmax = 10)
## 10 Variables  (and intercept)
##                        Forced in Forced out
## poly(X, 10, raw = T)1      FALSE      FALSE
## poly(X, 10, raw = T)2      FALSE      FALSE
## poly(X, 10, raw = T)3      FALSE      FALSE
## poly(X, 10, raw = T)4      FALSE      FALSE
## poly(X, 10, raw = T)5      FALSE      FALSE
## poly(X, 10, raw = T)6      FALSE      FALSE
## poly(X, 10, raw = T)7      FALSE      FALSE
## poly(X, 10, raw = T)8      FALSE      FALSE
## poly(X, 10, raw = T)9      FALSE      FALSE
## poly(X, 10, raw = T)10     FALSE      FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: exhaustive
##           poly(X, 10, raw = T)1 poly(X, 10, raw = T)2
## 1  ( 1 )  "*"                   " "                  
## 2  ( 1 )  "*"                   "*"                  
## 3  ( 1 )  "*"                   "*"                  
## 4  ( 1 )  "*"                   "*"                  
## 5  ( 1 )  "*"                   "*"                  
## 6  ( 1 )  "*"                   "*"                  
## 7  ( 1 )  "*"                   "*"                  
## 8  ( 1 )  "*"                   "*"                  
## 9  ( 1 )  "*"                   "*"                  
## 10  ( 1 ) "*"                   "*"                  
##           poly(X, 10, raw = T)3 poly(X, 10, raw = T)4
## 1  ( 1 )  " "                   " "                  
## 2  ( 1 )  " "                   " "                  
## 3  ( 1 )  "*"                   " "                  
## 4  ( 1 )  "*"                   " "                  
## 5  ( 1 )  "*"                   " "                  
## 6  ( 1 )  "*"                   " "                  
## 7  ( 1 )  "*"                   " "                  
## 8  ( 1 )  "*"                   "*"                  
## 9  ( 1 )  "*"                   "*"                  
## 10  ( 1 ) "*"                   "*"                  
##           poly(X, 10, raw = T)5 poly(X, 10, raw = T)6
## 1  ( 1 )  " "                   " "                  
## 2  ( 1 )  " "                   " "                  
## 3  ( 1 )  " "                   " "                  
## 4  ( 1 )  "*"                   " "                  
## 5  ( 1 )  "*"                   "*"                  
## 6  ( 1 )  " "                   " "                  
## 7  ( 1 )  "*"                   "*"                  
## 8  ( 1 )  " "                   "*"                  
## 9  ( 1 )  "*"                   "*"                  
## 10  ( 1 ) "*"                   "*"                  
##           poly(X, 10, raw = T)7 poly(X, 10, raw = T)8
## 1  ( 1 )  " "                   " "                  
## 2  ( 1 )  " "                   " "                  
## 3  ( 1 )  " "                   " "                  
## 4  ( 1 )  " "                   " "                  
## 5  ( 1 )  " "                   " "                  
## 6  ( 1 )  "*"                   "*"                  
## 7  ( 1 )  " "                   "*"                  
## 8  ( 1 )  " "                   "*"                  
## 9  ( 1 )  " "                   "*"                  
## 10  ( 1 ) "*"                   "*"                  
##           poly(X, 10, raw = T)9 poly(X, 10, raw = T)10
## 1  ( 1 )  " "                   " "                   
## 2  ( 1 )  " "                   " "                   
## 3  ( 1 )  " "                   " "                   
## 4  ( 1 )  " "                   " "                   
## 5  ( 1 )  " "                   " "                   
## 6  ( 1 )  "*"                   " "                   
## 7  ( 1 )  " "                   "*"                   
## 8  ( 1 )  "*"                   "*"                   
## 9  ( 1 )  "*"                   "*"                   
## 10  ( 1 ) "*"                   "*"
par(mfrow=c(3,1))
min.cp <- which.min(reg.summary$cp)  
plot(reg.summary$cp, xlab="Number of Poly(X)", ylab="Best Subset Cp", type="l")
points(min.cp, reg.summary$cp[min.cp], col="red", pch=4, lwd=5)
min.bic <- which.min(reg.summary$bic)  
plot(reg.summary$bic, xlab="Number of Poly(X)", ylab="Best Subset BIC", type="l")
points(min.bic, reg.summary$bic[min.bic], col="red", pch=4, lwd=5)
min.adjr2 <- which.max(reg.summary$adjr2)  
plot(reg.summary$adjr2, xlab="Number of Poly(X)", ylab="Best Subset Adjusted R^2", type="l")
points(min.adjr2, reg.summary$adjr2[min.adjr2], col="red", pch=4, lwd=5)

coef(regfit.full, min.cp)
##           (Intercept) poly(X, 10, raw = T)1 poly(X, 10, raw = T)2 
##            5.07200775           10.38745596           -3.15424359 
## poly(X, 10, raw = T)3 poly(X, 10, raw = T)5 
##            1.55797426            0.08072292
coef(regfit.full, min.bic)
##           (Intercept) poly(X, 10, raw = T)1 poly(X, 10, raw = T)2 
##              5.061507              9.975280             -3.123791 
## poly(X, 10, raw = T)3 
##              2.017639
coef(regfit.full, min.adjr2)
##           (Intercept) poly(X, 10, raw = T)1 poly(X, 10, raw = T)2 
##            5.07200775           10.38745596           -3.15424359 
## poly(X, 10, raw = T)3 poly(X, 10, raw = T)5 
##            1.55797426            0.08072292

Part d)

# forward selection
regfit.fwd <- regsubsets(Y~poly(X,10,raw=T), data=data.frame(Y,X), nvmax=10, method="forward")
(fwd.summary <- summary(regfit.fwd))
## Subset selection object
## Call: regsubsets.formula(Y ~ poly(X, 10, raw = T), data = data.frame(Y, 
##     X), nvmax = 10)
## 10 Variables  (and intercept)
##                        Forced in Forced out
## poly(X, 10, raw = T)1      FALSE      FALSE
## poly(X, 10, raw = T)2      FALSE      FALSE
## poly(X, 10, raw = T)3      FALSE      FALSE
## poly(X, 10, raw = T)4      FALSE      FALSE
## poly(X, 10, raw = T)5      FALSE      FALSE
## poly(X, 10, raw = T)6      FALSE      FALSE
## poly(X, 10, raw = T)7      FALSE      FALSE
## poly(X, 10, raw = T)8      FALSE      FALSE
## poly(X, 10, raw = T)9      FALSE      FALSE
## poly(X, 10, raw = T)10     FALSE      FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: exhaustive
##           poly(X, 10, raw = T)1 poly(X, 10, raw = T)2
## 1  ( 1 )  "*"                   " "                  
## 2  ( 1 )  "*"                   "*"                  
## 3  ( 1 )  "*"                   "*"                  
## 4  ( 1 )  "*"                   "*"                  
## 5  ( 1 )  "*"                   "*"                  
## 6  ( 1 )  "*"                   "*"                  
## 7  ( 1 )  "*"                   "*"                  
## 8  ( 1 )  "*"                   "*"                  
## 9  ( 1 )  "*"                   "*"                  
## 10  ( 1 ) "*"                   "*"                  
##           poly(X, 10, raw = T)3 poly(X, 10, raw = T)4
## 1  ( 1 )  " "                   " "                  
## 2  ( 1 )  " "                   " "                  
## 3  ( 1 )  "*"                   " "                  
## 4  ( 1 )  "*"                   " "                  
## 5  ( 1 )  "*"                   " "                  
## 6  ( 1 )  "*"                   " "                  
## 7  ( 1 )  "*"                   " "                  
## 8  ( 1 )  "*"                   "*"                  
## 9  ( 1 )  "*"                   "*"                  
## 10  ( 1 ) "*"                   "*"                  
##           poly(X, 10, raw = T)5 poly(X, 10, raw = T)6
## 1  ( 1 )  " "                   " "                  
## 2  ( 1 )  " "                   " "                  
## 3  ( 1 )  " "                   " "                  
## 4  ( 1 )  "*"                   " "                  
## 5  ( 1 )  "*"                   "*"                  
## 6  ( 1 )  " "                   " "                  
## 7  ( 1 )  "*"                   "*"                  
## 8  ( 1 )  " "                   "*"                  
## 9  ( 1 )  "*"                   "*"                  
## 10  ( 1 ) "*"                   "*"                  
##           poly(X, 10, raw = T)7 poly(X, 10, raw = T)8
## 1  ( 1 )  " "                   " "                  
## 2  ( 1 )  " "                   " "                  
## 3  ( 1 )  " "                   " "                  
## 4  ( 1 )  " "                   " "                  
## 5  ( 1 )  " "                   " "                  
## 6  ( 1 )  "*"                   "*"                  
## 7  ( 1 )  " "                   "*"                  
## 8  ( 1 )  " "                   "*"                  
## 9  ( 1 )  " "                   "*"                  
## 10  ( 1 ) "*"                   "*"                  
##           poly(X, 10, raw = T)9 poly(X, 10, raw = T)10
## 1  ( 1 )  " "                   " "                   
## 2  ( 1 )  " "                   " "                   
## 3  ( 1 )  " "                   " "                   
## 4  ( 1 )  " "                   " "                   
## 5  ( 1 )  " "                   " "                   
## 6  ( 1 )  "*"                   " "                   
## 7  ( 1 )  " "                   "*"                   
## 8  ( 1 )  "*"                   "*"                   
## 9  ( 1 )  "*"                   "*"                   
## 10  ( 1 ) "*"                   "*"
# backward selection
regfit.bwd <- regsubsets(Y~poly(X,10,raw=T), data=data.frame(Y,X), nvmax=10, method="backward")
(bwd.summary <- summary(regfit.bwd))
## Subset selection object
## Call: regsubsets.formula(Y ~ poly(X, 10, raw = T), data = data.frame(Y, 
##     X), nvmax = 10)
## 10 Variables  (and intercept)
##                        Forced in Forced out
## poly(X, 10, raw = T)1      FALSE      FALSE
## poly(X, 10, raw = T)2      FALSE      FALSE
## poly(X, 10, raw = T)3      FALSE      FALSE
## poly(X, 10, raw = T)4      FALSE      FALSE
## poly(X, 10, raw = T)5      FALSE      FALSE
## poly(X, 10, raw = T)6      FALSE      FALSE
## poly(X, 10, raw = T)7      FALSE      FALSE
## poly(X, 10, raw = T)8      FALSE      FALSE
## poly(X, 10, raw = T)9      FALSE      FALSE
## poly(X, 10, raw = T)10     FALSE      FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: exhaustive
##           poly(X, 10, raw = T)1 poly(X, 10, raw = T)2
## 1  ( 1 )  "*"                   " "                  
## 2  ( 1 )  "*"                   "*"                  
## 3  ( 1 )  "*"                   "*"                  
## 4  ( 1 )  "*"                   "*"                  
## 5  ( 1 )  "*"                   "*"                  
## 6  ( 1 )  "*"                   "*"                  
## 7  ( 1 )  "*"                   "*"                  
## 8  ( 1 )  "*"                   "*"                  
## 9  ( 1 )  "*"                   "*"                  
## 10  ( 1 ) "*"                   "*"                  
##           poly(X, 10, raw = T)3 poly(X, 10, raw = T)4
## 1  ( 1 )  " "                   " "                  
## 2  ( 1 )  " "                   " "                  
## 3  ( 1 )  "*"                   " "                  
## 4  ( 1 )  "*"                   " "                  
## 5  ( 1 )  "*"                   " "                  
## 6  ( 1 )  "*"                   " "                  
## 7  ( 1 )  "*"                   " "                  
## 8  ( 1 )  "*"                   "*"                  
## 9  ( 1 )  "*"                   "*"                  
## 10  ( 1 ) "*"                   "*"                  
##           poly(X, 10, raw = T)5 poly(X, 10, raw = T)6
## 1  ( 1 )  " "                   " "                  
## 2  ( 1 )  " "                   " "                  
## 3  ( 1 )  " "                   " "                  
## 4  ( 1 )  "*"                   " "                  
## 5  ( 1 )  "*"                   "*"                  
## 6  ( 1 )  " "                   " "                  
## 7  ( 1 )  "*"                   "*"                  
## 8  ( 1 )  " "                   "*"                  
## 9  ( 1 )  "*"                   "*"                  
## 10  ( 1 ) "*"                   "*"                  
##           poly(X, 10, raw = T)7 poly(X, 10, raw = T)8
## 1  ( 1 )  " "                   " "                  
## 2  ( 1 )  " "                   " "                  
## 3  ( 1 )  " "                   " "                  
## 4  ( 1 )  " "                   " "                  
## 5  ( 1 )  " "                   " "                  
## 6  ( 1 )  "*"                   "*"                  
## 7  ( 1 )  " "                   "*"                  
## 8  ( 1 )  " "                   "*"                  
## 9  ( 1 )  " "                   "*"                  
## 10  ( 1 ) "*"                   "*"                  
##           poly(X, 10, raw = T)9 poly(X, 10, raw = T)10
## 1  ( 1 )  " "                   " "                   
## 2  ( 1 )  " "                   " "                   
## 3  ( 1 )  " "                   " "                   
## 4  ( 1 )  " "                   " "                   
## 5  ( 1 )  " "                   " "                   
## 6  ( 1 )  "*"                   " "                   
## 7  ( 1 )  " "                   "*"                   
## 8  ( 1 )  "*"                   "*"                   
## 9  ( 1 )  "*"                   "*"                   
## 10  ( 1 ) "*"                   "*"
par(mfrow=c(3,2))

min.cp <- which.min(fwd.summary$cp)  
plot(fwd.summary$cp, xlab="Number of Poly(X)", ylab="Forward Selection Cp", type="l")
points(min.cp, fwd.summary$cp[min.cp], col="red", pch=4, lwd=5)

min.cp <- which.min(bwd.summary$cp)  
plot(bwd.summary$cp, xlab="Number of Poly(X)", ylab="Backward Selection Cp", type="l")
points(min.cp, bwd.summary$cp[min.cp], col="red", pch=4, lwd=5)

min.bic <- which.min(fwd.summary$bic)  
plot(fwd.summary$bic, xlab="Number of Poly(X)", ylab="Forward Selection BIC", type="l")
points(min.bic, fwd.summary$bic[min.bic], col="red", pch=4, lwd=5)

min.bic <- which.min(bwd.summary$bic)  
plot(bwd.summary$bic, xlab="Number of Poly(X)", ylab="Backward Selection BIC", type="l")
points(min.bic, bwd.summary$bic[min.bic], col="red", pch=4, lwd=5)

min.adjr2 <- which.max(fwd.summary$adjr2)  
plot(fwd.summary$adjr2, xlab="Number of Poly(X)", ylab="Forward Selection Adjusted R^2", type="l")
points(min.adjr2, fwd.summary$adjr2[min.adjr2], col="red", pch=4, lwd=5)

min.adjr2 <- which.max(bwd.summary$adjr2)  
plot(bwd.summary$adjr2, xlab="Number of Poly(X)", ylab="Backward Selection Adjusted R^2", type="l")
points(min.adjr2, bwd.summary$adjr2[min.adjr2], col="red", pch=4, lwd=5)

# coefficients of selected models
coef(regfit.fwd, which.min(fwd.summary$cp))
##           (Intercept) poly(X, 10, raw = T)1 poly(X, 10, raw = T)2 
##            5.07200775           10.38745596           -3.15424359 
## poly(X, 10, raw = T)3 poly(X, 10, raw = T)5 
##            1.55797426            0.08072292
coef(regfit.bwd, which.min(bwd.summary$cp))
##           (Intercept) poly(X, 10, raw = T)1 poly(X, 10, raw = T)2 
##            5.07200775           10.38745596           -3.15424359 
## poly(X, 10, raw = T)3 poly(X, 10, raw = T)5 
##            1.55797426            0.08072292
coef(regfit.fwd, which.min(fwd.summary$bic))
##           (Intercept) poly(X, 10, raw = T)1 poly(X, 10, raw = T)2 
##              5.061507              9.975280             -3.123791 
## poly(X, 10, raw = T)3 
##              2.017639
coef(regfit.bwd, which.min(bwd.summary$bic))
##           (Intercept) poly(X, 10, raw = T)1 poly(X, 10, raw = T)2 
##              5.061507              9.975280             -3.123791 
## poly(X, 10, raw = T)3 
##              2.017639
coef(regfit.fwd, which.max(fwd.summary$adjr2))
##           (Intercept) poly(X, 10, raw = T)1 poly(X, 10, raw = T)2 
##            5.07200775           10.38745596           -3.15424359 
## poly(X, 10, raw = T)3 poly(X, 10, raw = T)5 
##            1.55797426            0.08072292
coef(regfit.bwd, which.max(bwd.summary$adjr2))
##           (Intercept) poly(X, 10, raw = T)1 poly(X, 10, raw = T)2 
##            5.07200775           10.38745596           -3.15424359 
## poly(X, 10, raw = T)3 poly(X, 10, raw = T)5 
##            1.55797426            0.08072292

Best subset, forward selection and backward selection all arrive at the same best models for this data

Part e)

require(glmnet)
xmat <- model.matrix(Y~poly(X,10,raw=T))[,-1]
lasso.mod <- cv.glmnet(xmat, Y, alpha=1)
(lambda <- lasso.mod$lambda.min)
## [1] 0.06392877
par(mfrow=c(1,1))
plot(lasso.mod)

predict(lasso.mod, s=lambda, type="coefficients")
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                                    1
## (Intercept)             5.0197864204
## poly(X, 10, raw = T)1  10.1772066700
## poly(X, 10, raw = T)2  -3.0727099518
## poly(X, 10, raw = T)3   1.7336850893
## poly(X, 10, raw = T)4   .           
## poly(X, 10, raw = T)5   0.0451934932
## poly(X, 10, raw = T)6   .           
## poly(X, 10, raw = T)7   0.0002654849
## poly(X, 10, raw = T)8   .           
## poly(X, 10, raw = T)9   .           
## poly(X, 10, raw = T)10  .

Lasso regression picks out the true predictors \(X\), \(X^2\) and \(X^3\) with coefficients close to the true values, though it also retains very small coefficients on \(X^5\) and \(X^7\)

Part f)

Y2 <- 5 + 1.5*X^7 + eps

# best subset model selection
regfit.full <- regsubsets(Y2~poly(X,10,raw=T), data=data.frame(Y,X), nvmax=10)
par(mfrow=c(3,1))
(reg.summary <- summary(regfit.full))
## Subset selection object
## Call: regsubsets.formula(Y2 ~ poly(X, 10, raw = T), data = data.frame(Y, 
##     X), nvmax = 10)
## 10 Variables  (and intercept)
##                        Forced in Forced out
## poly(X, 10, raw = T)1      FALSE      FALSE
## poly(X, 10, raw = T)2      FALSE      FALSE
## poly(X, 10, raw = T)3      FALSE      FALSE
## poly(X, 10, raw = T)4      FALSE      FALSE
## poly(X, 10, raw = T)5      FALSE      FALSE
## poly(X, 10, raw = T)6      FALSE      FALSE
## poly(X, 10, raw = T)7      FALSE      FALSE
## poly(X, 10, raw = T)8      FALSE      FALSE
## poly(X, 10, raw = T)9      FALSE      FALSE
## poly(X, 10, raw = T)10     FALSE      FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: exhaustive
##           poly(X, 10, raw = T)1 poly(X, 10, raw = T)2
## 1  ( 1 )  " "                   " "                  
## 2  ( 1 )  " "                   "*"                  
## 3  ( 1 )  " "                   "*"                  
## 4  ( 1 )  "*"                   "*"                  
## 5  ( 1 )  "*"                   "*"                  
## 6  ( 1 )  "*"                   " "                  
## 7  ( 1 )  "*"                   " "                  
## 8  ( 1 )  "*"                   "*"                  
## 9  ( 1 )  "*"                   "*"                  
## 10  ( 1 ) "*"                   "*"                  
##           poly(X, 10, raw = T)3 poly(X, 10, raw = T)4
## 1  ( 1 )  " "                   " "                  
## 2  ( 1 )  " "                   " "                  
## 3  ( 1 )  " "                   " "                  
## 4  ( 1 )  "*"                   " "                  
## 5  ( 1 )  "*"                   "*"                  
## 6  ( 1 )  "*"                   " "                  
## 7  ( 1 )  "*"                   " "                  
## 8  ( 1 )  "*"                   "*"                  
## 9  ( 1 )  "*"                   "*"                  
## 10  ( 1 ) "*"                   "*"                  
##           poly(X, 10, raw = T)5 poly(X, 10, raw = T)6
## 1  ( 1 )  " "                   " "                  
## 2  ( 1 )  " "                   " "                  
## 3  ( 1 )  "*"                   " "                  
## 4  ( 1 )  " "                   " "                  
## 5  ( 1 )  " "                   " "                  
## 6  ( 1 )  " "                   "*"                  
## 7  ( 1 )  "*"                   "*"                  
## 8  ( 1 )  " "                   "*"                  
## 9  ( 1 )  " "                   "*"                  
## 10  ( 1 ) "*"                   "*"                  
##           poly(X, 10, raw = T)7 poly(X, 10, raw = T)8
## 1  ( 1 )  "*"                   " "                  
## 2  ( 1 )  "*"                   " "                  
## 3  ( 1 )  "*"                   " "                  
## 4  ( 1 )  "*"                   " "                  
## 5  ( 1 )  "*"                   " "                  
## 6  ( 1 )  "*"                   "*"                  
## 7  ( 1 )  "*"                   "*"                  
## 8  ( 1 )  "*"                   "*"                  
## 9  ( 1 )  "*"                   "*"                  
## 10  ( 1 ) "*"                   "*"                  
##           poly(X, 10, raw = T)9 poly(X, 10, raw = T)10
## 1  ( 1 )  " "                   " "                   
## 2  ( 1 )  " "                   " "                   
## 3  ( 1 )  " "                   " "                   
## 4  ( 1 )  " "                   " "                   
## 5  ( 1 )  " "                   " "                   
## 6  ( 1 )  " "                   "*"                   
## 7  ( 1 )  " "                   "*"                   
## 8  ( 1 )  " "                   "*"                   
## 9  ( 1 )  "*"                   "*"                   
## 10  ( 1 ) "*"                   "*"
min.cp <- which.min(reg.summary$cp)  
plot(reg.summary$cp, xlab="Number of Poly(X)", ylab="Best Subset Cp", type="l")
points(min.cp, reg.summary$cp[min.cp], col="red", pch=4, lwd=5)
min.bic <- which.min(reg.summary$bic)  
plot(reg.summary$bic, xlab="Number of Poly(X)", ylab="Best Subset BIC", type="l")
points(min.bic, reg.summary$bic[min.bic], col="red", pch=4, lwd=5)
min.adjr2 <- which.max(reg.summary$adjr2)  
plot(reg.summary$adjr2, xlab="Number of Poly(X)", ylab="Best Subset Adjusted R^2", type="l")
points(min.adjr2, reg.summary$adjr2[min.adjr2], col="red", pch=4, lwd=5)

coef(regfit.full, min.cp)
##           (Intercept) poly(X, 10, raw = T)2 poly(X, 10, raw = T)7 
##             5.0704904            -0.1417084             1.5015552
coef(regfit.full, min.bic)
##           (Intercept) poly(X, 10, raw = T)7 
##               4.95894               1.50077
coef(regfit.full, min.adjr2)
##           (Intercept) poly(X, 10, raw = T)1 poly(X, 10, raw = T)2 
##             5.0762524             0.2914016            -0.1617671 
## poly(X, 10, raw = T)3 poly(X, 10, raw = T)7 
##            -0.2526527             1.5091338
# lasso regression 
xmat <- model.matrix(Y2~poly(X,10,raw=T))[,-1]
lasso.mod <- cv.glmnet(xmat, Y2, alpha=1)
(lambda <- lasso.mod$lambda.min)
## [1] 2.910056
par(mfrow=c(1,1))
plot(lasso.mod)

predict(lasso.mod, s=lambda, type="coefficients")
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                               1
## (Intercept)            5.161575
## poly(X, 10, raw = T)1  .       
## poly(X, 10, raw = T)2  .       
## poly(X, 10, raw = T)3  .       
## poly(X, 10, raw = T)4  .       
## poly(X, 10, raw = T)5  .       
## poly(X, 10, raw = T)6  .       
## poly(X, 10, raw = T)7  1.452757
## poly(X, 10, raw = T)8  .       
## poly(X, 10, raw = T)9  .       
## poly(X, 10, raw = T)10 .

Lasso correctly selects only \(X^7\), while the best subset criteria (Cp, BIC and adjusted \(R^2\)) disagree with one another, suggesting models with 1 to 4 predictors


EXERCISE 9:

Part a)

require(ISLR)
data(College)
set.seed(1)
trainid <- sample(1:nrow(College), nrow(College)/2)
train <- College[trainid,]
test <- College[-trainid,]
str(College)
## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : num  1660 2186 1428 417 193 ...
##  $ Accept     : num  1232 1924 1097 349 146 ...
##  $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : num  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : num  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: num  2885 2683 1036 510 249 ...
##  $ P.Undergrad: num  537 1227 99 63 869 ...
##  $ Outstate   : num  7440 12280 11250 12960 7560 ...
##  $ Room.Board : num  3300 6450 3750 5450 4120 ...
##  $ Books      : num  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : num  2200 1500 1165 875 1500 ...
##  $ PhD        : num  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : num  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: num  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : num  7041 10527 8735 19016 10922 ...
##  $ Grad.Rate  : num  60 56 54 59 15 55 63 73 80 52 ...

Part b)

fit.lm <- lm(Apps~., data=train)
pred.lm <- predict(fit.lm, test)
(err.lm <- mean((test$Apps - pred.lm)^2))  # test error
## [1] 1108531

Part c)

require(glmnet)
xmat.train <- model.matrix(Apps~., data=train)[,-1]
xmat.test <- model.matrix(Apps~., data=test)[,-1]
fit.ridge <- cv.glmnet(xmat.train, train$Apps, alpha=0)
(lambda <- fit.ridge$lambda.min)  # optimal lambda
## [1] 450.7435
pred.ridge <- predict(fit.ridge, s=lambda, newx=xmat.test)
(err.ridge <- mean((test$Apps - pred.ridge)^2))  # test error
## [1] 1037616

Part d)

require(glmnet)
xmat.train <- model.matrix(Apps~., data=train)[,-1]
xmat.test <- model.matrix(Apps~., data=test)[,-1]
fit.lasso <- cv.glmnet(xmat.train, train$Apps, alpha=1)
(lambda <- fit.lasso$lambda.min)  # optimal lambda
## [1] 29.65591
pred.lasso <- predict(fit.lasso, s=lambda, newx=xmat.test)
(err.lasso <- mean((test$Apps - pred.lasso)^2))  # test error
## [1] 1025248
coef.lasso <- predict(fit.lasso, type="coefficients", s=lambda)[1:ncol(College),]
coef.lasso[coef.lasso != 0]
##   (Intercept)    PrivateYes        Accept        Enroll     Top10perc 
## -4.514223e+02 -4.814352e+02  1.535012e+00 -4.035455e-01  4.668170e+01 
##     Top25perc   F.Undergrad      Outstate    Room.Board           PhD 
## -7.091364e+00 -3.426988e-03 -5.020064e-02  1.851610e-01 -3.849885e+00 
##      Terminal   perc.alumni        Expend     Grad.Rate 
## -3.443747e+00 -2.117870e+00  3.192846e-02  2.695730e+00
length(coef.lasso[coef.lasso != 0])
## [1] 14

Part e)

require(pls)
set.seed(1)
fit.pcr <- pcr(Apps~., data=train, scale=TRUE, validation="CV")
validationplot(fit.pcr, val.type="MSEP")

summary(fit.pcr)
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4335     4179     2364     2374     1996     1844     1845
## adjCV         4335     4182     2360     2374     1788     1831     1838
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1850     1863     1809      1809      1812      1815      1825
## adjCV     1844     1857     1801      1800      1804      1808      1817
##        14 comps  15 comps  16 comps  17 comps
## CV         1810      1823      1273      1281
## adjCV      1806      1789      1260      1268
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X      31.216    57.68    64.73    70.55    76.33    81.30    85.01
## Apps    6.976    71.47    71.58    83.32    83.44    83.45    83.46
##       8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X       88.40    91.16     93.36     95.38     96.94     97.96     98.76
## Apps    83.47    84.53     84.86     84.98     84.98     84.99     85.24
##       15 comps  16 comps  17 comps
## X        99.40     99.87    100.00
## Apps     90.87     93.93     93.97
pred.pcr <- predict(fit.pcr, test, ncomp=16)  # min CV error at M=16
(err.pcr <- mean((test$Apps - pred.pcr)^2))  # test error
## [1] 1166897

Part f)

require(pls)
set.seed(1)
fit.pls <- plsr(Apps~., data=train, scale=TRUE, validation="CV")
validationplot(fit.pls, val.type="MSEP")

summary(fit.pls)
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4335     2176     1893     1725     1613     1406     1312
## adjCV         4335     2171     1884     1715     1578     1375     1295
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1297     1285     1280      1278      1279      1282      1281
## adjCV     1281     1271     1267      1265      1266      1269      1268
##        14 comps  15 comps  16 comps  17 comps
## CV         1281      1281      1281      1281
## adjCV      1267      1267      1268      1268
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X       26.91    43.08    63.26    65.16    68.50    73.75    76.10
## Apps    76.64    83.93    87.14    91.90    93.49    93.85    93.91
##       8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X       79.03    81.76     85.41     89.03     91.38     93.31     95.43
## Apps    93.94    93.96     93.96     93.96     93.97     93.97     93.97
##       15 comps  16 comps  17 comps
## X        97.41     98.78    100.00
## Apps     93.97     93.97     93.97
pred.pls <- predict(fit.pls, test, ncomp=10)  # min CV error at M=10
(err.pls <- mean((test$Apps - pred.pls)^2))  # test error
## [1] 1134531

Part g)

err.all <- c(err.lm, err.ridge, err.lasso, err.pcr, err.pls)
names(err.all) <- c("lm", "ridge", "lasso", "pcr", "pls")
barplot(err.all)

The test errors aren’t much different. The ridge and lasso seem to perform slightly better while the PCR/PLS don’t show any improvement from the full linear regression model.

plot(test$Apps, pred.lm)


EXERCISE 10:

Part a)

set.seed(1)
eps <- rnorm(1000)
xmat <- matrix(rnorm(1000*20), ncol=20)
betas <- sample(-5:5, 20, replace=TRUE)
betas[c(3,6,7,10,13,17)] <- 0
betas
##  [1]  2  3  0  5  1  0  0 -3  3  0  4  3  0 -4 -4 -2  0  0 -5  2
y <- xmat %*% betas + eps

Part b)

set.seed(1)
trainid <- sample(1:1000, 100, replace=FALSE)
xmat.train <- xmat[trainid,]
xmat.test <- xmat[-trainid,]
y.train <- y[trainid,]
y.test <- y[-trainid,]
train <- data.frame(y=y.train, xmat.train)
test <- data.frame(y=y.test, xmat.test)

Part c)

require(leaps)

# predict function from chapter 6 labs
predict.regsubsets <- function(object, newdata, id, ...){
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id=id)
  xvars <- names(coefi)
  mat[,xvars]%*%coefi
}

regfit.full <- regsubsets(y~., data=train, nvmax=20)
err.full <- rep(NA, 20)
for(i in 1:20) {
  pred.full <- predict(regfit.full, train, id=i)
  err.full[i] <- mean((train$y - pred.full)^2)
}
plot(1:20, err.full, type="b", main="Training MSE", xlab="Number of Predictors")

which.min(err.full)  # min for train error should be at max pred count
## [1] 20

Part d)

err.full <- rep(NA, 20)
for(i in 1:20) {
  pred.full <- predict(regfit.full, test, id=i)
  err.full[i] <- mean((test$y - pred.full)^2)
}
plot(1:20, err.full, type="b", main="Test MSE", xlab="Number of Predictors")

Part e)

which.min(err.full)  # optimal number of predictors from best subset
## [1] 13

Part f)

(coef.best <- coef(regfit.full, id=which.min(err.full)))
## (Intercept)          X1          X2          X4          X5          X8 
##  -0.1732881   2.0010181   3.0865262   4.9303510   0.9812254  -2.9307495 
##          X9         X11         X12         X14         X15         X16 
##   3.0238521   3.8272702   3.0607539  -3.9277547  -4.0459157  -1.7658311 
##         X19         X20 
##  -4.7888341   2.2134924
betas[betas != 0]
##  [1]  2  3  5  1 -3  3  4  3 -4 -4 -2 -5  2
names(betas) <- paste0("X", 1:20)
merge(data.frame(beta=names(betas),betas), data.frame(beta=names(coef.best),coef.best), all.x=T, sort=F)
##    beta betas  coef.best
## 1    X1     2  2.0010181
## 2    X2     3  3.0865262
## 3    X4     5  4.9303510
## 4    X5     1  0.9812254
## 5    X8    -3 -2.9307495
## 6    X9     3  3.0238521
## 7   X11     4  3.8272702
## 8   X12     3  3.0607539
## 9   X14    -4 -3.9277547
## 10  X15    -4 -4.0459157
## 11  X16    -2 -1.7658311
## 12  X19    -5 -4.7888341
## 13  X20     2  2.2134924
## 14  X13     0         NA
## 15   X6     0         NA
## 16   X3     0         NA
## 17  X17     0         NA
## 18  X10     0         NA
## 19   X7     0         NA
## 20  X18     0         NA

The best subset model selected all the correct predictors

Part g)
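
The quantity computed below is the one from the exercise statement, restated here for clarity: \(\sqrt{\sum_{j=1}^{p} (\beta_j - \hat\beta_j^{\,r})^2}\), where \(\hat\beta_j^{\,r}\) is the estimate of the \(j\)th coefficient in the best model with \(r\) predictors (taken as 0 when the predictor is excluded).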

err.coef <- rep(NA, 20)
for(i in 1:20) {
  coef.i <- coef(regfit.full, id=i)
  df.err <- merge(data.frame(beta=names(betas),betas), data.frame(beta=names(coef.i),coef.i), all.x=T)
  df.err[is.na(df.err[,3]),3] <- 0
  err.coef[i] <- sqrt(sum((df.err[,2] - df.err[,3])^2))
}
plot(1:20, err.coef, type="b", main="Coefficient Error", xlab="Number of Predictors")
points(which.min(err.coef), err.coef[which.min(err.coef)], col="red", pch=16)

The coefficient error plot has a very similar shape to the test MSE plot


EXERCISE 11:

Part a)

require(leaps)   # forward and backward selection
require(glmnet)  # ridge and lasso
require(MASS)    # Boston data set
data(Boston)

# split data into training and test sets
set.seed(1)
trainid <- sample(1:nrow(Boston), nrow(Boston)/2)
train <- Boston[trainid,]
test <- Boston[-trainid,]
xmat.train <- model.matrix(crim~., data=train)[,-1]
xmat.test <- model.matrix(crim~., data=test)[,-1]
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# ridge regression model
fit.ridge <- cv.glmnet(xmat.train, train$crim, alpha=0)
(lambda <- fit.ridge$lambda.min)  # optimal lambda
## [1] 0.5982585
pred.ridge <- predict(fit.ridge, s=lambda, newx=xmat.test)
(err.ridge <- mean((test$crim - pred.ridge)^2))  # test error
## [1] 38.36353
predict(fit.ridge, s=lambda, type="coefficients")
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept)  4.429222285
## zn           0.036521710
## indus       -0.061214283
## chas        -0.775621731
## nox         -8.045252067
## rm           1.179585530
## age          0.005683303
## dis         -0.884805550
## rad          0.429755276
## tax          0.003261068
## ptratio     -0.169564880
## black       -0.004036776
## lstat        0.193028502
## medv        -0.171441090
# lasso regression model
fit.lasso <- cv.glmnet(xmat.train, train$crim, alpha=1)
(lambda <- fit.lasso$lambda.min)  # optimal lambda
## [1] 0.2530181
pred.lasso <- predict(fit.lasso, s=lambda, newx=xmat.test)
(err.lasso <- mean((test$crim - pred.lasso)^2))  # test error
## [1] 38.4593
predict(fit.lasso, s=lambda, type="coefficients")
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) -1.803083100
## zn           0.017767800
## indus        .          
## chas        -0.266407052
## nox          .          
## rm           0.510067274
## age          .          
## dis         -0.377582057
## rad          0.457627436
## tax          .          
## ptratio      .          
## black       -0.002413684
## lstat        0.159515883
## medv        -0.093917991
# predict function from chapter 6 labs
predict.regsubsets <- function(object, newdata, id, ...){
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id=id)
  xvars <- names(coefi)
  mat[,xvars]%*%coefi
}

# forward selection
fit.fwd <- regsubsets(crim~., data=train, nvmax=ncol(Boston)-1, method="forward")
(fwd.summary <- summary(fit.fwd))
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = train, nvmax = ncol(Boston) - 
##     1)
## 13 Variables  (and intercept)
##         Forced in Forced out
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## medv        FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           zn  indus chas nox rm  age dis rad tax ptratio black lstat medv
## 1  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   " "   " " 
## 2  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   "*"   " " 
## 3  ( 1 )  " " " "   " "  " " "*" " " " " "*" " " " "     " "   "*"   " " 
## 4  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     " "   " "   "*" 
## 5  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " " "     " "   " "   "*" 
## 6  ( 1 )  "*" " "   " "  "*" "*" " " "*" "*" " " " "     " "   " "   "*" 
## 7  ( 1 )  "*" " "   " "  "*" "*" " " "*" "*" " " " "     " "   "*"   "*" 
## 8  ( 1 )  "*" " "   " "  "*" "*" " " "*" "*" " " "*"     " "   "*"   "*" 
## 9  ( 1 )  "*" " "   " "  "*" "*" " " "*" "*" "*" "*"     " "   "*"   "*" 
## 10  ( 1 ) "*" " "   "*"  "*" "*" " " "*" "*" "*" "*"     " "   "*"   "*" 
## 11  ( 1 ) "*" " "   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"   "*" 
## 12  ( 1 ) "*" " "   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"   "*" 
## 13  ( 1 ) "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"   "*"
err.fwd <- rep(NA, ncol(Boston)-1)
for(i in 1:(ncol(Boston)-1)) {
  pred.fwd <- predict(fit.fwd, test, id=i)
  err.fwd[i] <- mean((test$crim - pred.fwd)^2)
}
plot(err.fwd, type="b", main="Test MSE for Forward Selection", xlab="Number of Predictors")

which.min(err.fwd)
## [1] 4
# backward selection
fit.bwd <- regsubsets(crim~., data=train, nvmax=ncol(Boston)-1, method="backward")
(bwd.summary <- summary(fit.bwd))
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = train, nvmax = ncol(Boston) - 
##     1)
## 13 Variables  (and intercept)
##         Forced in Forced out
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## medv        FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           zn  indus chas nox rm  age dis rad tax ptratio black lstat medv
## 1  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   " "   " " 
## 2  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   "*"   " " 
## 3  ( 1 )  " " " "   " "  " " "*" " " " " "*" " " " "     " "   "*"   " " 
## 4  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     " "   " "   "*" 
## 5  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " " "     " "   " "   "*" 
## 6  ( 1 )  "*" " "   " "  "*" "*" " " "*" "*" " " " "     " "   " "   "*" 
## 7  ( 1 )  "*" " "   " "  "*" "*" " " "*" "*" " " " "     " "   "*"   "*" 
## 8  ( 1 )  "*" " "   " "  "*" "*" " " "*" "*" " " "*"     " "   "*"   "*" 
## 9  ( 1 )  "*" " "   " "  "*" "*" " " "*" "*" "*" "*"     " "   "*"   "*" 
## 10  ( 1 ) "*" " "   "*"  "*" "*" " " "*" "*" "*" "*"     " "   "*"   "*" 
## 11  ( 1 ) "*" " "   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"   "*" 
## 12  ( 1 ) "*" " "   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"   "*" 
## 13  ( 1 ) "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"   "*"
err.bwd <- rep(NA, ncol(Boston)-1)
for(i in 1:(ncol(Boston)-1)) {
  pred.bwd <- predict(fit.bwd, test, id=i)
  err.bwd[i] <- mean((test$crim - pred.bwd)^2)
}
plot(err.bwd, type="b", main="Test MSE for Backward Selection", xlab="Number of Predictors")

which.min(err.bwd)
## [1] 4
par(mfrow=c(3,2))

min.cp <- which.min(fwd.summary$cp)  
plot(fwd.summary$cp, xlab="Number of Poly(X)", ylab="Forward Selection Cp", type="l")
points(min.cp, fwd.summary$cp[min.cp], col="red", pch=4, lwd=5)

min.cp <- which.min(bwd.summary$cp)  
plot(bwd.summary$cp, xlab="Number of Poly(X)", ylab="Backward Selection Cp", type="l")
points(min.cp, bwd.summary$cp[min.cp], col="red", pch=4, lwd=5)

min.bic <- which.min(fwd.summary$bic)  
plot(fwd.summary$bic, xlab="Number of Poly(X)", ylab="Forward Selection BIC", type="l")
points(min.bic, fwd.summary$bic[min.bic], col="red", pch=4, lwd=5)

min.bic <- which.min(bwd.summary$bic)  
plot(bwd.summary$bic, xlab="Number of Poly(X)", ylab="Backward Selection BIC", type="l")
points(min.bic, bwd.summary$bic[min.bic], col="red", pch=4, lwd=5)

min.adjr2 <- which.max(fwd.summary$adjr2)  
plot(fwd.summary$adjr2, xlab="Number of Poly(X)", ylab="Forward Selection Adjusted R^2", type="l")
points(min.adjr2, fwd.summary$adjr2[min.adjr2], col="red", pch=4, lwd=5)

min.adjr2 <- which.max(bwd.summary$adjr2)  
plot(bwd.summary$adjr2, xlab="Number of Poly(X)", ylab="Backward Selection Adjusted R^2", type="l")
points(min.adjr2, bwd.summary$adjr2[min.adjr2], col="red", pch=4, lwd=5)

Part b)

err.ridge
## [1] 38.36353
err.lasso
## [1] 38.4593
err.fwd
##  [1] 41.20712 39.46705 40.40271 38.96427 39.28537 40.02194 39.63462
##  [8] 39.56733 39.64198 39.59028 39.27845 39.33711 39.27592
err.bwd
##  [1] 41.20712 39.46705 40.40271 38.96427 39.28537 40.02194 39.63462
##  [8] 39.56733 39.64198 39.59028 39.27845 39.33711 39.27592

I would probably choose the lasso model: its test MSE is close to the best of the models fit, and it zeroes out several predictors, reducing model complexity
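
For a side-by-side comparison (a small sketch using the test errors computed above):

err.all <- c(ridge=err.ridge, lasso=err.lasso, fwd=min(err.fwd), bwd=min(err.bwd))
sort(err.all)  # ridge and lasso have the lowest test MSE here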

Part c)

No. The chosen lasso model does not involve all of the features, because not all of the predictors add much value to the model