EXERCISE 1:
Part a)
Best subset will have the smallest training RSS. All three approaches choose the model that minimizes training RSS at each model size, and best subset searches every model that forward and backward selection consider, plus many more.
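As a quick illustration (a sketch on made-up data, not part of the exercise), comparing the training RSS from regsubsets under the three search methods shows the expected ordering:

require(leaps)
set.seed(1)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- as.numeric(x[, 1:3] %*% c(3, -2, 1) + rnorm(n))
dat <- data.frame(y = y, x)
# training RSS by model size for each search method
rss.best <- summary(regsubsets(y ~ ., dat, nvmax = p, method = "exhaustive"))$rss
rss.fwd  <- summary(regsubsets(y ~ ., dat, nvmax = p, method = "forward"))$rss
rss.bwd  <- summary(regsubsets(y ~ ., dat, nvmax = p, method = "backward"))$rss
# exhaustive search should never be beaten on training RSS at any size
all(rss.best <= rss.fwd + 1e-8) & all(rss.best <= rss.bwd + 1e-8)  # expected TRUE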
Part b)
The model with the best test RSS could come from any of the three approaches. Best subset can easily overfit when the number of predictors \(p\) is large relative to the number of observations \(n\). Forward and backward selection need not converge on the same model; they consider the same number of models, and it is hard to say in advance which procedure would generalize better.
Part c)
EXERCISE 2:
Part a)
Part b)
Part c)
EXERCISE 3:
Part a)
Part b)
Part c)
Part d)
Part e)
EXERCISE 4:
This problem parallels Exercise 3, but for ridge regression instead of the lasso and parameterized by \(\lambda\) instead of the budget \(s\). Directionally, ridge and lasso behave the same for each part; the difference is that increasing \(\lambda\) imposes a heavier penalty, which is equivalent to shrinking the budget \(s\), so the answers to Exercise 4 are the answers to Exercise 3 flipped horizontally.
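As a rough numeric illustration (a sketch on arbitrary simulated data, not required by the exercise), the \(\ell_2\) norm of the fitted ridge coefficients shrinks as \(\lambda\) grows, which is exactly what "reducing the budget \(s\)" means in the constrained formulation:

require(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 5), 100, 5)
y <- x %*% c(2, -1, 0, 0, 3) + rnorm(100)
fit.ridge <- glmnet(x, y, alpha = 0)  # ridge path over glmnet's default lambda grid
# l2 norm of the coefficient vector (excluding the intercept) at each lambda
l2.norm <- apply(as.matrix(coef(fit.ridge))[-1, ], 2, function(b) sqrt(sum(b^2)))
plot(log(fit.ridge$lambda), l2.norm, type = "l",
     xlab = "log(lambda)", ylab = "l2 norm of ridge coefficients")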
Part a)
Part b)
Part c)
Part d)
Part e)
EXERCISE 5:
Part a)
Ridge: minimize \((y_1 - \hat\beta_1x_{11} - \hat\beta_2x_{12})^2 + (y_2 - \hat\beta_1x_{21} - \hat\beta_2x_{22})^2 + \lambda (\hat\beta_1^2 + \hat\beta_2^2)\)
Part b)
Step 1: Expanding the equation from Part a:
\[(y_1 - \hat\beta_1 x_{11} - \hat\beta_2 x_{12})^2 + (y_2 - \hat\beta_1 x_{21} - \hat\beta_2 x_{22})^2 + \lambda (\hat\beta_1^2 + \hat\beta_2^2) \\ = (y_1^2 + \hat\beta_1^2 x_{11}^2 + \hat\beta_2^2 x_{12}^2 - 2 \hat\beta_1 x_{11} y_1 - 2 \hat\beta_2 x_{12} y_1 + 2 \hat\beta_1 \hat\beta_2 x_{11} x_{12}) \\ + (y_2^2 + \hat\beta_1^2 x_{21}^2 + \hat\beta_2^2 x_{22}^2 - 2 \hat\beta_1 x_{21} y_2 - 2 \hat\beta_2 x_{22} y_2 + 2 \hat\beta_1 \hat\beta_2 x_{21} x_{22}) \\ + \lambda \hat\beta_1^2 + \lambda \hat\beta_2^2\]
Step 2: Taking the partial derivative with respect to \(\hat\beta_1\) and setting it equal to 0 to minimize:
\[\frac{\partial }{\partial \hat\beta_1}: (2\hat\beta_1x_{11}^2-2x_{11}y_1+2\hat\beta_2x_{11}x_{12}) + (2\hat\beta_1x_{21}^2-2x_{21}y_2+2\hat\beta_2x_{21}x_{22}) + 2\lambda\hat\beta_1 = 0\]
Step 3: Setting \(x_{11}=x_{12}=x_1\) and \(x_{21}=x_{22}=x_2\) and dividing both sides of the equation by 2:
\[(\hat\beta_1x_1^2-x_1y_1+\hat\beta_2x_1^2) + (\hat\beta_1x_2^2-x_2y_2+\hat\beta_2x_2^2) + \lambda\hat\beta_1 = 0\]
\[\hat\beta_1 (x_1^2+x_2^2) + \hat\beta_2 (x_1^2+x_2^2) + \lambda\hat\beta_1 = x_1y_1 + x_2y_2\]
Step 4: Add \(2\hat\beta_1x_1x_2\) and \(2\hat\beta_2x_1x_2\) to both sides of the equation:
\[\hat\beta_1 (x_1^2 + x_2^2 + 2x_1x_2) + \hat\beta_2 (x_1^2 + x_2^2 + 2x_1x_2) + \lambda\hat\beta_1 = x_1y_1 + x_2y_2 + 2\hat\beta_1x_1x_2 + 2\hat\beta_2x_1x_2 \\ \hat\beta_1 (x_1 + x_2)^2 + \hat\beta_2 (x_1 + x_2)^2 + \lambda\hat\beta_1 = x_1y_1 + x_2y_2 + 2\hat\beta_1x_1x_2 + 2\hat\beta_2x_1x_2\]
Step 5: Because \(x_1+x_2=0\), we can eliminate the first two terms:
\[\lambda\hat\beta_1 = x_1y_1 + x_2y_2 + 2\hat\beta_1x_1x_2 + 2\hat\beta_2x_1x_2\]
Step 6: Similarly, taking the partial derivative with respect to \(\hat\beta_2\) gives:
\[\lambda\hat\beta_2 = x_1y_1 + x_2y_2 + 2\hat\beta_1x_1x_2 + 2\hat\beta_2x_1x_2\]
Step 7: The right-hand sides of the equations for \(\lambda\hat\beta_1\) and \(\lambda\hat\beta_2\) are identical, so the left-hand sides must be equal:
\[\lambda\hat\beta_1 = \lambda\hat\beta_2\]
\[\hat\beta_1 = \hat\beta_2\]
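A quick numeric sanity check (a sketch with made-up numbers that satisfy the exercise's conditions), using the closed-form ridge solution \((X^TX + \lambda I)^{-1}X^Ty\):

# toy data with x11 = x12 = x1, x21 = x22 = x2, x1 + x2 = 0, y1 + y2 = 0
x1 <- 2; x2 <- -2; y1 <- 3; y2 <- -3; lambda <- 1.5
X <- cbind(c(x1, x2), c(x1, x2))   # two identical predictor columns
y <- c(y1, y2)
beta.ridge <- solve(t(X) %*% X + lambda * diag(2), t(X) %*% y)
beta.ridge                         # the two entries should be equal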
Part c)
Lasso: minimize \((y_1 - \hat\beta_1x_{11} - \hat\beta_2x_{12})^2 + (y_2 - \hat\beta_1x_{21} - \hat\beta_2x_{22})^2 + \lambda (|\hat\beta_1| + |\hat\beta_2|)\)
Part d)
Replacing the ridge penalty from Part b with the lasso penalty, its derivative with respect to \(\hat\beta\) (for \(\hat\beta \ne 0\)) is:
\[\frac{\partial }{\partial \hat\beta} \left(\lambda |\hat\beta|\right) = \lambda\frac{|\hat\beta|}{\hat\beta}\]
Following through the steps in Part b, we get:
\[\lambda\frac{|\hat\beta_1|}{\hat\beta_1} = \lambda\frac{|\hat\beta_2|}{\hat\beta_2}\]
So it seems the lasso only requires that \(\hat\beta_1\) and \(\hat\beta_2\) are both positive or both negative (ignoring the possibility of 0), so the individual lasso coefficients are not uniquely determined
EXERCISE 6:
Part a)
betas <- seq(-10,10,0.1)
eq.ridge <- function(beta, y=7, lambda=10) (y-beta)^2 + lambda*beta^2
plot(betas, eq.ridge(betas), xlab="beta", main="Ridge Regression Optimization", pch=1)
points(7/(1+10), eq.ridge(7/(1+10)), pch=16, col="red", cex=2)
For \(y=7\) and \(\lambda=10\), \(\hat\beta=\frac{7}{1+10}\) minimizes the ridge regression equation
Part b)
betas <- seq(-10,10,0.1)
eq.lasso <- function(beta, y=7, lambda=10) (y-beta)^2 + lambda*abs(beta)
plot(betas, eq.lasso(betas), xlab="beta", main="Lasso Regression Optimization", pch=1)
points(7-10/2, eq.lasso(7-10/2), pch=16, col="red", cex=2)
For \(y=7\) and \(\lambda=10\), \(\hat\beta=7-\frac{10}{2}\) minimizes the lasso equation
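As a cross-check of both closed forms (a sketch using base R's optimize(), not required by the exercise):

# numerical minimizers should match y/(1 + lambda) for ridge and y - lambda/2 for the lasso (since y > lambda/2)
y <- 7; lambda <- 10
optimize(function(b) (y - b)^2 + lambda * b^2, interval = c(-10, 10))$minimum   # ~ 7/11 = 0.636
optimize(function(b) (y - b)^2 + lambda * abs(b), interval = c(-10, 10))$minimum  # ~ 7 - 5 = 2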
EXERCISE 7:
Part a)
[… will come back to this. maybe.]
Part b)
[… will come back to this. maybe.]
Part c)
[… will come back to this. maybe.]
Part d)
[… will come back to this. maybe.]
Part e)
[… will come back to this. maybe.]
EXERCISE 8:
Part a)
set.seed(1)
X <- rnorm(100)
eps <- rnorm(100)
Part b)
Y <- 5 + 10*X - 3*X^2 + 2*X^3 + eps  # true model: beta0=5, beta1=10, beta2=-3, beta3=2
plot(X,Y)
Part c)
require(leaps)
regfit.full <- regsubsets(Y~poly(X,10,raw=T), data=data.frame(Y,X), nvmax=10)
(reg.summary <- summary(regfit.full))
## Subset selection object
## Call: regsubsets.formula(Y ~ poly(X, 10, raw = T), data = data.frame(Y,
## X), nvmax = 10)
## 10 Variables (and intercept)
## Forced in Forced out
## poly(X, 10, raw = T)1 FALSE FALSE
## poly(X, 10, raw = T)2 FALSE FALSE
## poly(X, 10, raw = T)3 FALSE FALSE
## poly(X, 10, raw = T)4 FALSE FALSE
## poly(X, 10, raw = T)5 FALSE FALSE
## poly(X, 10, raw = T)6 FALSE FALSE
## poly(X, 10, raw = T)7 FALSE FALSE
## poly(X, 10, raw = T)8 FALSE FALSE
## poly(X, 10, raw = T)9 FALSE FALSE
## poly(X, 10, raw = T)10 FALSE FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: exhaustive
## poly(X, 10, raw = T)1 poly(X, 10, raw = T)2
## 1 ( 1 ) "*" " "
## 2 ( 1 ) "*" "*"
## 3 ( 1 ) "*" "*"
## 4 ( 1 ) "*" "*"
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## poly(X, 10, raw = T)3 poly(X, 10, raw = T)4
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) "*" " "
## 4 ( 1 ) "*" " "
## 5 ( 1 ) "*" " "
## 6 ( 1 ) "*" " "
## 7 ( 1 ) "*" " "
## 8 ( 1 ) "*" "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## poly(X, 10, raw = T)5 poly(X, 10, raw = T)6
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) "*" " "
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) " " " "
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) " " "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## poly(X, 10, raw = T)7 poly(X, 10, raw = T)8
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) " " "*"
## 8 ( 1 ) " " "*"
## 9 ( 1 ) " " "*"
## 10 ( 1 ) "*" "*"
## poly(X, 10, raw = T)9 poly(X, 10, raw = T)10
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) "*" " "
## 7 ( 1 ) " " "*"
## 8 ( 1 ) "*" "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
par(mfrow=c(3,1))
min.cp <- which.min(reg.summary$cp)
plot(reg.summary$cp, xlab="Number of Poly(X)", ylab="Best Subset Cp", type="l")
points(min.cp, reg.summary$cp[min.cp], col="red", pch=4, lwd=5)
min.bic <- which.min(reg.summary$bic)
plot(reg.summary$bic, xlab="Number of Poly(X)", ylab="Best Subset BIC", type="l")
points(min.bic, reg.summary$bic[min.bic], col="red", pch=4, lwd=5)
min.adjr2 <- which.max(reg.summary$adjr2)
plot(reg.summary$adjr2, xlab="Number of Poly(X)", ylab="Best Subset Adjusted R^2", type="l")
points(min.adjr2, reg.summary$adjr2[min.adjr2], col="red", pch=4, lwd=5)
coef(regfit.full, min.cp)
## (Intercept) poly(X, 10, raw = T)1 poly(X, 10, raw = T)2
## 5.07200775 10.38745596 -3.15424359
## poly(X, 10, raw = T)3 poly(X, 10, raw = T)5
## 1.55797426 0.08072292
coef(regfit.full, min.bic)
## (Intercept) poly(X, 10, raw = T)1 poly(X, 10, raw = T)2
## 5.061507 9.975280 -3.123791
## poly(X, 10, raw = T)3
## 2.017639
coef(regfit.full, min.adjr2)
## (Intercept) poly(X, 10, raw = T)1 poly(X, 10, raw = T)2
## 5.07200775 10.38745596 -3.15424359
## poly(X, 10, raw = T)3 poly(X, 10, raw = T)5
## 1.55797426 0.08072292
Part d)
# forward selection
regfit.fwd <- regsubsets(Y~poly(X,10,raw=T), data=data.frame(Y,X), nvmax=10, method="forward")
(fwd.summary <- summary(regfit.fwd))
## Subset selection object
## Call: regsubsets.formula(Y ~ poly(X, 10, raw = T), data = data.frame(Y,
## X), nvmax = 10)
## 10 Variables (and intercept)
## Forced in Forced out
## poly(X, 10, raw = T)1 FALSE FALSE
## poly(X, 10, raw = T)2 FALSE FALSE
## poly(X, 10, raw = T)3 FALSE FALSE
## poly(X, 10, raw = T)4 FALSE FALSE
## poly(X, 10, raw = T)5 FALSE FALSE
## poly(X, 10, raw = T)6 FALSE FALSE
## poly(X, 10, raw = T)7 FALSE FALSE
## poly(X, 10, raw = T)8 FALSE FALSE
## poly(X, 10, raw = T)9 FALSE FALSE
## poly(X, 10, raw = T)10 FALSE FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: exhaustive
## poly(X, 10, raw = T)1 poly(X, 10, raw = T)2
## 1 ( 1 ) "*" " "
## 2 ( 1 ) "*" "*"
## 3 ( 1 ) "*" "*"
## 4 ( 1 ) "*" "*"
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## poly(X, 10, raw = T)3 poly(X, 10, raw = T)4
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) "*" " "
## 4 ( 1 ) "*" " "
## 5 ( 1 ) "*" " "
## 6 ( 1 ) "*" " "
## 7 ( 1 ) "*" " "
## 8 ( 1 ) "*" "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## poly(X, 10, raw = T)5 poly(X, 10, raw = T)6
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) "*" " "
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) " " " "
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) " " "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## poly(X, 10, raw = T)7 poly(X, 10, raw = T)8
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) " " "*"
## 8 ( 1 ) " " "*"
## 9 ( 1 ) " " "*"
## 10 ( 1 ) "*" "*"
## poly(X, 10, raw = T)9 poly(X, 10, raw = T)10
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) "*" " "
## 7 ( 1 ) " " "*"
## 8 ( 1 ) "*" "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
# backward selection
regfit.bwd <- regsubsets(Y~poly(X,10,raw=T), data=data.frame(Y,X), nvmax=10, method="backward")
(bwd.summary <- summary(regfit.bwd))
## Subset selection object
## Call: regsubsets.formula(Y ~ poly(X, 10, raw = T), data = data.frame(Y,
## X), nvmax = 10)
## 10 Variables (and intercept)
## Forced in Forced out
## poly(X, 10, raw = T)1 FALSE FALSE
## poly(X, 10, raw = T)2 FALSE FALSE
## poly(X, 10, raw = T)3 FALSE FALSE
## poly(X, 10, raw = T)4 FALSE FALSE
## poly(X, 10, raw = T)5 FALSE FALSE
## poly(X, 10, raw = T)6 FALSE FALSE
## poly(X, 10, raw = T)7 FALSE FALSE
## poly(X, 10, raw = T)8 FALSE FALSE
## poly(X, 10, raw = T)9 FALSE FALSE
## poly(X, 10, raw = T)10 FALSE FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: exhaustive
## poly(X, 10, raw = T)1 poly(X, 10, raw = T)2
## 1 ( 1 ) "*" " "
## 2 ( 1 ) "*" "*"
## 3 ( 1 ) "*" "*"
## 4 ( 1 ) "*" "*"
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## poly(X, 10, raw = T)3 poly(X, 10, raw = T)4
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) "*" " "
## 4 ( 1 ) "*" " "
## 5 ( 1 ) "*" " "
## 6 ( 1 ) "*" " "
## 7 ( 1 ) "*" " "
## 8 ( 1 ) "*" "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## poly(X, 10, raw = T)5 poly(X, 10, raw = T)6
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) "*" " "
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) " " " "
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) " " "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## poly(X, 10, raw = T)7 poly(X, 10, raw = T)8
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) " " "*"
## 8 ( 1 ) " " "*"
## 9 ( 1 ) " " "*"
## 10 ( 1 ) "*" "*"
## poly(X, 10, raw = T)9 poly(X, 10, raw = T)10
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) "*" " "
## 7 ( 1 ) " " "*"
## 8 ( 1 ) "*" "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
par(mfrow=c(3,2))
min.cp <- which.min(fwd.summary$cp)
plot(fwd.summary$cp, xlab="Number of Poly(X)", ylab="Forward Selection Cp", type="l")
points(min.cp, fwd.summary$cp[min.cp], col="red", pch=4, lwd=5)
min.cp <- which.min(bwd.summary$cp)
plot(bwd.summary$cp, xlab="Number of Poly(X)", ylab="Backward Selection Cp", type="l")
points(min.cp, bwd.summary$cp[min.cp], col="red", pch=4, lwd=5)
min.bic <- which.min(fwd.summary$bic)
plot(fwd.summary$bic, xlab="Number of Poly(X)", ylab="Forward Selection BIC", type="l")
points(min.bic, fwd.summary$bic[min.bic], col="red", pch=4, lwd=5)
min.bic <- which.min(bwd.summary$bic)
plot(bwd.summary$bic, xlab="Number of Poly(X)", ylab="Backward Selection BIC", type="l")
points(min.bic, bwd.summary$bic[min.bic], col="red", pch=4, lwd=5)
min.adjr2 <- which.max(fwd.summary$adjr2)
plot(fwd.summary$adjr2, xlab="Number of Poly(X)", ylab="Forward Selection Adjusted R^2", type="l")
points(min.adjr2, fwd.summary$adjr2[min.adjr2], col="red", pch=4, lwd=5)
min.adjr2 <- which.max(bwd.summary$adjr2)
plot(bwd.summary$adjr2, xlab="Number of Poly(X)", ylab="Backward Selection Adjusted R^2", type="l")
points(min.adjr2, bwd.summary$adjr2[min.adjr2], col="red", pch=4, lwd=5)
# coefficients of selected models
coef(regfit.fwd, which.min(fwd.summary$cp))
## (Intercept) poly(X, 10, raw = T)1 poly(X, 10, raw = T)2
## 5.07200775 10.38745596 -3.15424359
## poly(X, 10, raw = T)3 poly(X, 10, raw = T)5
## 1.55797426 0.08072292
coef(regfit.bwd, which.min(bwd.summary$cp))
## (Intercept) poly(X, 10, raw = T)1 poly(X, 10, raw = T)2
## 5.07200775 10.38745596 -3.15424359
## poly(X, 10, raw = T)3 poly(X, 10, raw = T)5
## 1.55797426 0.08072292
coef(regfit.fwd, which.min(fwd.summary$bic))
## (Intercept) poly(X, 10, raw = T)1 poly(X, 10, raw = T)2
## 5.061507 9.975280 -3.123791
## poly(X, 10, raw = T)3
## 2.017639
coef(regfit.bwd, which.min(bwd.summary$bic))
## (Intercept) poly(X, 10, raw = T)1 poly(X, 10, raw = T)2
## 5.061507 9.975280 -3.123791
## poly(X, 10, raw = T)3
## 2.017639
coef(regfit.fwd, which.max(fwd.summary$adjr2))
## (Intercept) poly(X, 10, raw = T)1 poly(X, 10, raw = T)2
## 5.07200775 10.38745596 -3.15424359
## poly(X, 10, raw = T)3 poly(X, 10, raw = T)5
## 1.55797426 0.08072292
coef(regfit.bwd, which.max(bwd.summary$adjr2))
## (Intercept) poly(X, 10, raw = T)1 poly(X, 10, raw = T)2
## 5.07200775 10.38745596 -3.15424359
## poly(X, 10, raw = T)3 poly(X, 10, raw = T)5
## 1.55797426 0.08072292
Best subset, forward selection and backward selection all resulted in the same best models
Part e)
require(glmnet)
xmat <- model.matrix(Y~poly(X,10,raw=T))[,-1]
lasso.mod <- cv.glmnet(xmat, Y, alpha=1)
(lambda <- lasso.mod$lambda.min)
## [1] 0.06392877
par(mfrow=c(1,1))
plot(lasso.mod)
predict(lasso.mod, s=lambda, type="coefficients")
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 5.0197864204
## poly(X, 10, raw = T)1 10.1772066700
## poly(X, 10, raw = T)2 -3.0727099518
## poly(X, 10, raw = T)3 1.7336850893
## poly(X, 10, raw = T)4 .
## poly(X, 10, raw = T)5 0.0451934932
## poly(X, 10, raw = T)6 .
## poly(X, 10, raw = T)7 0.0002654849
## poly(X, 10, raw = T)8 .
## poly(X, 10, raw = T)9 .
## poly(X, 10, raw = T)10 .
The lasso keeps the true predictors \(X\), \(X^2\) and \(X^3\) with coefficients close to their true values, though it also retains \(X^5\) and \(X^7\) with very small coefficients
Part f)
Y2 <- 5 + 1.5*X^7 + eps  # new response: only the X^7 term has a nonzero coefficient
# best subset model selection
regfit.full <- regsubsets(Y2~poly(X,10,raw=T), data=data.frame(Y,X), nvmax=10)
par(mfrow=c(3,1))
(reg.summary <- summary(regfit.full))
## Subset selection object
## Call: regsubsets.formula(Y2 ~ poly(X, 10, raw = T), data = data.frame(Y,
## X), nvmax = 10)
## 10 Variables (and intercept)
## Forced in Forced out
## poly(X, 10, raw = T)1 FALSE FALSE
## poly(X, 10, raw = T)2 FALSE FALSE
## poly(X, 10, raw = T)3 FALSE FALSE
## poly(X, 10, raw = T)4 FALSE FALSE
## poly(X, 10, raw = T)5 FALSE FALSE
## poly(X, 10, raw = T)6 FALSE FALSE
## poly(X, 10, raw = T)7 FALSE FALSE
## poly(X, 10, raw = T)8 FALSE FALSE
## poly(X, 10, raw = T)9 FALSE FALSE
## poly(X, 10, raw = T)10 FALSE FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: exhaustive
## poly(X, 10, raw = T)1 poly(X, 10, raw = T)2
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " "*"
## 3 ( 1 ) " " "*"
## 4 ( 1 ) "*" "*"
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) "*" " "
## 7 ( 1 ) "*" " "
## 8 ( 1 ) "*" "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## poly(X, 10, raw = T)3 poly(X, 10, raw = T)4
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) "*" " "
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) "*" " "
## 7 ( 1 ) "*" " "
## 8 ( 1 ) "*" "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## poly(X, 10, raw = T)5 poly(X, 10, raw = T)6
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) "*" " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) " " "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) " " "*"
## 9 ( 1 ) " " "*"
## 10 ( 1 ) "*" "*"
## poly(X, 10, raw = T)7 poly(X, 10, raw = T)8
## 1 ( 1 ) "*" " "
## 2 ( 1 ) "*" " "
## 3 ( 1 ) "*" " "
## 4 ( 1 ) "*" " "
## 5 ( 1 ) "*" " "
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## poly(X, 10, raw = T)9 poly(X, 10, raw = T)10
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) " " "*"
## 7 ( 1 ) " " "*"
## 8 ( 1 ) " " "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
min.cp <- which.min(reg.summary$cp)
plot(reg.summary$cp, xlab="Number of Poly(X)", ylab="Best Subset Cp", type="l")
points(min.cp, reg.summary$cp[min.cp], col="red", pch=4, lwd=5)
min.bic <- which.min(reg.summary$bic)
plot(reg.summary$bic, xlab="Number of Poly(X)", ylab="Best Subset BIC", type="l")
points(min.bic, reg.summary$bic[min.bic], col="red", pch=4, lwd=5)
min.adjr2 <- which.max(reg.summary$adjr2)
plot(reg.summary$adjr2, xlab="Number of Poly(X)", ylab="Best Subset Adjusted R^2", type="l")
points(min.adjr2, reg.summary$adjr2[min.adjr2], col="red", pch=4, lwd=5)
coef(regfit.full, min.cp)
## (Intercept) poly(X, 10, raw = T)2 poly(X, 10, raw = T)7
## 5.0704904 -0.1417084 1.5015552
coef(regfit.full, min.bic)
## (Intercept) poly(X, 10, raw = T)7
## 4.95894 1.50077
coef(regfit.full, min.adjr2)
## (Intercept) poly(X, 10, raw = T)1 poly(X, 10, raw = T)2
## 5.0762524 0.2914016 -0.1617671
## poly(X, 10, raw = T)3 poly(X, 10, raw = T)7
## -0.2526527 1.5091338
# lasso regression
xmat <- model.matrix(Y2~poly(X,10,raw=T))[,-1]
lasso.mod <- cv.glmnet(xmat, Y2, alpha=1)
(lambda <- lasso.mod$lambda.min)
## [1] 2.910056
par(mfrow=c(1,1))
plot(lasso.mod)
predict(lasso.mod, s=lambda, type="coefficients")
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 5.161575
## poly(X, 10, raw = T)1 .
## poly(X, 10, raw = T)2 .
## poly(X, 10, raw = T)3 .
## poly(X, 10, raw = T)4 .
## poly(X, 10, raw = T)5 .
## poly(X, 10, raw = T)6 .
## poly(X, 10, raw = T)7 1.452757
## poly(X, 10, raw = T)8 .
## poly(X, 10, raw = T)9 .
## poly(X, 10, raw = T)10 .
The lasso selects the correct one-predictor model (\(X^7\)), while the best subset criteria disagree: BIC also picks the one-predictor model, but Cp and adjusted \(R^2\) add one and three small noise terms respectively
EXERCISE 9:
Part a)
require(ISLR)
data(College)
set.seed(1)
trainid <- sample(1:nrow(College), nrow(College)/2)
train <- College[trainid,]
test <- College[-trainid,]
str(College)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Accept : num 1232 1924 1097 349 146 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: num 2885 2683 1036 510 249 ...
## $ P.Undergrad: num 537 1227 99 63 869 ...
## $ Outstate : num 7440 12280 11250 12960 7560 ...
## $ Room.Board : num 3300 6450 3750 5450 4120 ...
## $ Books : num 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : num 2200 1500 1165 875 1500 ...
## $ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : num 7041 10527 8735 19016 10922 ...
## $ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
Part b)
fit.lm <- lm(Apps~., data=train)
pred.lm <- predict(fit.lm, test)
(err.lm <- mean((test$Apps - pred.lm)^2)) # test error
## [1] 1108531
Part c)
require(glmnet)
xmat.train <- model.matrix(Apps~., data=train)[,-1]
xmat.test <- model.matrix(Apps~., data=test)[,-1]
fit.ridge <- cv.glmnet(xmat.train, train$Apps, alpha=0)
(lambda <- fit.ridge$lambda.min) # optimal lambda
## [1] 450.7435
pred.ridge <- predict(fit.ridge, s=lambda, newx=xmat.test)
(err.ridge <- mean((test$Apps - pred.ridge)^2)) # test error
## [1] 1037616
Part d)
require(glmnet)
xmat.train <- model.matrix(Apps~., data=train)[,-1]
xmat.test <- model.matrix(Apps~., data=test)[,-1]
fit.lasso <- cv.glmnet(xmat.train, train$Apps, alpha=1)
(lambda <- fit.lasso$lambda.min) # optimal lambda
## [1] 29.65591
pred.lasso <- predict(fit.lasso, s=lambda, newx=xmat.test)
(err.lasso <- mean((test$Apps - pred.lasso)^2)) # test error
## [1] 1025248
coef.lasso <- predict(fit.lasso, type="coefficients", s=lambda)[1:ncol(College),]
coef.lasso[coef.lasso != 0]
## (Intercept) PrivateYes Accept Enroll Top10perc
## -4.514223e+02 -4.814352e+02 1.535012e+00 -4.035455e-01 4.668170e+01
## Top25perc F.Undergrad Outstate Room.Board PhD
## -7.091364e+00 -3.426988e-03 -5.020064e-02 1.851610e-01 -3.849885e+00
## Terminal perc.alumni Expend Grad.Rate
## -3.443747e+00 -2.117870e+00 3.192846e-02 2.695730e+00
length(coef.lasso[coef.lasso != 0])
## [1] 14
Part e)
require(pls)
set.seed(1)
fit.pcr <- pcr(Apps~., data=train, scale=TRUE, validation="CV")
validationplot(fit.pcr, val.type="MSEP")
summary(fit.pcr)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4335 4179 2364 2374 1996 1844 1845
## adjCV 4335 4182 2360 2374 1788 1831 1838
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1850 1863 1809 1809 1812 1815 1825
## adjCV 1844 1857 1801 1800 1804 1808 1817
## 14 comps 15 comps 16 comps 17 comps
## CV 1810 1823 1273 1281
## adjCV 1806 1789 1260 1268
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 31.216 57.68 64.73 70.55 76.33 81.30 85.01
## Apps 6.976 71.47 71.58 83.32 83.44 83.45 83.46
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 88.40 91.16 93.36 95.38 96.94 97.96 98.76
## Apps 83.47 84.53 84.86 84.98 84.98 84.99 85.24
## 15 comps 16 comps 17 comps
## X 99.40 99.87 100.00
## Apps 90.87 93.93 93.97
pred.pcr <- predict(fit.pcr, test, ncomp=16) # min Cv at M=16
(err.pcr <- mean((test$Apps - pred.pcr)^2)) # test error
## [1] 1166897
Part f)
require(pls)
set.seed(1)
fit.pls <- plsr(Apps~., data=train, scale=TRUE, validation="CV")
validationplot(fit.pls, val.type="MSEP")
summary(fit.pls)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4335 2176 1893 1725 1613 1406 1312
## adjCV 4335 2171 1884 1715 1578 1375 1295
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1297 1285 1280 1278 1279 1282 1281
## adjCV 1281 1271 1267 1265 1266 1269 1268
## 14 comps 15 comps 16 comps 17 comps
## CV 1281 1281 1281 1281
## adjCV 1267 1267 1268 1268
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 26.91 43.08 63.26 65.16 68.50 73.75 76.10
## Apps 76.64 83.93 87.14 91.90 93.49 93.85 93.91
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 79.03 81.76 85.41 89.03 91.38 93.31 95.43
## Apps 93.94 93.96 93.96 93.96 93.97 93.97 93.97
## 15 comps 16 comps 17 comps
## X 97.41 98.78 100.00
## Apps 93.97 93.97 93.97
pred.pls <- predict(fit.pls, test, ncomp=10) # min Cv at M=10
(err.pls <- mean((test$Apps - pred.pls)^2)) # test error
## [1] 1134531
Part g)
err.all <- c(err.lm, err.ridge, err.lasso, err.pcr, err.pls)
names(err.all) <- c("lm", "ridge", "lasso", "pcr", "pls")
barplot(err.all)
The test errors are not very different. Ridge and lasso perform slightly better, while PCR and PLS show no improvement over the full linear regression model.
plot(test$Apps, pred.lm)  # compare linear model predictions to actual Apps in the test set
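To put these errors in context (a small add-on reusing the err.all vector and the test set defined above), the test \(R^2\) for each method can be computed as:

# test R^2 = 1 - test MSE / variance of Apps in the test set
test.tss <- mean((test$Apps - mean(test$Apps))^2)
round(1 - err.all / test.tss, 3)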
EXERCISE 10:
Part a)
set.seed(1)
eps <- rnorm(1000)
xmat <- matrix(rnorm(1000*20), ncol=20)
betas <- sample(-5:5, 20, replace=TRUE)
betas[c(3,6,7,10,13,17)] <- 0  # zero out six coefficients (beta18 also happens to be sampled as 0)
betas
## [1] 2 3 0 5 1 0 0 -3 3 0 4 3 0 -4 -4 -2 0 0 -5 2
y <- xmat %*% betas + eps
Part b)
set.seed(1)
trainid <- sample(1:1000, 100, replace=FALSE)
xmat.train <- xmat[trainid,]
xmat.test <- xmat[-trainid,]
y.train <- y[trainid,]
y.test <- y[-trainid,]
train <- data.frame(y=y.train, xmat.train)
test <- data.frame(y=y.test, xmat.test)
Part c)
require(leaps)
# predict function from chapter 6 labs
predict.regsubsets <- function(object, newdata, id, ...){
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id=id)
  xvars <- names(coefi)
  mat[,xvars]%*%coefi
}
regfit.full <- regsubsets(y~., data=train, nvmax=20)
err.full <- rep(NA, 20)
for(i in 1:20) {
  pred.full <- predict(regfit.full, train, id=i)
  err.full[i] <- mean((train$y - pred.full)^2)
}
plot(1:20, err.full, type="b", main="Training MSE", xlab="Number of Predictors")
which.min(err.full) # min for train error should be at max pred count
## [1] 20
Part d)
err.full <- rep(NA, 20)
for(i in 1:20) {
  pred.full <- predict(regfit.full, test, id=i)
  err.full[i] <- mean((test$y - pred.full)^2)
}
plot(1:20, err.full, type="b", main="Test MSE", xlab="Number of Predictors")
Part e)
which.min(err.full) # optimal number of predictors from best subset
## [1] 13
Part f)
(coef.best <- coef(regfit.full, id=which.min(err.full)))
## (Intercept) X1 X2 X4 X5 X8
## -0.1732881 2.0010181 3.0865262 4.9303510 0.9812254 -2.9307495
## X9 X11 X12 X14 X15 X16
## 3.0238521 3.8272702 3.0607539 -3.9277547 -4.0459157 -1.7658311
## X19 X20
## -4.7888341 2.2134924
betas[betas != 0]
## [1] 2 3 5 1 -3 3 4 3 -4 -4 -2 -5 2
names(betas) <- paste0("X", 1:20)
merge(data.frame(beta=names(betas),betas), data.frame(beta=names(coef.best),coef.best), all.x=T, sort=F)
## beta betas coef.best
## 1 X1 2 2.0010181
## 2 X2 3 3.0865262
## 3 X4 5 4.9303510
## 4 X5 1 0.9812254
## 5 X8 -3 -2.9307495
## 6 X9 3 3.0238521
## 7 X11 4 3.8272702
## 8 X12 3 3.0607539
## 9 X14 -4 -3.9277547
## 10 X15 -4 -4.0459157
## 11 X16 -2 -1.7658311
## 12 X19 -5 -4.7888341
## 13 X20 2 2.2134924
## 14 X13 0 NA
## 15 X6 0 NA
## 16 X3 0 NA
## 17 X17 0 NA
## 18 X10 0 NA
## 19 X7 0 NA
## 20 X18 0 NA
The best subset model selected exactly the predictors with nonzero true coefficients
Part g)
err.coef <- rep(NA, 20)
for(i in 1:20) {
  coef.i <- coef(regfit.full, id=i)
  df.err <- merge(data.frame(beta=names(betas),betas), data.frame(beta=names(coef.i),coef.i), all.x=T)
  df.err[is.na(df.err[,3]),3] <- 0
  err.coef[i] <- sqrt(sum((df.err[,2] - df.err[,3])^2))
}
plot(1:20, err.coef, type="b", main="Coefficient Error", xlab="Number of Predictors")
points(which.min(err.coef), err.coef[which.min(err.coef)], col="red", pch=16)
The coefficient error plot is very similar in shape to the test MSE plot
EXERCISE 11:
Part a)
require(leaps) # forward and backward selection
require(glmnet) # ridge and lasso
require(MASS) # Boston data set
data(Boston)
# split data into training and test sets
set.seed(1)
trainid <- sample(1:nrow(Boston), nrow(Boston)/2)
train <- Boston[trainid,]
test <- Boston[-trainid,]
xmat.train <- model.matrix(crim~., data=train)[,-1]
xmat.test <- model.matrix(crim~., data=test)[,-1]
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# ridge regression model
fit.ridge <- cv.glmnet(xmat.train, train$crim, alpha=0)
(lambda <- fit.ridge$lambda.min) # optimal lambda
## [1] 0.5982585
pred.ridge <- predict(fit.ridge, s=lambda, newx=xmat.test)
(err.ridge <- mean((test$crim - pred.ridge)^2)) # test error
## [1] 38.36353
predict(fit.ridge, s=lambda, type="coefficients")
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 4.429222285
## zn 0.036521710
## indus -0.061214283
## chas -0.775621731
## nox -8.045252067
## rm 1.179585530
## age 0.005683303
## dis -0.884805550
## rad 0.429755276
## tax 0.003261068
## ptratio -0.169564880
## black -0.004036776
## lstat 0.193028502
## medv -0.171441090
# lasso regression model
fit.lasso <- cv.glmnet(xmat.train, train$crim, alpha=1)
(lambda <- fit.lasso$lambda.min) # optimal lambda
## [1] 0.2530181
pred.lasso <- predict(fit.lasso, s=lambda, newx=xmat.test)
(err.lasso <- mean((test$crim - pred.lasso)^2)) # test error
## [1] 38.4593
predict(fit.lasso, s=lambda, type="coefficients")
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -1.803083100
## zn 0.017767800
## indus .
## chas -0.266407052
## nox .
## rm 0.510067274
## age .
## dis -0.377582057
## rad 0.457627436
## tax .
## ptratio .
## black -0.002413684
## lstat 0.159515883
## medv -0.093917991
# predict function from chapter 6 labs
predict.regsubsets <- function(object, newdata, id, ...){
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id=id)
  xvars <- names(coefi)
  mat[,xvars]%*%coefi
}
# forward selection
fit.fwd <- regsubsets(crim~., data=train, nvmax=ncol(Boston)-1, method="forward")
(fwd.summary <- summary(fit.fwd))
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = train, nvmax = ncol(Boston) -
## 1)
## 13 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio black lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " "*" " "
## 3 ( 1 ) " " " " " " " " "*" " " " " "*" " " " " " " "*" " "
## 4 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " " " " " "*"
## 5 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " " " " " " " "*"
## 6 ( 1 ) "*" " " " " "*" "*" " " "*" "*" " " " " " " " " "*"
## 7 ( 1 ) "*" " " " " "*" "*" " " "*" "*" " " " " " " "*" "*"
## 8 ( 1 ) "*" " " " " "*" "*" " " "*" "*" " " "*" " " "*" "*"
## 9 ( 1 ) "*" " " " " "*" "*" " " "*" "*" "*" "*" " " "*" "*"
## 10 ( 1 ) "*" " " "*" "*" "*" " " "*" "*" "*" "*" " " "*" "*"
## 11 ( 1 ) "*" " " "*" "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
err.fwd <- rep(NA, ncol(Boston)-1)
for(i in 1:(ncol(Boston)-1)) {
  pred.fwd <- predict(fit.fwd, test, id=i)
  err.fwd[i] <- mean((test$crim - pred.fwd)^2)
}
plot(err.fwd, type="b", main="Test MSE for Forward Selection", xlab="Number of Predictors")
which.min(err.fwd)
## [1] 4
# backward selection
fit.bwd <- regsubsets(crim~., data=train, nvmax=ncol(Boston)-1, method="backward")
(bwd.summary <- summary(fit.bwd))
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = train, nvmax = ncol(Boston) -
## 1)
## 13 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio black lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " "*" " "
## 3 ( 1 ) " " " " " " " " "*" " " " " "*" " " " " " " "*" " "
## 4 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " " " " " "*"
## 5 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " " " " " " " "*"
## 6 ( 1 ) "*" " " " " "*" "*" " " "*" "*" " " " " " " " " "*"
## 7 ( 1 ) "*" " " " " "*" "*" " " "*" "*" " " " " " " "*" "*"
## 8 ( 1 ) "*" " " " " "*" "*" " " "*" "*" " " "*" " " "*" "*"
## 9 ( 1 ) "*" " " " " "*" "*" " " "*" "*" "*" "*" " " "*" "*"
## 10 ( 1 ) "*" " " "*" "*" "*" " " "*" "*" "*" "*" " " "*" "*"
## 11 ( 1 ) "*" " " "*" "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
err.bwd <- rep(NA, ncol(Boston)-1)
for(i in 1:(ncol(Boston)-1)) {
  pred.bwd <- predict(fit.bwd, test, id=i)
  err.bwd[i] <- mean((test$crim - pred.bwd)^2)
}
plot(err.bwd, type="b", main="Test MSE for Backward Selection", xlab="Number of Predictors")
which.min(err.bwd)
## [1] 4
par(mfrow=c(3,2))
min.cp <- which.min(fwd.summary$cp)
plot(fwd.summary$cp, xlab="Number of Predictors", ylab="Forward Selection Cp", type="l")
points(min.cp, fwd.summary$cp[min.cp], col="red", pch=4, lwd=5)
min.cp <- which.min(bwd.summary$cp)
plot(bwd.summary$cp, xlab="Number of Predictors", ylab="Backward Selection Cp", type="l")
points(min.cp, bwd.summary$cp[min.cp], col="red", pch=4, lwd=5)
min.bic <- which.min(fwd.summary$bic)
plot(fwd.summary$bic, xlab="Number of Predictors", ylab="Forward Selection BIC", type="l")
points(min.bic, fwd.summary$bic[min.bic], col="red", pch=4, lwd=5)
min.bic <- which.min(bwd.summary$bic)
plot(bwd.summary$bic, xlab="Number of Predictors", ylab="Backward Selection BIC", type="l")
points(min.bic, bwd.summary$bic[min.bic], col="red", pch=4, lwd=5)
min.adjr2 <- which.max(fwd.summary$adjr2)
plot(fwd.summary$adjr2, xlab="Number of Predictors", ylab="Forward Selection Adjusted R^2", type="l")
points(min.adjr2, fwd.summary$adjr2[min.adjr2], col="red", pch=4, lwd=5)
min.adjr2 <- which.max(bwd.summary$adjr2)
plot(bwd.summary$adjr2, xlab="Number of Predictors", ylab="Backward Selection Adjusted R^2", type="l")
points(min.adjr2, bwd.summary$adjr2[min.adjr2], col="red", pch=4, lwd=5)
Part b)
err.ridge
## [1] 38.36353
err.lasso
## [1] 38.4593
err.fwd
## [1] 41.20712 39.46705 40.40271 38.96427 39.28537 40.02194 39.63462
## [8] 39.56733 39.64198 39.59028 39.27845 39.33711 39.27592
err.bwd
## [1] 41.20712 39.46705 40.40271 38.96427 39.28537 40.02194 39.63462
## [8] 39.56733 39.64198 39.59028 39.27845 39.33711 39.27592
I would probably choose the lasso model: its test MSE is close to the best of the candidates, and it drops several predictors, which reduces model complexity
Part c)
No, because not all the predictors add much value to the model; the lasso fit above drops indus, nox, age, tax and ptratio entirely