EXERCISE 1:
\[ \begin{aligned} Var(\alpha X + (1-\alpha)Y) &= Var(\alpha X) + Var((1-\alpha)Y) + 2\,Cov(\alpha X, (1-\alpha)Y) \\ &= \alpha^2 \sigma_X^2 + (1-\alpha)^2 \sigma_Y^2 + 2 \alpha (1-\alpha) \sigma_{XY} \\ &= \alpha^2 \sigma_X^2 + (1+\alpha^2-2\alpha) \sigma_Y^2 + (2\alpha - 2\alpha^2) \sigma_{XY} \\ &= \alpha^2 \sigma_X^2 + \sigma_Y^2+\alpha^2\sigma_Y^2-2\alpha\sigma_Y^2 + 2\alpha \sigma_{XY} - 2\alpha^2 \sigma_{XY} \end{aligned} \]
Setting the derivative with respect to \(\alpha\) equal to zero:
\[ \frac{\partial}{\partial \alpha} Var(\alpha X + (1-\alpha)Y) = 2\alpha\sigma_X^2 + 2\alpha\sigma_Y^2 - 2\sigma_Y^2 + 2\sigma_{XY} - 4\alpha\sigma_{XY} = 0 \]
\[ (2\sigma_X^2 + 2\sigma_Y^2 - 4\sigma_{XY}) \alpha = 2\sigma_Y^2 - 2\sigma_{XY} \]
\[ \alpha = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}} \]
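As a quick sanity check (not part of the original exercise), we can simulate correlated \(X\) and \(Y\) under assumed variances and covariance and compare the closed-form \(\alpha\) to a grid search; the parameter values and names below are illustrative only.
require(MASS)  # for mvrnorm
set.seed(1)
sigma_x2 <- 1; sigma_y2 <- 2; sigma_xy <- 0.5  # assumed parameters
sim <- mvrnorm(10000, mu=c(0,0), Sigma=matrix(c(sigma_x2, sigma_xy, sigma_xy, sigma_y2), 2))
X <- sim[,1]; Y <- sim[,2]
(alpha_analytic <- (sigma_y2 - sigma_xy) / (sigma_x2 + sigma_y2 - 2*sigma_xy))
alphas <- seq(0, 1, by=0.01)
vars <- sapply(alphas, function(a) var(a*X + (1-a)*Y))  # empirical variance at each alpha
alphas[which.min(vars)]  # should be close to the analytic value (0.75 here)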
EXERCISE 2:
Part a)
The probability that the first bootstrap observation is not the jth observation is \(\frac{n-1}{n}\), since each of the \(n\) observations is equally likely to be selected
Part b)
Because the bootstrap samples with replacement, the second draw is unaffected by the first, so the probability is the same as in Part a): \(\frac{n-1}{n}\)
Part c)
The probability of not selecting the jth observation is \(\frac{n-1}{n}\) for each draw, and the \(n\) draws are independent. The probability that the jth observation never appears in the bootstrap sample is therefore: \((\frac{n-1}{n})^n = (1-\frac{1}{n})^n\)
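Note that as \(n \to \infty\), \((1-\frac{1}{n})^n \to e^{-1} \approx 0.368\), so the probability that the jth observation is in the bootstrap sample approaches \(1 - e^{-1} \approx 0.632\); Parts d) through g) below confirm this numerically.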
Part d)
1-(1-1/5)^5
## [1] 0.67232
Part e)
1-(1-1/100)^100
## [1] 0.6339677
Part f)
1-(1-1/10000)^10000
## [1] 0.632139
Part g)
plot(1:10000, 1-(1-1/1:10000)^(1:10000))
The probability drops quickly and then levels off near the limit \(1 - e^{-1} \approx 63.2\%\)
Part h)
store <- rep(NA, 10000)
for (i in 1:10000)
store[i] <- sum(sample(1:100, replace=TRUE)==4) > 0  # TRUE if obs 4 is ever drawn
mean(store)
## [1] 0.6317
The fraction of the 10,000 bootstrap samples that contain the 4th observation is 0.6317, close to our predicted probability of 1-(1-1/100)^100 = 63.4%
EXERCISE 3:
Part a)
From page 181 in the text, the k-fold CV approach “involves randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k-1 folds. The mean squared error, MSE, is then computed on the observations in the held-out fold. This procedure is repeated k times.” Each repetition holds out a different fold, and the k resulting MSE estimates are averaged to give the k-fold CV estimate; a sketch of the procedure is shown below.
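For illustration only (not part of the original answer), here is a manual k-fold CV sketch on simulated data; the names sim, folds, and cv.mse are my own, and in practice cv.glm() from the boot package automates this.
set.seed(1)
n <- 100; k <- 5
sim <- data.frame(x=rnorm(n))
sim$y <- sim$x + rnorm(n)
folds <- sample(rep(1:k, length.out=n))  # randomly assign each row to a fold
cv.mse <- rep(NA, k)
for (j in 1:k) {
  fit <- lm(y ~ x, data=sim[folds != j, ])           # fit on the other k-1 folds
  pred <- predict(fit, newdata=sim[folds == j, ])    # predict the held-out fold
  cv.mse[j] <- mean((sim$y[folds == j] - pred)^2)    # MSE on the held-out fold
}
mean(cv.mse)  # k-fold CV estimate of the test MSE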
Part b)
EXERCISE 4:
We can use the bootstrap: repeatedly sample \(n\) observations with replacement from the dataset, refit the model on each bootstrap sample, and predict \(Y\) for the particular value of \(X\) from each refit. The standard deviation of these bootstrap predictions then estimates the standard deviation of our prediction, as sketched below.
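A minimal sketch of this idea on simulated data (dat, x0, and pred.fn are hypothetical names, not from the text):
require(boot)
set.seed(1)
dat <- data.frame(x=rnorm(50))   # hypothetical dataset
dat$y <- 1 + 2*dat$x + rnorm(50)
x0 <- data.frame(x=1)            # the particular value of X to predict at
pred.fn <- function(data, index) {
  fit <- lm(y ~ x, data=data, subset=index)  # refit on the bootstrap sample
  predict(fit, newdata=x0)                   # prediction at x0 from this refit
}
boot.out <- boot(dat, pred.fn, R=1000)  # 1000 bootstrap predictions
sd(boot.out$t)  # bootstrap estimate of the SD of the prediction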
EXERCISE 5:
Part a)
require(ISLR)
data(Default)
set.seed(1)
fit1 <- glm(default ~ income + balance, data=Default, family=binomial)
summary(fit1)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
Part b)
set.seed(1)
train <- sample(nrow(Default), nrow(Default)*0.5)
fit2 <- glm(default ~ income + balance, data=Default, family=binomial, subset=train)
prob2 <- predict(fit2, Default[-train,], type="response")
pred2 <- ifelse(prob2 > 0.5, "Yes", "No")
table(pred2, Default[-train,]$default)
##
## pred2 No Yes
## No 4805 115
## Yes 28 52
mean(Default[-train,]$default != pred2) # test error
## [1] 0.0286
Part c)
set.seed(2) # Repeat 1
train <- sample(nrow(Default), nrow(Default)*0.5)
fit2 <- glm(default ~ income + balance, data=Default, family=binomial, subset=train)
prob2 <- predict(fit2, Default[-train,], type="response")
pred2 <- ifelse(prob2 > 0.5, "Yes", "No")
mean(Default[-train,]$default != pred2) # test error
## [1] 0.0276
set.seed(3) # Repeat 2
train <- sample(nrow(Default), nrow(Default)*0.5)
fit2 <- glm(default ~ income + balance, data=Default, family=binomial, subset=train)
prob2 <- predict(fit2, Default[-train,], type="response")
pred2 <- ifelse(prob2 > 0.5, "Yes", "No")
mean(Default[-train,]$default != pred2) # test error
## [1] 0.0248
set.seed(4) # Repeat 3
train <- sample(nrow(Default), nrow(Default)*0.5)
fit2 <- glm(default ~ income + balance, data=Default, family=binomial, subset=train)
prob2 <- predict(fit2, Default[-train,], type="response")
pred2 <- ifelse(prob2 > 0.5, "Yes", "No")
mean(Default[-train,]$default != pred2) # test error
## [1] 0.0262
The test error estimates are fairly consistent across the three splits, ranging from about 2.5% to 2.9% (the variance is not large)
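The copy-pasted repeats above can be expressed more compactly as a loop over the same seeds (same computation, restructured for illustration):
errs <- sapply(2:4, function(s) {  # seeds used in the three repeats above
  set.seed(s)
  train <- sample(nrow(Default), nrow(Default)*0.5)
  fit <- glm(default ~ income + balance, data=Default, family=binomial, subset=train)
  prob <- predict(fit, Default[-train,], type="response")
  pred <- ifelse(prob > 0.5, "Yes", "No")
  mean(Default[-train,]$default != pred)  # validation-set error for this split
})
errs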
Part d)
set.seed(1)
train <- sample(nrow(Default), nrow(Default)*0.5)
fit3 <- glm(default ~ income + balance + student, data=Default, family=binomial, subset=train)
prob3 <- predict(fit3, Default[-train,], type="response")
pred3 <- ifelse(prob3 > 0.5, "Yes", "No")
mean(Default[-train,]$default != pred3) # test error
## [1] 0.0288
Test error with the student feature included (2.88%) is similar to the error without it (2.86%), so adding student does not give a significant reduction
EXERCISE 6:
Part a)
require(ISLR)
data(Default)
set.seed(1)
fit1 <- glm(default ~ income + balance, data=Default, family=binomial)
summary(fit1)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
Estimated standard errors are 4.985e-06 for income and 2.274e-04 for balance
Part b)
set.seed(1)
boot.fn <- function(df, trainid) {
return(coef(glm(default ~ income + balance, data=df, family=binomial, subset=trainid)))
}
boot.fn(Default, 1:nrow(Default)) # check match with summary
## (Intercept) income balance
## -1.154047e+01 2.080898e-05 5.647103e-03
Part c)
require(boot)
## Loading required package: boot
boot(Default, boot.fn, R=100)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 9.699111e-02 4.101121e-01
## t2* 2.080898e-05 6.715005e-08 4.127740e-06
## t3* 5.647103e-03 -5.733883e-05 2.105660e-04
Part d)
Standard error estimates are pretty close using the glm summary function versus the bootstrap with R=100:
income: 4.985e-06 with glm summary, 4.128e-06 using bootstrap
balance: 2.274e-04 with glm summary, 2.106e-04 using bootstrap
EXERCISE 7:
Part a)
require(ISLR)
data(Weekly)
set.seed(1)
fit1 <- glm(Direction ~ Lag1 + Lag2, data=Weekly, family=binomial)
summary(fit1)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.623 -1.261 1.001 1.083 1.506
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.22122 0.06147 3.599 0.000319 ***
## Lag1 -0.03872 0.02622 -1.477 0.139672
## Lag2 0.06025 0.02655 2.270 0.023232 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1488.2 on 1086 degrees of freedom
## AIC: 1494.2
##
## Number of Fisher Scoring iterations: 4
Part b)
set.seed(1)
fit2 <- glm(Direction ~ Lag1 + Lag2, data=Weekly, family=binomial, subset=2:nrow(Weekly))
summary(fit2)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = Weekly,
## subset = 2:nrow(Weekly))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6258 -1.2617 0.9999 1.0819 1.5071
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.22324 0.06150 3.630 0.000283 ***
## Lag1 -0.03843 0.02622 -1.466 0.142683
## Lag2 0.06085 0.02656 2.291 0.021971 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1494.6 on 1087 degrees of freedom
## Residual deviance: 1486.5 on 1085 degrees of freedom
## AIC: 1492.5
##
## Number of Fisher Scoring iterations: 4
Part c)
ifelse(predict(fit2, Weekly[1,], type="response")>0.5, "Up", "Down")
## 1
## "Up"
Weekly[1,]$Direction
## [1] Down
## Levels: Down Up
The first observation was incorrectly classified (predicted Up, actually Down)
Part d)
set.seed(1)
loocv.err <- rep(0,nrow(Weekly))
for (i in 1:nrow(Weekly)) {
myfit <- glm(Direction ~ Lag1 + Lag2, data=Weekly[-i,], family=binomial)
mypred <- ifelse(predict(myfit, Weekly[i,], type="response")>0.5, "Up", "Down")  # predict held-out obs i
loocv.err[i] <- ifelse(Weekly[i,]$Direction==mypred, 0, 1)
}
sum(loocv.err)  # number of LOOCV misclassifications out of 1089
## [1] 490
Part e)
mean(loocv.err)
## [1] 0.4499541
Estimated test error with LOOCV is 45.0%
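As a cross-check (not in the original solution), the same LOOCV estimate can be computed with cv.glm() from the boot package, using the misclassification cost function shown on its help page; omitting K gives K = n, so this refits 1,089 models and is slow:
cost <- function(r, pi) mean(abs(r - pi) > 0.5)  # r = observed 0/1, pi = fitted probability
cv.glm(Weekly, fit1, cost)$delta[1]  # LOOCV misclassification rate; should be about 0.45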
EXERCISE 8:
Part a)
set.seed(1)
y <- rnorm(100) # part of the code given in the text; it only advances the RNG state before x is drawn
x <- rnorm(100)
y <- x - 2*x^2 + rnorm(100)
\(Y = X - 2X^2 + \epsilon\)
\(n = 100\) observations
\(p = 2\) features
Part b)
plot(x, y)
The scatterplot shows a clear quadratic, inverted-U relationship between X and Y
Part c)
set.seed(1)
df <- data.frame(y, x, x2=x^2, x3=x^3, x4=x^4)
fit1 <- glm(y ~ x, data=df)
cv.err1 <- cv.glm(df, fit1)
cv.err1$delta
## [1] 5.890979 5.888812
fit2 <- glm(y ~ x + x2, data=df)
cv.err2 <- cv.glm(df, fit2)
cv.err2$delta
## [1] 1.086596 1.086326
fit3 <- glm(y ~ x + x2 + x3, data=df)
cv.err3 <- cv.glm(df, fit3)
cv.err3$delta
## [1] 1.102585 1.102227
fit4 <- glm(y ~ x + x2 + x3 + x4, data=df)
cv.err4 <- cv.glm(df, fit4)
cv.err4$delta
## [1] 1.114772 1.114334
Part d)
set.seed(2)
df <- data.frame(y, x, x2=x^2, x3=x^3, x4=x^4)
fit1 <- glm(y ~ x, data=df)
cv.err1 <- cv.glm(df, fit1)
cv.err1$delta
## [1] 5.890979 5.888812
fit2 <- glm(y ~ x + x2, data=df)
cv.err2 <- cv.glm(df, fit2)
cv.err2$delta
## [1] 1.086596 1.086326
fit3 <- glm(y ~ x + x2 + x3, data=df)
cv.err3 <- cv.glm(df, fit3)
cv.err3$delta
## [1] 1.102585 1.102227
fit4 <- glm(y ~ x + x2 + x3 + x4, data=df)
cv.err4 <- cv.glm(df, fit4)
cv.err4$delta
## [1] 1.114772 1.114334
Results are exactly the same because LOOCV predicts each observation using all of the others; there is no random splitting involved
Part e)
The quadratic model using \(X\) and \(X^2\) had the lowest LOOCV error. This makes sense because the true relationship between \(Y\) and \(X\) is quadratic
Part f)
fit0 <- lm(y ~ poly(x,4))
summary(fit0)
##
## Call:
## lm(formula = y ~ poly(x, 4))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8914 -0.5244 0.0749 0.5932 2.7796
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.8277 0.1041 -17.549 <2e-16 ***
## poly(x, 4)1 2.3164 1.0415 2.224 0.0285 *
## poly(x, 4)2 -21.0586 1.0415 -20.220 <2e-16 ***
## poly(x, 4)3 -0.3048 1.0415 -0.293 0.7704
## poly(x, 4)4 -0.4926 1.0415 -0.473 0.6373
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.041 on 95 degrees of freedom
## Multiple R-squared: 0.8134, Adjusted R-squared: 0.8055
## F-statistic: 103.5 on 4 and 95 DF, p-value: < 2.2e-16
Summary shows that only \(X\) and \(X^2\) are statistically significant predictors. This agrees with the LOOCV results that indicate using only \(X\) and \(X^2\) produces the best model.
EXERCISE 9:
Part a)
require(MASS)
require(boot)
data(Boston)
(medv.mu <- mean(Boston$medv))
## [1] 22.53281
Part b)
(medv.sd <- sd(Boston$medv)/sqrt(nrow(Boston)))  # standard error of the sample mean
## [1] 0.4088611
Part c)
set.seed(1)
mean.fn <- function(var, id) {
return(mean(var[id]))
}
(boot.res <- boot(Boston$medv, mean.fn, R=100))
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = mean.fn, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 -0.04836957 0.3815554
The bootstrap estimate of the standard error with R=100 is 0.38, reasonably close to the formula-based estimate of 0.41 from Part b)
Part d)
boot.res$t0 - 2*sd(boot.res$t) # lower bound
## [1] 21.7697
boot.res$t0 + 2*sd(boot.res$t) # upper bound
## [1] 23.29592
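Alternatively (not in the original solution), boot.ci() computes intervals directly from the bootstrap object; type="norm" requests a normal-approximation interval, and other types such as "perc" are available:
boot.ci(boot.res, type="norm")  # normal-approximation bootstrap CI for the mean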
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
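The rough two-standard-error bootstrap interval [21.77, 23.30] is very close to the t.test() interval [21.73, 23.34]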
Part e)
(medv.median <- median(Boston$medv))
## [1] 21.2
Part f)
set.seed(1)
median.fn <- function(var, id) {
return(median(var[id]))
}
(boot.res <- boot(Boston$medv, median.fn, R=100))
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = median.fn, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.026 0.3683488
Estimated standard error of the median is 0.368
Part g)
(medv.mu10 <- quantile(Boston$medv, 0.1))
## 10%
## 12.75
Part h)
set.seed(1)
quantile10.fn <- function(var, id) {
return(quantile(var[id], 0.1))
}
(boot.res <- boot(Boston$medv, quantile10.fn, R=100))
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = quantile10.fn, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 -0.0985 0.4988094
Estimated standard error of the 10th percentile is 0.499