Confidence Intervals for Regression Coefficients
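It should now be clear what the philosophy behind the 0.632 estimator is. What it estimates, in fact, is the amount of optimism associated with the apparent error MSEC (the square of RMSEC), $\hat{\omega}_{0.632}$:

$$\hat{\omega}_{0.632} = 0.632\,(\bar{\varepsilon}_B - \mathrm{MSEC}) \tag{9.17}$$

The original estimate is then corrected for this optimism:

$$\hat{\varepsilon}_{0.632} = \mathrm{MSEC} + \hat{\omega}_{0.632} \tag{9.18}$$

Substituting Eq. 9.17 into Eq. 9.18 gives $\hat{\varepsilon}_{0.632} = 0.368\,\mathrm{MSEC} + 0.632\,\bar{\varepsilon}_B$, which is Eq. 9.16.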
Several R packages are available that contain functions for bootstrapping. Perhaps the two best known are bootstrap, associated with Efron and Tibshirani (1993), and boot, written by Angelo Canty and implementing functions from Davison and Hinkley (1997). The former is a relatively simple package, maintained mostly to support Efron and Tibshirani (1993); boot, a recommended package, is the primary general implementation of bootstrapping in R. The implementation of the 0.632 estimator using boot takes a couple of steps (Davison and Hinkley 1997, p. 324). First, the bootstrap samples are generated, returning the statistic to be bootstrapped, in this case the prediction errors (see footnote 4):

    > gas.pcr.boot632 <-
    +   boot(gasoline,
    +        function(x, ind) {
    +          mod <- pcr(octane ~ ., data = x,
    +                     subset = ind, ncomp = 4)
    +          gasoline$octane -
    +            predict(mod, newdata = gasoline$NIR, ncomp = 4)
    +        },
    +        R = 499)
The optimism is assessed by considering only the errors of the out-of-bag (OOB) samples. For every bootstrap sample, the boot.array function shows which objects it contains, and how often:

    > dim(boot.array(gas.pcr.boot632))
    [1] 499 60
    > boot.array(gas.pcr.boot632)[1, 1:10]
    [1] 0 1 0 1 2 1 0 1 2 1
Just as when we did the resampling ourselves, some objects are absent from a given bootstrap sample (the example above shows the first ten objects of the first sample), and others are present multiple times. Averaging the squared errors of the OOB objects, and combining the result with the apparent error, leads to the 0.632 estimate:

    > in.bag <- boot.array(gas.pcr.boot632)
    > oob.error <- mean((gas.pcr.boot632$t^2)[in.bag == 0])
    > app.error <- MSEP(pcr(octane ~ ., data = gasoline, ncomp = 4),
    +                   ncomp = 4, intercept = FALSE)
    > sqrt(.368 * c(app.error$val) + .632 * oob.error)
    [1] 0.26572

[Footnote 4: In Davison and Hinkley (1997) and the corresponding boot package, the number of bootstrap samples is typically a number like 499 or 999; the original sample is then added to the bootstrap set. Most other implementations use 500 or 1000. The differences are not very important in practice.]
This error estimate is very similar to the four-fold crossvalidation result in Sect. 9.4.2 (0.2468). Note that it is not exactly equal to the 0.632 estimate in Sect. 9.6.1 (0.26737) because different bootstrap samples have been selected, but again the difference is small.
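As a side note on the out-of-bag objects used above, the following base-R sketch mimics the layout of the boot.array output without needing the boot package or the gasoline data. It is an illustration only: the seed and the names freq and oob.frac are assumptions, not part of the original analysis. It shows that roughly 36.8% of the objects are out-of-bag in a bootstrap sample, which is where the weights 0.368 and 0.632 come from.

```r
# Illustrative sketch (not from the text): in-bag frequencies with base R only
set.seed(1)   # arbitrary seed, for reproducibility
n <- 60       # same number of objects as the gasoline data
R <- 499      # number of bootstrap samples, as in the text

# Each row holds the in-bag frequencies of the n objects for one bootstrap
# sample, i.e. the same layout as the matrix returned by boot.array()
freq <- t(replicate(R, tabulate(sample(n, n, replace = TRUE), nbins = n)))

dim(freq)                     # R rows, n columns
oob.frac <- mean(freq == 0)   # overall fraction of out-of-bag entries
oob.frac                      # close to (1 - 1/n)^n
(1 - 1/n)^n                   # tends to exp(-1), about 0.368, for large n
```

An object is left out of one bootstrap sample with probability (1 - 1/n)^n, so on average about 63.2% of the distinct objects are in-bag.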
9.6.2 Confidence Intervals for Regression Coefficients
The bootstrap may also be used to assess the variability of a statistic such as an error estimate. A particularly important application in chemometrics is the standard error of a regression coefficient from a PCR or PLS model. Alternatively, confidence intervals can be built for the regression coefficients. No analytical solutions such as those for MLR exist in these cases; nevertheless, we would like to be able to say something about which coefficients are actually contributing to the regression model.
Typically, for an interval estimate such as a confidence interval, more bootstrap samples are needed than for a point estimate, such as an error estimate. Several hundred bootstrap samples are taken to be sufficient for point estimates; several thousand for confidence intervals. Taking smaller numbers may drastically increase the variability of the estimates, and with the current abundance of computing power there is rarely a case for being too economical.
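The effect of the number of bootstrap samples on an interval limit can be illustrated with a small simulation. The sketch below is not taken from the text: the data are simulated, the statistic is a simple mean rather than a regression coefficient, and the helper upper.limit is a hypothetical name. It repeats the bootstrap 50 times for B = 200 and for B = 2000 and compares how much the upper 97.5% point of the bootstrap distribution varies between repetitions.

```r
# Illustrative sketch on simulated data (assumed setup, not the gasoline data)
set.seed(123)
x <- rexp(30)   # hypothetical sample of 30 observations

# Upper 97.5% point of the bootstrap distribution of the mean, using B samples
upper.limit <- function(B) {
  means <- replicate(B, mean(sample(x, replace = TRUE)))
  unname(quantile(means, 0.975))
}

reps <- 50      # number of times the whole bootstrap is repeated
sd.small <- sd(replicate(reps, upper.limit(200)))
sd.large <- sd(replicate(reps, upper.limit(2000)))

# Spread of the interval limit across repetitions, for each choice of B
c(B200 = sd.small, B2000 = sd.large)
```

With the smaller B, the limit should vary noticeably more from one run to the next, which is the Monte Carlo variability the paragraph above warns about.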
The simplest possible approach is the percentile method: estimate the models for B bootstrap samples and, for each coefficient, use the ordered bootstrap values at positions Bα/2 and B(1 − α/2) as the limits of the (1 − α) confidence interval. For the gasoline data, modelled with PCR using four PCs, the bootstrap regression coefficients are obtained by:

    > B <- 1000
    > ngas <- nrow(gasoline)
    > boot.indices <-
    +   matrix(sample(1:ngas, ngas * B, replace = TRUE), ncol = B)
    > npc <- 4
    > gas.pcr <- pcr(octane ~ ., data = gasoline, ncomp = npc)
    > coefs <- matrix(0, ncol(gasoline$NIR), B)
    > for (i in 1:B) {
    +   gas.bootpcr <- pcr(octane ~ ., data = gasoline,
    +                      ncomp = npc, subset = boot.indices[, i])
    +   coefs[, i] <- c(coef(gas.bootpcr))
    + }
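The percentile recipe itself can be tried on a toy problem that needs no extra packages. The following is a minimal sketch under stated assumptions: the data are simulated, an ordinary lm fit stands in for PCR, and all variable names (dat, slopes, ci, fit) are illustrative rather than part of the original analysis.

```r
# Illustrative sketch on simulated data (not the gasoline data)
set.seed(42)
n <- 50
x <- rnorm(n)
y <- 2 * x + rnorm(n)   # slope of 2 chosen arbitrarily for the simulation
dat <- data.frame(x = x, y = y)
fit <- unname(coef(lm(y ~ x, data = dat))[2])   # slope on the full data

B <- 2000       # several thousand samples, as advised for intervals
alpha <- 0.05
slopes <- numeric(B)
for (i in 1:B) {
  idx <- sample(n, n, replace = TRUE)           # one bootstrap sample
  slopes[i] <- coef(lm(y ~ x, data = dat[idx, ]))[2]
}

# Percentile limits: the ordered values at positions B * alpha/2 and
# B * (1 - alpha/2); round() guards against floating-point index errors
pos <- round(B * c(alpha / 2, 1 - alpha / 2))
ci <- sort(slopes)[pos]
ci
```

Sorting and indexing makes the definition explicit; for many coefficients at once, applying quantile over the rows of a coefficient matrix does the same job more compactly.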
A plot of the area covered by the regression coefficients of all bootstrap samples is shown in Fig. 9.5:

    > matplot(wavelengths, coefs, type = "n",
    +         ylab = "Coefficients", xlab = "Wavelength (nm)")
    > abline(h = 0, col = "gray")
    > polygon(c(wavelengths, rev(wavelengths)),
    +         c(apply(coefs, 1, max), rev(apply(coefs, 1, min))),
    +         col = "steelblue", border = NA)

Fig. 9.5 Regression coefficients from all 1000 bootstrap samples for the gasoline data, using PCR with four latent variables
Some of the wavelengths show considerable variation in their regression coefficients, especially the longer wavelengths above 1650 nm.
In the percentile method using 1000 bootstrap samples, the 95% confidence intervals are given by the 25th and 975th ordered values of each coefficient (the quantile function used here interpolates between neighbouring ordered values, which makes no practical difference):

    > coef.stats <- cbind(apply(coefs, 1, quantile, .025),
    +                     apply(coefs, 1, quantile, .975))
    > matplot(wavelengths, coef.stats, type = "n",
    +         xlab = "Wavelength (nm)",
    +         ylab = "Regression coefficient")
    > abline(h = 0, col = "gray")
    > polygon(c(wavelengths, rev(wavelengths)),
    +         c(coef.stats[, 1], rev(coef.stats[, 2])),
    +         col = "pink", border = NA)
    > lines(wavelengths, c(coef(gas.pcr)))
The corresponding plot is shown in Fig. 9.6. Since the percentile strategy removes the most extreme values, these CIs are narrower than the area covered by the bootstrap coefficients in Fig. 9.5. Clearly, for most coefficients, zero is not in the confidence interval. A clear exception is seen at the longer wavelengths: there, the confidence intervals are very wide, indicating that this region contains very little relevant information.
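Whether zero lies inside an interval can also be checked numerically rather than read off the plot. The sketch below uses a small simulated coefficient matrix with the same layout as coefs (one row per coefficient, one column per bootstrap sample). The matrix sim.coefs, the values in true.means and the name excludes.zero are assumptions made for illustration only.

```r
# Illustrative sketch with simulated bootstrap coefficients
set.seed(7)
p <- 5      # number of coefficients (rows)
B <- 1000   # number of bootstrap samples (columns)
true.means <- c(-2, -0.5, 0, 0.5, 2)   # hypothetical coefficient values

# Row j is filled with draws centred on true.means[j]
sim.coefs <- matrix(rnorm(p * B, mean = true.means, sd = 0.5), nrow = p)

# 95% percentile limits per coefficient: one row per coefficient,
# columns for the lower and upper limit
lims <- t(apply(sim.coefs, 1, quantile, probs = c(.025, .975)))

# TRUE where the interval lies entirely above or entirely below zero
excludes.zero <- unname(lims[, 1] > 0 | lims[, 2] < 0)
excludes.zero
```

Applied to coef.stats from the gasoline analysis, the same comparison of the two columns with zero would flag the wavelengths whose coefficients are distinguishable from zero at this level.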