Integrated Modelling and Validation

The bootstrap is such a versatile technique that it has found application in many different areas of science. This has led to a large number of R packages implementing some form of the bootstrap: at the time of writing, the CRAN package list already contains four other packages between the packages boot and bootstrap mentioned earlier. To name just a couple of examples: package FRB contains functions for applying bootstrapping in robust statistics; DAIM provides functions for error estimation, including the 0.632 and 0.632+ estimators. Using EffectiveDose it is possible to estimate the effects of a drug, and in particular to determine the effective dose level; bootstrapping is provided for the calculation of confidence intervals. Packages meboot and BootPR provide machinery for the application of bootstrapping in time series.

 9.7 Integrated Modelling and Validation 

Obtaining a good multivariate statistical model is hardly ever a matter of just loading the data and pushing a button: rather, it is a long and sometimes seemingly endless iteration of visualization, data treatment, modelling and validation. Since these aspects are so intertwined, it seems to make sense to develop methods that combine them in some way. In this section, we consider approaches that combine elements of model fitting with validation. The first case is bagging (Breiman 1996), where many models are fitted on bootstrap sets, and predictions are given by the average of the predictions of these models. At the same time, the out-of-bag samples can be used for obtaining an unbiased error estimate. Bagging is applicable to all classification and regression methods, but will give benefits only in certain cases; the classical example where it works well is given by trees (Breiman 1996), as we will see below. An extension of bagging, also applied to trees, is the technique of random forests (Breiman 2001). Finally, we will look at boosting (Freund and Schapire 1997), an iterative method for binary classification giving progressively more weight to misclassified samples. Bagging and boosting can be seen as meta-algorithms, because they consist of strategies that, in principle at least, can be combined with any model-fitting algorithm.

 9.7.1 Bagging 

The central idea behind bagging is simple: if you have a classifier (or a method for predicting continuous variables) that on average gives good predictions but has a somewhat high variability, it makes sense to average the predictions over a large number of applications of this classifier. The problem is how to do this in a sensible way: just repeating the same fit on the same data will not help. Breiman proposed to use bootstrapping to generate the variability that is needed. Training a classifier on each individual bootstrap set leads to an ensemble of models; combining the predictions of these models would then, in principle, be closer to the true answer. This combination of bootstrapping and aggregating is called bagging (Breiman 1996). The package ipred implements bagging for classification, regression and survival analysis using trees; the rpart implementation is employed. For classification applications, the combination of bagging with kNN is also implemented (in function ipredknn). We will focus here on bagging trees. The basic function is ipredbagg, while the function bagging provides the same functionality using a formula interface. Making a model for predicting the octane number for the gasoline data is very easy:

> (gasoline.bagging <- ipredbagg(gasoline$octane[gas.odd],
+                                gasoline$NIR[gas.odd, ],
+                                coob = TRUE))
Bagging regression trees with 25 bootstrap replications

Out-of-bag estimate of root mean squared error: 0.9181
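To make explicit what happens behind the scenes, the following is a minimal hand-rolled sketch of bagging with rpart trees (an illustration under assumptions, not the ipred implementation). It assumes the gasoline data and the gas.odd and gas.even index vectors used throughout the book; the data frame gas.df, the seed and the number of trees are introduced here purely for illustration.

## illustrative sketch of bagging by hand: fit one rpart tree per
## bootstrap sample and average the predictions of the ensemble
library(rpart)

## flatten the NIR matrix into ordinary columns so rpart's formula interface can use it
gas.df <- data.frame(octane = gasoline$octane,
                     as.data.frame(unclass(gasoline$NIR)))

set.seed(7)                                      # arbitrary, for reproducibility
nbagg <- 25
models <- lapply(seq_len(nbagg), function(b) {
  idx <- sample(gas.odd, replace = TRUE)         # bootstrap sample of training rows
  rpart(octane ~ ., data = gas.df[idx, ])        # one tree per bootstrap set
})

## aggregate: average the trees' predictions for the test samples
preds  <- sapply(models, predict, newdata = gas.df[gas.even, ])
bagged <- rowMeans(preds)
sqrt(mean((bagged - gasoline$octane[gas.even])^2))   # RMSEP of the bagged ensemble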

The OOB error is quite high. Predictions for the even-numbered samples can be obtained by the usual predict function:

> gs.baggpreds <-
+   predict(gasoline.bagging, gasoline$NIR[gas.even, ])
> resids <- gs.baggpreds - gasoline$octane[gas.even]
> sqrt(mean(resids^2))
[1] 1.6738

This is not a very good result. Nevertheless, one should keep in mind that default settings are often suboptimal and some tweaking may lead to substantial improvements.
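As one illustration of such tweaking (a sketch only, not a recipe from the text): ipredbagg accepts an nbagg argument for the number of bootstrap replications and a control argument that is passed on to rpart; the particular values below are arbitrary and would need to be tuned, for instance via the OOB error.

## hypothetical tuning example: more trees and different tree-growing settings
library(ipred)
library(rpart)

gasoline.bagging2 <- ipredbagg(gasoline$octane[gas.odd],
                               gasoline$NIR[gas.odd, ],
                               nbagg = 100,                       # default is 25
                               control = rpart.control(minsplit = 5, cp = 0.001),
                               coob = TRUE)
gasoline.bagging2      # compare the OOB error with the default fit above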

Doing classification with bagging is equally simple. Here, we show the example of discriminating between the control and pca classes of the prostate data, again using only the first 1000 variables as we did in Sect. 7.1.6.1:

> prost.bagging <- bagging(type ~ ., data = prost.df,
+                          subset = prost.odd)
> prost.baggingpred <- predict(prost.bagging,
+                              newdata = prost.df[prost.even, ])
> table(prost.type[prost.even], prost.baggingpred)
         prost.baggingpred
          control pca
  control      30  10
  pca           4  80

which doubles the number of misclassifications compared to the SVM solution in Sect. 7.4.1 but still is a lot better than the single-tree result.
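For completeness, the misclassification rate corresponding to this table is easily computed; with the counts shown above it amounts to 14 of 124 test samples, roughly 11%.

## error rate from the confusion matrix shown above
prost.tab <- table(prost.type[prost.even], prost.baggingpred)
1 - sum(diag(prost.tab)) / sum(prost.tab)    # fraction of misclassified test samples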

So when does bagging improve things? Clearly, when a classification or regression procedure changes very little with different bootstrap samples, the result will be the same as the original predictions. It can be shown (Breiman 1996) that bagging is especially useful for predictors that are unstable, i.e., predictors that are highly adaptive to the composition of the data set. Examples are trees, neural networks (Hastie et al. 2001) or variable selection methods.
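One way to see this instability for yourself (again only a sketch) is to fit single, unbagged trees on a handful of bootstrap sets and inspect how much their test-set predictions vary; the code below reuses the gas.df data frame constructed in the hand-rolled bagging example earlier in this section.

## illustrative check of tree instability across bootstrap samples
library(rpart)

single.preds <- sapply(1:10, function(b) {
  idx <- sample(gas.odd, replace = TRUE)           # a fresh bootstrap sample
  tr  <- rpart(octane ~ ., data = gas.df[idx, ])   # a single unbagged tree
  predict(tr, newdata = gas.df[gas.even, ])
})
summary(apply(single.preds, 1, sd))   # per-sample spread of predictions across trees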
